Computing and Information Systems - Theses

Search Results

Now showing 1 - 2 of 2
  • Item
    Anomaly detection in data streams: challenges and techniques
    Salehi, Mahsa (2015)
Anomaly detection in data streams plays a vital role in on-line data mining applications, such as network intrusion detection, environmental monitoring and road traffic analysis. However, there are significant challenges with anomaly detection in streaming environments, and in this thesis we propose effective and efficient techniques to address these challenges. A major challenge for anomaly detection in these applications is the dynamically changing nature of the monitoring environments. This causes a problem for traditional anomaly detection techniques, which assume a relatively static monitoring environment and hence construct a static model of normal behaviour as the basis for anomaly detection. In an environment that changes intermittently (known as a switching data stream), such an approach can yield a high false positive rate. To cope with the challenge of dynamic environments, we require an approach that can learn from the history of normal behaviour in a data stream, while accounting for the fact that not all time periods in the past are equally relevant. Consequently, to address this problem, we first propose a relevance-weighted ensemble model for learning normal behaviour, which forms the basis of our anomaly detection scheme. The second challenge for anomaly detection in data streams is the high rate of incoming observations. Traditional anomaly detection approaches require multiple passes over a dataset, so they are not applicable to data streams: processing each observation multiple times is not feasible given the unbounded amount of data generated at a high rate. The advantage of our proposed relevance-weighted ensemble model is that it can improve detection accuracy by making use of relevant history while remaining computationally efficient, thereby addressing both major challenges in data streams.
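The relevance-weighted ensemble idea can be sketched in a few lines: summarise each past window of the stream with a simple model, weight the models by their relevance to the current window, and score new observations against the weighted ensemble. The sketch below is a minimal illustration only; the centroid models, exponential relevance function and synthetic data are assumptions, not the thesis's exact formulation.

```python
import numpy as np

def window_model(window):
    """Summarise one past window of normal behaviour as a centroid."""
    return np.mean(window, axis=0)

def relevance_weights(models, current_window):
    """Weight each historical model by its closeness to the current window."""
    centre = np.mean(current_window, axis=0)
    dists = np.array([np.linalg.norm(m - centre) for m in models])
    w = np.exp(-dists)  # more relevant history -> higher weight
    return w / w.sum()

def anomaly_score(x, models, weights):
    """Relevance-weighted distance of a new observation to the ensemble."""
    return float(sum(w * np.linalg.norm(x - m) for m, w in zip(models, weights)))

rng = np.random.default_rng(0)
# Three past windows: the stream "switches" away to 5.0 and back (hypothetical data).
past_windows = [rng.normal(loc, 0.1, size=(50, 2)) for loc in (0.0, 5.0, 0.1)]
models = [window_model(w) for w in past_windows]
current = rng.normal(0.0, 0.1, size=(20, 2))
weights = relevance_weights(models, current)

normal_point = np.array([0.0, 0.0])
odd_point = np.array([10.0, 10.0])
print(anomaly_score(normal_point, models, weights) <
      anomaly_score(odd_point, models, weights))  # the odd point scores higher
```

Because the switched-away window receives a near-zero weight, its model barely influences the score, which is how the ensemble avoids false positives after the stream switches back.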
We then propose two ensemble-based approaches for anomaly detection, called Biased SubSampling (BSS) and Diversity-based Biased SubSampling (DBSS), which improve the detection accuracy of each ensemble detector on one hand and induce diversity among the detectors on the other, both in an unsupervised manner. We discuss the effectiveness of our approaches in terms of the bias-variance trade-off. Such an approach is effective at improving the accuracy of outlier detection and can potentially be used on streaming data. With the growing need to analyse high-speed data streams, the task of anomaly detection becomes even more challenging, as traditional anomaly detection techniques can no longer assume that all the data can be stored for processing. This motivates our third major challenge in anomaly detection in data streams, namely the unbounded quantity of data points. To address this challenge we propose a memory-efficient incremental local outlier detection algorithm for data streams called MiLOF, and a more flexible version called MiLOF_F. Both achieve accuracy close to that of the incremental Local Outlier Factor (iLOF) algorithm, a well-known density-based outlier detection algorithm, but within a fixed memory bound and with lower time complexity. Hence, our proposed methods are well suited to application environments with limited memory (e.g., wireless sensor networks), and can be applied to high-volume data streams. In addition, MiLOF_F is robust to changes in the number of data points, the number of underlying clusters and the number of dimensions in the data stream. While ensemble techniques have been widely used in supervised learning tasks, all of the approaches in this thesis make novel use of ensemble techniques and clustering models for unsupervised anomaly detection in data streams, which is the setting in many real applications.
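The biased-subsampling idea can be illustrated with a small sketch: each ensemble member is built on a subsample that favours likely-normal points, and independent random draws provide diversity across members. The preliminary k-NN outlier score and the inverse-score sampling probabilities below are illustrative assumptions, not the BSS/DBSS algorithms themselves.

```python
import numpy as np

def knn_distance(x, data, k=5):
    """Average distance to the k nearest points in data (a crude outlier score)."""
    d = np.sort(np.linalg.norm(data - x, axis=1))
    return d[1:k + 1].mean()  # skip d[0], the distance to x itself if present

def biased_subsample(data, rng, size):
    """Sample points with probability biased toward likely-normal records."""
    prelim = np.array([knn_distance(x, data) for x in data])
    p = 1.0 / (prelim + 1e-9)  # inliers get higher sampling probability
    p /= p.sum()
    idx = rng.choice(len(data), size=size, replace=False, p=p)
    return data[idx]

def ensemble_score(x, data, rng, members=10, size=30):
    """Average outlier score of x over independently drawn biased subsamples."""
    return float(np.mean([knn_distance(x, biased_subsample(data, rng, size))
                          for _ in range(members)]))

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), [[8.0, 8.0]]])  # one planted outlier

inlier_score = ensemble_score(np.array([0.0, 0.0]), data, rng)
outlier_score = ensemble_score(np.array([8.0, 8.0]), data, rng)
print(inlier_score < outlier_score)
```

Biasing each subsample toward normal points reduces the bias of each member's model of normality, while the independent draws reduce variance when members are averaged, which is the bias-variance trade-off the abstract refers to.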
Finally, the thesis concludes with a case study of detecting driver distraction in car racing using EEG brain signals of drivers, as an example of a potential application for anomaly detection in data streams.
  • Item
    Anomaly detection in participatory sensing networks
    Monazam Erfani, Sarah (2015)
Anomaly detection, or outlier detection, aims to identify unusual values in a given dataset. In particular, there is growing interest in collaborative anomaly detection, where multiple data sources submit their data to an online data mining service in order to detect anomalies with respect to the wider population. By combining data from multiple sources, collaborative anomaly detection aims to improve detection accuracy through the construction of a more robust model of normal behaviour. Cloud-based collaborative architectures such as Participatory Sensing Networks (PSNs) provide an open distributed platform that enables participants to share and analyse their local data on a large scale. Two major issues with collaborative anomaly detection are how to ensure the privacy of participants’ data, and how to efficiently analyse the large-scale, high-dimensional data collected in these networks. The first problem we address is the issue of data privacy in PSNs. We introduce a framework for privacy-preserving collaborative anomaly detection with efficient local data perturbation at participating nodes, and global processing of the perturbed records at a data mining server. The data perturbation scheme that we propose enables participants to perturb their data independently, without requiring the cooperation of other parties. As a result, our privacy-preservation approach is scalable to large numbers of participants and is computationally efficient. By collecting the participants’ data, the PSN server can generate a global anomaly detection model from the locally perturbed records. The global model identifies interesting measurements or unusual patterns in participants’ data without revealing the true values of the measurements. In terms of privacy, the proposed scheme thwarts several major types of attack, namely Independent Component Analysis (ICA), distance-inference, Maximum a Posteriori (MAP) and collusion attacks.
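The key property of the perturbation scheme — each participant perturbs locally and independently, so the server never sees true values — can be sketched as follows. The tanh squashing and the Gaussian random matrix are placeholder transforms standing in for the thesis's actual nonlinear transformation and random linear transformation; the participant names and seeds are hypothetical.

```python
import numpy as np

def make_perturber(seed, dim):
    """Each participant builds a private perturbation locally, with no cooperation."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(dim, dim))  # participant's private random linear transform
    return lambda record: R @ np.tanh(record)  # nonlinear step, then linear step

alice = make_perturber(seed=42, dim=3)  # hypothetical participants with private seeds
bob = make_perturber(seed=7, dim=3)

reading = np.array([0.2, 0.5, 0.1])  # the same true measurement at both nodes
alice_sent = alice(reading)
bob_sent = bob(reading)

# Identical raw records look unrelated after perturbation, so the server
# receives usable records without learning the true measurement values.
print(np.allclose(alice_sent, bob_sent))
```

Because each node's transform is fixed but private, the same node perturbs the same record consistently, which is what lets the server fit a global model over the perturbed stream.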
We further improve the privacy of our data perturbation scheme by: (i) redesigning the nonlinear transformation to better defend against MAP estimation attacks for normal and anomalous records, and (ii) supporting individual random linear transformations for each participant, in order to give the system greater resistance to malicious collusion. A notable advantage of our perturbation scheme is that it preserves participants’ privacy while achieving accuracy comparable to non-privacy-preserving anomaly detection techniques. The second problem we address in this thesis is how to model and interpret the large volumes of high-dimensional data generated in participatory domains, using One-class Support Vector Machines (1SVMs). While 1SVMs are effective at producing decision surfaces for anomaly detection from well-behaved feature vectors, they can be inefficient at modelling the variation in large, high-dimensional datasets. We overcome this challenge with two different approaches. The first is an unsupervised hybrid architecture in which a Deep Belief Network (DBN) extracts generic underlying features, in combination with a 1SVM trained on the features learned by the DBN. DBNs have important advantages as feature detectors for anomaly detection, as they use unlabelled data to capture higher-order correlations among features. Furthermore, using a DBN to reduce the number of irrelevant and redundant features improves the scalability of a 1SVM for use with large training datasets containing high-dimensional records. Our hybrid approach is able to generate an accurate anomaly detection model with lower computational and memory complexity than a 1SVM on its own. Alternatively, to overcome the shortcomings of 1SVMs in processing high-dimensional datasets, in our second approach we calculate a lower-rank approximation of the optimisation problem that underlies the 1SVM training task.
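The hybrid architecture can be sketched with off-the-shelf components: an unsupervised feature learner feeding a 1SVM. In this sketch a single `BernoulliRBM` layer stands in for a full Deep Belief Network, and the data, layer sizes and parameters are arbitrary illustrative choices, not the thesis's configuration.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.uniform(0.4, 0.6, size=(200, 20))  # synthetic "normal" records in [0, 1]

# Unsupervised feature learning: compress 20 raw dimensions to 5 features
# without labels (an RBM layer standing in for a DBN).
rbm = BernoulliRBM(n_components=5, n_iter=20, random_state=0)
features = rbm.fit_transform(train)

# One-class SVM trained on the learned features rather than the raw records,
# so it optimises over 5 dimensions instead of 20.
ocsvm = OneClassSVM(nu=0.1, gamma="scale").fit(features)

new_record = rng.uniform(0.4, 0.6, size=(1, 20))
score = ocsvm.decision_function(rbm.transform(new_record))
print(features.shape, score)
```

The scalability gain comes from the dimensionality reduction: the 1SVM's kernel computations operate on the compact learned features, not the raw high-dimensional records.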
Instead of performing the optimisation in a high-dimensional space, it is conducted in a space of reduced dimension but over a larger neighbourhood. We leverage the theory of nonlinear random projections and propose the Reduced 1SVM (R1SVM), an efficient and scalable anomaly detection technique that can be trained on large-scale datasets. The main objective of R1SVM is to replace a nonlinear machine with randomised features followed by a linear machine. In summary, we have proposed efficient privacy-preserving anomaly detection approaches for PSNs, and scalable data modelling approaches for high-dimensional datasets, which lower the computational and memory complexity compared to traditional anomaly detection techniques. We have shown that the proposed methods achieve higher or comparable accuracy in detecting anomalies compared to existing state-of-the-art techniques.
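The "randomised features plus a linear machine" idea can be sketched with random Fourier features, one standard form of nonlinear random projection: project the data through a randomised nonlinear map that approximates an RBF kernel, then train a purely linear one-class SVM in the projected space. The data, `gamma` and component count below are illustrative assumptions, not the R1SVM construction from the thesis.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(300, 20))  # synthetic "normal" data

# Nonlinear random projection: random Fourier features approximating an RBF kernel.
rff = RBFSampler(gamma=0.02, n_components=500, random_state=0)
z_train = rff.fit_transform(train)

# A linear machine in the randomised feature space replaces the nonlinear 1SVM.
linear_1svm = OneClassSVM(kernel="linear", nu=0.1).fit(z_train)

inlier = rng.normal(0.0, 1.0, size=(1, 20))
outlier = np.full((1, 20), 4.0)  # far outside the training distribution
inlier_score = linear_1svm.decision_function(rff.transform(inlier))[0]
outlier_score = linear_1svm.decision_function(rff.transform(outlier))[0]
print(inlier_score > outlier_score)
```

The pay-off is scalability: a linear machine on randomised features trains in time roughly linear in the number of records, avoiding the kernel matrix that makes a nonlinear 1SVM expensive on large-scale datasets.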