Computing and Information Systems - Theses


Search Results

Now showing 1 - 3 of 3
  • Item
    Anomaly detection in data streams: challenges and techniques
    Salehi, Mahsa (2015)
    Anomaly detection in data streams plays a vital role in on-line data mining applications, such as network intrusion detection, environmental monitoring and road traffic analysis. However, there are significant challenges with anomaly detection in streaming environments, and in this thesis we propose effective and efficient techniques to address these challenges. A major challenge for anomaly detection in these applications is the dynamically changing nature of the monitoring environments. This causes a problem for traditional anomaly detection techniques, which assume a relatively static monitoring environment and hence construct a static model of normal behaviour as the basis for anomaly detection. However, in an environment that changes intermittently (known as a switching data stream), such an approach can yield a high error rate in terms of false positives. To cope with the challenge of dynamic environments, we require an approach that can learn from the history of normal behaviour in a data stream, while accounting for the fact that not all time periods in the past are equally relevant. Consequently, to address this problem, we first propose a relevance-weighted ensemble model for learning normal behaviour, which forms the basis of our anomaly detection scheme. The second challenge for anomaly detection in data streams is the high rate of incoming observations. Since traditional anomaly detection approaches require multiple passes over a dataset, they are not applicable to data streams: processing each observation multiple times is infeasible given the unbounded amount of data generated at a high rate. The advantage of our proposed relevance-weighted ensemble model is that it improves detection accuracy by making use of relevant history while remaining computationally efficient, thus addressing both major challenges in data streams.
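The relevance-weighting idea can be illustrated with a minimal sketch. This is not the thesis's algorithm: the class name, the (mean, std) window summaries and the exponential weighting are all illustrative assumptions. Each finished window is summarised by a simple model of normal behaviour; when scoring against the current window, each past model is weighted by how close its regime is to the current one, so history from irrelevant regimes contributes little.

```python
import math

class RelevanceWeightedEnsemble:
    """Illustrative sketch: weight past window models by relevance."""

    def __init__(self):
        self.models = []  # (mean, std) summaries of past windows

    def add_window(self, window):
        n = len(window)
        mean = sum(window) / n
        var = sum((x - mean) ** 2 for x in window) / n
        self.models.append((mean, max(math.sqrt(var), 1e-9)))

    def score(self, x, current_window):
        """Relevance-weighted average of per-model z-scores; higher = more anomalous."""
        cur_mean = sum(current_window) / len(current_window)
        weights, scores = [], []
        for mean, std in self.models:
            # relevance: past regimes close to the current one get high weight
            w = math.exp(-abs(mean - cur_mean) / std)
            weights.append(w)
            scores.append(abs(x - mean) / std)
        total = sum(weights) or 1.0
        return sum(w * s for w, s in zip(weights, scores)) / total
```

A model learned during a regime centred on a very different mean is effectively switched off by its weight, which is the intuition behind learning from only the relevant parts of the stream's history.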
We then propose two ensemble-based approaches for anomaly detection, called Biased SubSampling (BSS) and Diversity-based Biased SubSampling (DBSS), which improve the detection accuracy of each ensemble detector on the one hand and induce diversity among the detectors on the other, both in an unsupervised manner. We discuss the effectiveness of our approaches in terms of the bias-variance trade-off. Such an approach is effective in improving the detection accuracy for outliers and can potentially be applied to streaming data. With the growing need to analyze high-speed data streams, the task of anomaly detection becomes even more challenging, as traditional anomaly detection techniques can no longer assume that all the data can be stored for processing. This motivates our third major challenge in anomaly detection in data streams, namely the unbounded quantity of data points. To address this challenge we propose a memory-efficient incremental local outlier detection algorithm for data streams called MiLOF, and a more flexible version called MiLOF_F, which achieve accuracy close to that of the incremental local outlier factor (iLOF) algorithm, a well-known density-based outlier detection algorithm, but within a fixed memory bound and with lower time complexity. Hence, our proposed methods are well suited to application environments with limited memory (e.g., wireless sensor networks), and can be applied to high-volume data streams. In addition, MiLOF_F is robust to changes in the number of data points, the number of underlying clusters and the number of dimensions in the data stream. While ensemble techniques have been widely used in supervised learning tasks, all of the approaches in this thesis make novel use of ensemble techniques and clustering models for anomaly detection in data streams in an unsupervised manner, which is the setting in many real applications.
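The fixed-memory-bound idea can be sketched as follows. This is an illustrative stand-in, not the published MiLOF algorithm: the class name and parameters are assumptions, scoring uses a simple k-nearest-neighbour distance in place of the local outlier factor, and the summarisation step crudely averages contiguous chunks rather than clustering. The key point it demonstrates is that when the buffer overflows, the oldest half is compressed into a few summary points, so history survives in summarised form while memory stays bounded.

```python
class FixedMemoryDetector:
    """Illustrative sketch of memory-bounded streaming outlier detection."""

    def __init__(self, capacity=8, n_summaries=2, k=3):
        self.capacity = capacity        # fixed memory bound (stored points)
        self.n_summaries = n_summaries  # summary points kept per compression
        self.k = k                      # neighbours used for scoring
        self.points = []

    def add(self, x):
        self.points.append(x)
        if len(self.points) > self.capacity:
            half = len(self.points) // 2
            old, recent = self.points[:half], self.points[half:]
            # crude summarisation: average contiguous chunks of the old half
            size = max(1, half // self.n_summaries)
            centroids = [sum(old[i:i + size]) / len(old[i:i + size])
                         for i in range(0, half, size)]
            self.points = centroids + recent

    def score(self, x):
        """Mean distance to the k nearest stored points; higher = more anomalous."""
        dists = sorted(abs(x - p) for p in self.points)
        k = min(self.k, len(dists))
        return sum(dists[:k]) / k
```

However many points arrive, storage never exceeds the configured capacity, which is the property that makes this style of algorithm usable on memory-limited devices such as sensor nodes.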
Finally, this thesis concludes with a case study of car racing driver distraction using EEG brain signals of drivers, as an example of a potential application for anomaly detection in data streams.
  • Item
    Anomaly detection in participatory sensing networks
    MONAZAM ERFANI, SARAH (2015)
    Anomaly detection or outlier detection aims to identify unusual values in a given dataset. In particular, there is growing interest in collaborative anomaly detection, where multiple data sources submit their data to an online data mining service, in order to detect anomalies with respect to the wider population. By combining data from multiple sources, collaborative anomaly detection aims to improve detection accuracy through the construction of a more robust model of normal behaviour. Cloud-based collaborative architectures such as Participatory Sensing Networks (PSNs) provide an open distributed platform that enables participants to share and analyse their local data on a large scale. Two major issues with collaborative anomaly detection are how to ensure the privacy of participants’ data, and how to efficiently analyse the large-scale high-dimensional data collected in these networks. The first problem we address is the issue of data privacy in PSNs, by introducing a framework for privacy-preserving collaborative anomaly detection with efficient local data perturbation at participating nodes, and global processing of the perturbed records at a data mining server. The data perturbation scheme that we propose enables the participants to perturb their data independently, without requiring the cooperation of other parties. As a result, our privacy-preservation approach is scalable to large numbers of participants and is computationally efficient. By collecting the participants’ data, the PSN server can generate a global anomaly detection model from these locally perturbed records. The global model identifies interesting measurements or unusual patterns in participants’ data without revealing the true values of the measurements. In terms of privacy, the proposed scheme thwarts several major types of attacks, namely Independent Component Analysis (ICA), distance-inference, Maximum a Posteriori (MAP) and collusion attacks.
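The spirit of independent local perturbation can be sketched as follows. This is not the thesis's actual transformation: `make_perturber`, the `tanh` squashing and the Gaussian matrix are illustrative assumptions. It shows the structural property the abstract describes — each participant combines a nonlinear map with its own random linear transformation, generated locally with no cooperation from other parties, so the server only ever receives perturbed vectors.

```python
import math
import random

def make_perturber(dim, seed):
    """Build a participant-specific perturbation function (illustrative only)."""
    rng = random.Random(seed)  # participant-specific randomness, kept local
    # random linear map: a dim x dim matrix of Gaussian entries
    R = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]

    def perturb(record):
        z = [math.tanh(v) for v in record]  # nonlinear squashing
        # apply the participant's private random linear transformation
        return [sum(R[i][j] * z[j] for j in range(dim)) for i in range(dim)]

    return perturb
```

Because each participant draws its own transformation, two participants uploading the same raw record produce different perturbed records, which is one ingredient in resisting collusion between parties.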
We further improve the privacy of our data perturbation scheme by: (i) redesigning the nonlinear transformation to better defend against MAP estimation attacks for normal and anomalous records, and (ii) supporting individual random linear transformations for each participant in order to provide the system with greater resistance to malicious collusion. A notable advantage of our perturbation scheme is that it preserves participants’ privacy while achieving comparable accuracy to non-privacy preserving anomaly detection techniques. The second problem we address in the thesis is how to model and interpret the large volumes of high-dimensional data that are generated in participatory domains by using One-class Support Vector Machines (1SVMs). While 1SVMs are effective at producing decision surfaces for anomaly detection from well-behaved feature vectors, they can be inefficient at modelling the variations in large, high-dimensional datasets. We overcome this challenge by taking two different approaches. The first approach is an unsupervised hybrid architecture, in which a Deep Belief Network (DBN) is used to extract generic underlying features, in combination with a 1SVM that uses the features learned by the DBN. DBNs have important advantages as feature detectors for anomaly detection, as DBNs use unlabelled data to capture higher-order correlations among features. Furthermore, using a DBN to reduce the number of irrelevant and redundant features improves the scalability of a 1SVM for use with large training datasets containing high-dimensional records. Our hybrid approach is able to generate an accurate anomaly detection model with lower computational and memory complexity compared to a 1SVM on its own. Alternatively, to overcome the shortcomings of 1SVMs in processing high-dimensional datasets, in our second approach we calculate a lower rank approximation of the optimisation problem that underlies the 1SVM training task. 
Instead of performing the optimisation in a high-dimensional space, the optimisation is conducted in a space of reduced dimension but on a larger neighbourhood. We leverage the theory of nonlinear random projections and propose the Reduced 1SVM (R1SVM), an efficient and scalable anomaly detection technique that can be trained on large-scale datasets. The main objective of R1SVM is to replace a nonlinear machine with a combination of randomised features and a linear machine. In summary, we have proposed efficient privacy-preserving anomaly detection approaches for PSNs, and scalable data modelling approaches for high-dimensional datasets, which lower the computational and memory complexity compared to traditional anomaly detection techniques. We have shown that the proposed methods achieve higher or comparable accuracy in detecting anomalies compared to existing state-of-the-art techniques.
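The randomised-features-plus-linear-machine idea can be illustrated with a sketch. This is not the thesis's implementation: it uses random Fourier features (one standard realisation of nonlinear random projections, approximating an RBF kernel), and in place of a trained linear 1SVM it substitutes a much simpler linear scorer, distance from the mean feature vector. All function names and parameters are assumptions.

```python
import math
import random

def random_fourier_features(dim, n_features, gamma=1.0, seed=0):
    """Random Fourier feature map approximating an RBF kernel."""
    rng = random.Random(seed)
    W = [[rng.gauss(0, math.sqrt(2 * gamma)) for _ in range(dim)]
         for _ in range(n_features)]
    b = [rng.uniform(0, 2 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def z(x):
        return [scale * math.cos(sum(W[i][j] * x[j] for j in range(dim)) + b[i])
                for i in range(n_features)]

    return z

def fit_linear_scorer(z, train):
    """Fit a trivial linear model in feature space (stand-in for a linear 1SVM)."""
    feats = [z(x) for x in train]
    mean = [sum(f[i] for f in feats) / len(feats) for i in range(len(feats[0]))]

    def score(x):  # higher = more anomalous
        fx = z(x)
        return math.sqrt(sum((a - m) ** 2 for a, m in zip(fx, mean)))

    return score
```

The point of the construction is that all nonlinearity is pushed into a cheap, data-independent random map, after which only a linear model needs to be trained, which is what makes the approach scale to large, high-dimensional datasets.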
  • Item
    Anomaly detection in heterogeneous sensed data
    MOSHTAGHI, MASUD (2013)
    Wireless Sensor Networks (WSNs) provide a cost-effective platform for monitoring and data collection in environments where the deployment of wired sensing infrastructure is too expensive or impractical. Many applications of WSNs involve detecting an event in the environment. Gathering all the data from the sensors and trying to analyze it to find events is a cumbersome task, as the target event usually happens infrequently. Given the energy-intensive nature of radio transmissions, the limited energy resources of the network can quickly become depleted if the raw data from the nodes has to be transmitted to a single location. Therefore, a major challenge is how to detect interesting or abnormal measurements in the large volume of temporally and spatially correlated data. This research has developed efficient anomaly detection algorithms through modeling the normal behavior of the measurements in wireless sensor networks. These algorithms are able to identify events and faults in the monitored environment while reducing communication between the nodes, thus saving the limited energy of the nodes. We first introduce a framework for anomaly detection with efficient local processing using hyperellipsoidal summaries of the data at the nodes, and global processing of these local summaries. The global processing provides us with an understanding of the network as a whole and helps us model different characteristics of the data in non-homogeneous networks. In contrast, the local processing helps to identify interesting measurements or patterns locally at each node. We show that this framework can significantly reduce the communication overhead of a centralized scheme in which all the data is transmitted to the sink. The rest of this research is focused on improving the global and local data processing aspects of this framework. We propose an efficient clustering algorithm that can be executed within the limited computational capabilities of sensor nodes.
This algorithm allows the nodes to build multiple hyperellipsoidal summaries of their local data, which can then be forwarded to the base station. This local data processing method can be used when multiple distributions may appear in the data of a single node. The base station compares and clusters the elliptical summaries from the nodes to find a global model for the network. The accuracy of the global model therefore relies largely on the definition of (dis)similarity between the hyperellipsoids. We introduce three similarity measures for pairs of ellipsoids that take the shape, orientation and location of the hyperellipsoids into consideration. We first provide the underlying theory and a proof for each of these measures, and then compare and evaluate them on a range of synthetic and real-life datasets. We then present an adaptive method that allows the model to change after the training period. It starts with a small batch of data for initialization, and then incrementally updates the parameters of the global hyperellipsoidal decision boundaries using the data available at the base station. The sink uses the anomaly messages sent by the nodes to adapt the model. We finally propose two incremental data modeling approaches, which are designed to suit the streaming nature of data in WSNs. The first model, called Incremental Data Capture Anomaly Detection (IDCAD), iteratively calculates an elliptical boundary for anomaly detection at a node. This model is able to detect changes and anomalies in the long-term characteristics of the data. The second model is a predictive dynamic model called the Iterative Fuzzy Regression Model (iFRM). This model builds on the IDCAD model and can detect long-term anomalies, while its prediction capability also gives it the ability to detect dynamic anomalies. These two approaches provide real-time decision making at the node level.
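The hyperellipsoidal idea can be sketched in two dimensions. This is an illustrative stand-in rather than the IDCAD algorithm from the thesis: a node incrementally updates the mean and covariance of its measurements (Welford-style updates), and a new reading is flagged as anomalous when its Mahalanobis distance from the ellipsoid's centre exceeds a threshold. The class name and the fixed threshold are assumptions.

```python
class EllipsoidNode:
    """Illustrative 2-D elliptical boundary for anomaly detection at a node."""

    def __init__(self, threshold=3.0):
        self.n = 0
        self.mean = [0.0, 0.0]
        # running sum of outer products of deviations (2x2), Welford-style
        self.S = [[0.0, 0.0], [0.0, 0.0]]
        self.threshold = threshold

    def update(self, x):
        """Incrementally fold one measurement into the mean and covariance."""
        self.n += 1
        d = [x[0] - self.mean[0], x[1] - self.mean[1]]
        self.mean = [m + di / self.n for m, di in zip(self.mean, d)]
        d2 = [x[0] - self.mean[0], x[1] - self.mean[1]]
        for i in range(2):
            for j in range(2):
                self.S[i][j] += d[i] * d2[j]

    def mahalanobis(self, x):
        """Distance of x from the ellipsoid centre, scaled by the covariance."""
        cov = [[self.S[i][j] / max(self.n - 1, 1) for j in range(2)]
               for i in range(2)]
        det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
        inv = [[cov[1][1] / det, -cov[0][1] / det],
               [-cov[1][0] / det, cov[0][0] / det]]
        d = [x[0] - self.mean[0], x[1] - self.mean[1]]
        return (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
                + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1])) ** 0.5

    def is_anomaly(self, x):
        return self.mahalanobis(x) > self.threshold
```

Because the summary is just a mean and a small covariance matrix, it is cheap to maintain on a sensor node and cheap to transmit, which is what allows the base station to cluster and compare such summaries instead of receiving raw data.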
In summary, we have proposed efficient data modeling approaches for anomaly detection in WSNs, and a framework for distributed decision making which lowers the communication overhead in the network compared to a communication-intensive centralized scheme. We have shown that the proposed methods achieve higher or comparable accuracy in detecting anomalies compared to existing state-of-the-art techniques.