Computing and Information Systems - Theses


Search Results

Now showing 1 - 5 of 5
  • Item
    Robust and efficient unsupervised anomaly detection in complex and dynamic environments
    Ghafoori, Zahra ( 2018)
    Creating value from data is the ultimate goal of all sensing platforms within the Internet of Things (IoT). This value is derived from data analytics that can support effective and timely decision making for a variety of business applications. An important part of this process is anomaly detection, which is typically used for two main purposes. First, anomaly detection can be used to clean data prior to applying pattern classification or prediction techniques; this purpose is often referred to as outlier removal or data cleansing rather than anomaly detection. Second and more importantly, anomaly detection can be used to detect events that have practical impact, such as intrusions or faults. Anomaly detection for this second purpose has attracted considerable attention in the last decade, since such events usually have negative effects and should be detected and dealt with as soon as possible to minimise their impact. The wide use of sensors, from our smartphones to advanced monitoring systems, has created new applications that introduce new challenges for machine learning and data mining techniques. Sensor data is being generated at ever higher rates and in higher-dimensional data streams. In this context, the unavailability of labelled training data, changes in normal behaviour over time, and time and memory constraints demand new methods for unsupervised anomaly detection. In this thesis, we look at the challenges for anomaly detection in this context and propose novel methods to cope with them. We first motivate the need for unsupervised model-based anomaly detection and provide a comprehensive survey of methods from this perspective. We then enhance a widely used semi-supervised model-based anomaly detection technique so that it can perform unsupervised anomaly detection in a robust and efficient manner. 
This method builds the foundations of our unsupervised model-based anomaly detection framework, which combines two similarity measures to effectively and efficiently detect unknown anomalies that appear in clusters or as individual instances in a training set. Based on this framework, we propose an unsupervised adaptive anomaly detection technique for non-stationary environments. This technique is capable of detecting unknown change points in normal behaviour using an unsupervised change point detection module. In addition, it can automatically select relevant instances from recent instances in a sliding window for learning new patterns. Our technique is ensemble-based and manages the ensemble of experts by autonomously removing obsolete models. Learning new models is performed using our unsupervised model-based anomaly detection framework. Finally, this technique is efficient for non-stationary environments, because it does not require continuous learning. We also propose an unsupervised dimensionality reduction technique to learn data representations that preserve or enlarge the dissimilarity between normal data and anomalies. This method increases the robustness of our unsupervised model-based anomaly detection framework, especially when there are clusters of anomalies in a training set. We finally propose a novel unsupervised anomaly detection technique based on estimating data density, sampling and cluster validity checking. This new method has constant training time and labels each new instance in (near) real-time. In addition, it is robust to the presence of a high fraction of anomalies in its training set, and handles various types of anomalies. Finally, if a limited budget is available to boost the accuracy of unsupervised learning on a dataset by requesting a few labels from an expert, our new method can make smart decisions and ask for labels of critical patterns. 
In summary, we propose efficient model-based and unsupervised dimensionality reduction and anomaly detection methods. The proposed methods can be used for anomaly detection in general, when the environment is non-stationary, or for (near) real-time decision making. We show that the proposed methods achieve higher or comparable accuracy in detecting anomalies compared to existing state-of-the-art techniques.
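The density-based detection idea sketched in this abstract can be illustrated with a minimal, generic example (not taken from the thesis): score each point by the distance to its k-th nearest neighbour, so that points in sparse regions receive high anomaly scores. All names and parameters below are illustrative assumptions.

```python
import numpy as np

def knn_density_scores(X, k=5):
    """Score each point by its distance to its k-th nearest neighbour.

    Larger scores indicate lower local density, i.e. more anomalous
    points. This is a generic density-based scorer, used here only to
    illustrate the idea of unsupervised, density-driven detection.
    """
    # Pairwise Euclidean distances (n x n).
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    # Sort each row; column 0 is the distance of a point to itself (0),
    # so column k is the distance to the k-th nearest neighbour.
    dists.sort(axis=1)
    return dists[:, k]

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 2))      # dense normal cluster
anomalies = rng.uniform(8.0, 10.0, size=(5, 2))   # far-away anomalies
X = np.vstack([normal, anomalies])

scores = knn_density_scores(X, k=5)
flagged = np.argsort(scores)[-5:]   # flag the 5 highest scores
print(sorted(int(i) for i in flagged))
```

Fully unsupervised methods of this kind need no labels, but a naive all-pairs distance computation costs O(n^2); the abstract's emphasis on constant training time and (near) real-time labelling points at avoiding exactly that cost.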
  • Item
    Anomaly detection in large evolving graphs
    Rashidi, Lida ( 2017)
    Anomaly detection plays a vital role in various application domains including network intrusion detection, environmental monitoring and road traffic analysis. However, a major challenge in anomaly detection is how to mine datasets where objects possess causal or non-causal relationships, such as friendship, citation and communication relationships. This type of relational data can be represented as a graph, which raises the challenge of extending anomaly detection to relational datasets such as graphs. Anomalies often provide valuable insights into the correlation between an abnormal pattern and a real-world phenomenon. Therefore, there has been growing attention towards anomaly detection schemes for dynamic networks. Although these evolving networks impose a curse of dimensionality on the learning models, they usually contain structural properties that anomaly detection schemes can exploit. The major challenge is finding a feature extraction technique that preserves graph structure while balancing the accuracy of the model against its scalability. Another challenge in graph anomaly detection is interpretability and visualisation. In this thesis, we tackle these challenges and propose multiple approaches, where each approach addresses several or all of these challenges. Our first scheme uses a scalable technique known as random projection as a method for structure-aware embedding. We show that this scheme can preserve useful relational properties of the network, and present an analytical proof of this claim. We also analyse the effect of embedding on the accuracy of one-class support vector machines for anomaly detection on real and synthetic graph datasets. We demonstrate that the embedding can be effective in terms of scalability without detrimental influence on the accuracy of the learned model. Our next approach focuses on community memberships and how they can change during graph evolution. 
We devise a graph specific non-negative matrix factorization technique, which takes into account the influence that vertices have on each other. This method is interpretable in terms of event attribution and can provide a low-rank approximation of the original graph. Moreover, this approach is applicable in community detection and monitoring community evolution throughout time. Finally, we propose a more generalized graph pre-processing technique that can both extract structural features from the networks and expedite graph mining techniques such as anomaly detection. We focus on detecting anomalies in a sequence of graphs based on rank correlations of the reordered nodes. The merits of our approach lie in its simplicity and resilience to challenges such as unsupervised input, large volumes and high velocities of data. We evaluate the scalability and accuracy of our method on real graphs, where our method facilitates graph processing while producing more deterministic orderings. We show that the proposed approach is capable of revealing anomalies in a more efficient manner based on node rankings. Furthermore, our method can produce visual representations of graphs that are useful for graph compression and can be used as a pre-processing step to expedite common graph mining algorithms such as PageRank.
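The random-projection embedding mentioned in this abstract rests on the Johnson-Lindenstrauss property: a Gaussian random matrix approximately preserves Euclidean distances, so a low-dimensional embedding can stand in for high-dimensional node features. The sketch below is a generic illustration of that property, not the thesis's graph-specific scheme; the matrix sizes and sparsity level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "graph": each row of A is a node's sparse adjacency-style
# feature vector (n nodes, d original dimensions).
n, d, k = 50, 400, 40
A = (rng.random((n, d)) < 0.05).astype(float)

# Gaussian random projection: entries i.i.d. N(0, 1/k), so Euclidean
# distances are approximately preserved (Johnson-Lindenstrauss).
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))
E = A @ R   # k-dimensional embedding of each node

# Distances in the embedded space approximate those in the original space.
orig = np.linalg.norm(A[0] - A[1])
emb = np.linalg.norm(E[0] - E[1])
print(orig, emb)
```

Because distances survive the projection, a distance-based learner such as a one-class SVM can be trained on the k-dimensional embedding at a fraction of the original cost, which is the scalability argument the abstract makes.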
  • Item
    Anomaly detection in data streams: challenges and techniques
    Salehi, Mahsa ( 2015)
    Anomaly detection in data streams plays a vital role in on-line data mining applications, such as network intrusion detection, environmental monitoring and road traffic analysis. However, there are significant challenges with anomaly detection in streaming environments, and in this thesis we propose effective and efficient techniques to address these challenges. A major challenge for anomaly detection in these applications is the dynamically changing nature of these monitoring environments. This causes a problem for traditional anomaly detection techniques, which assume a relatively static monitoring environment and hence construct a static model of normal behaviour as the basis for anomaly detection. However, in an environment that is intermittently changing (known as a switching data stream), such an approach can yield a high error rate in terms of false positives. To cope with the challenge of dynamic environments, we require an approach that can learn from the history of normal behaviour in a data stream, while accounting for the fact that not all time periods in the past are equally relevant. Consequently, to address this problem we first propose a relevance-weighted ensemble model for learning normal behaviour, which forms the basis of our anomaly detection scheme. The second challenge for anomaly detection in data streams is the high rate of incoming observations. Since traditional anomaly detection approaches require multiple passes over a dataset, they are not applicable to data streams. In a streaming setting, processing each observation multiple times is not feasible due to the unbounded amount of data that is generated at a high rate. The advantage of our proposed relevance-weighted ensemble model is that it can improve detection accuracy by making use of relevant history, while remaining computationally efficient, thus addressing both major challenges in data streams. 
We then propose two ensemble-based approaches, Biased SubSampling (BSS) and Diversity-based Biased SubSampling (DBSS), for anomaly detection, which improve the detection accuracy of each ensemble detector on one hand and induce diversity among the detectors on the other, both in an unsupervised manner. We discuss the effectiveness of our approaches in terms of the bias-variance trade-off. Such an approach is effective in improving outlier detection accuracy and can potentially be used on streaming data. With the growing need to analyse high-speed data streams, the task of anomaly detection becomes even more challenging, as traditional anomaly detection techniques can no longer assume that all the data can be stored for processing. This motivates our third major challenge in anomaly detection in data streams, i.e., the unbounded quantity of data points. To address this challenge we propose a memory-efficient incremental local outlier detection algorithm for data streams called MiLOF, and a more flexible version called MiLOF_F, which achieve accuracy close to that of the incremental local outlier factor (iLOF) algorithm, a well-known density-based outlier detection algorithm, but within a fixed memory bound and with lower time complexity. Hence, our proposed methods are well suited to application environments with limited memory (e.g., wireless sensor networks), and can be applied to high-volume data streams. In addition, MiLOF_F is robust to changes in the number of data points, the number of underlying clusters and the number of dimensions in the data stream. While a variety of approaches have been widely used in supervised learning tasks, all of our approaches in this thesis provide novel contributions through the use of ensemble techniques and clustering models for anomaly detection in data streams in an unsupervised manner, which is the case in many real applications. 
Finally, the thesis concludes with a case study of car racing driver distraction using EEG brain signals of drivers, as an example of a potential application for anomaly detection in data streams.
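The relevance-weighted ensemble idea for switching streams can be sketched in miniature: keep a simple model per historical window, weight each model by its relevance to the current window, and combine the per-model anomaly scores. The models (per-window mean and standard deviation) and the relevance measure (similarity of window means) below are deliberately simplistic illustrative choices, not the thesis's actual formulation.

```python
import numpy as np

def window_model(window):
    """Summarise a window of 1-D observations as (mean, std)."""
    return window.mean(), window.std() + 1e-9

def relevance_weighted_score(x, models, current):
    """Score x against an ensemble of historical window models.

    Each model is weighted by its relevance to the current window,
    measured here (an illustrative choice) by similarity of means.
    """
    cur_mean, _ = window_model(current)
    weights = np.array([np.exp(-abs(m - cur_mean)) for m, _ in models])
    weights /= weights.sum()
    # Per-model anomaly score: absolute z-score of x under that model.
    scores = np.array([abs(x - m) / s for m, s in models])
    return float(weights @ scores)

rng = np.random.default_rng(2)
# Switching stream: regime A (mean 0), then regime B (mean 5).
models = [window_model(rng.normal(0, 1, 100)),
          window_model(rng.normal(5, 1, 100))]
current = rng.normal(0, 1, 100)   # stream has switched back to regime A

normal_score = relevance_weighted_score(0.1, models, current)
anomaly_score = relevance_weighted_score(9.0, models, current)
print(normal_score, anomaly_score)
```

Because the weighting down-ranks the obsolete regime-B model rather than discarding it, a later switch back to regime B would not trigger a burst of false positives, which is the point the abstract makes about switching data streams.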
  • Item
    Anomaly detection in participatory sensing networks
    MONAZAM ERFANI, SARAH ( 2015)
    Anomaly detection or outlier detection aims to identify unusual values in a given dataset. In particular, there is growing interest in collaborative anomaly detection, where multiple data sources submit their data to an online data mining service, in order to detect anomalies with respect to the wider population. By combining data from multiple sources, collaborative anomaly detection aims to improve detection accuracy through the construction of a more robust model of normal behaviour. Cloud-based collaborative architectures such as Participatory Sensing Networks (PSNs) provide an open distributed platform that enables participants to share and analyse their local data on a large scale. Two major issues with collaborative anomaly detection are how to ensure the privacy of participants’ data, and how to efficiently analyse the large-scale high-dimensional data collected in these networks. The first problem we address is the issue of data privacy in PSNs, by introducing a framework for privacy-preserving collaborative anomaly detection with efficient local data perturbation at participating nodes, and global processing of the perturbed records at a data mining server. The data perturbation scheme that we propose enables the participants to perturb their data independently without requiring the cooperation of other parties. As a result our privacy-preservation approach is scalable to large numbers of participants and is computationally efficient. By collecting the participants’ data, the PSN server can generate a global anomaly detection model from these locally perturbed records. The global model identifies interesting measurements or unusual patterns in participants’ data without revealing the true values of the measurements. In terms of privacy, the proposed scheme thwarts several major types of attacks, namely, the Independent Component Analysis (ICA), Distance-inference, Maximum a Posteriori (MAP), and Collusion attacks. 
We further improve the privacy of our data perturbation scheme by: (i) redesigning the nonlinear transformation to better defend against MAP estimation attacks for normal and anomalous records, and (ii) supporting individual random linear transformations for each participant in order to provide the system with greater resistance to malicious collusion. A notable advantage of our perturbation scheme is that it preserves participants’ privacy while achieving accuracy comparable to non-privacy-preserving anomaly detection techniques. The second problem we address in the thesis is how to model and interpret the large volumes of high-dimensional data that are generated in participatory domains by using One-class Support Vector Machines (1SVMs). While 1SVMs are effective at producing decision surfaces for anomaly detection from well-behaved feature vectors, they can be inefficient at modelling the variations in large, high-dimensional datasets. We overcome this challenge by taking two different approaches. The first approach is an unsupervised hybrid architecture, in which a Deep Belief Network (DBN) is used to extract generic underlying features, in combination with a 1SVM that uses the features learned by the DBN. DBNs have important advantages as feature detectors for anomaly detection, as DBNs use unlabelled data to capture higher-order correlations among features. Furthermore, using a DBN to reduce the number of irrelevant and redundant features improves the scalability of a 1SVM for use with large training datasets containing high-dimensional records. Our hybrid approach is able to generate an accurate anomaly detection model with lower computational and memory complexity compared to a 1SVM on its own. Alternatively, to overcome the shortcomings of 1SVMs in processing high-dimensional datasets, in our second approach we calculate a lower-rank approximation of the optimisation problem that underlies the 1SVM training task. 
Instead of performing the optimisation in a high-dimensional space, the optimisation is conducted in a space of reduced dimension but over a larger neighbourhood. We leverage the theory of nonlinear random projections and propose the Reduced 1SVM (R1SVM), an efficient and scalable anomaly detection technique that can be trained on large-scale datasets. The main idea of R1SVM is to replace a nonlinear machine with randomised features followed by a linear machine. In summary, we have proposed efficient privacy-preserving anomaly detection approaches for PSNs, and scalable data modelling approaches for high-dimensional datasets, which lower the computational and memory complexity compared to traditional anomaly detection techniques. We have shown that the proposed methods achieve higher or comparable accuracy in detecting anomalies compared to existing state-of-the-art techniques.
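The "randomised features followed by a linear machine" idea can be sketched with random Fourier features (Rahimi and Recht's construction), which approximate an RBF kernel with an explicit low-dimensional map. As a crude stand-in for the trained linear one-class machine, the sketch below scores points by distance from the mean in the randomised feature space; this simplification, and every parameter value, is an assumption for illustration, not the thesis's R1SVM.

```python
import numpy as np

def random_fourier_features(X, k=200, gamma=0.5, seed=0):
    """Map X into k randomised features approximating an RBF kernel
    exp(-gamma * ||x - y||^2) via random Fourier features."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2 * gamma), size=(d, k))
    b = rng.uniform(0, 2 * np.pi, size=k)
    return np.sqrt(2.0 / k) * np.cos(X @ W + b)

rng = np.random.default_rng(3)
train = rng.normal(0, 1, size=(300, 5))   # normal training data
Z = random_fourier_features(train)
centre = Z.mean(axis=0)   # "linear model": the mean in feature space

def score(x):
    """Anomaly score: distance from the feature-space mean (a crude
    linear stand-in for a trained linear one-class SVM)."""
    z = random_fourier_features(x[None, :])
    return float(np.linalg.norm(z - centre))

normal_s = score(np.zeros(5))
anomaly_s = score(np.full(5, 6.0))
print(normal_s, anomaly_s)
```

The payoff is the one the abstract describes: training a linear machine on explicit randomised features scales linearly with the number of records, whereas a kernelised 1SVM must work with the full kernel matrix.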
  • Item
    Anomaly detection in heterogeneous sensed data
    MOSHTAGHI, MASUD ( 2013)
    Wireless Sensor Networks (WSNs) provide a cost-effective platform for monitoring and data collection in environments where the deployment of wired sensing infrastructure is too expensive or impractical. Many applications of WSNs involve detecting an event in the environment. Gathering all the data from the sensors and trying to analyze it to find events is a cumbersome task, as the target event usually happens infrequently. Given the energy-intensive nature of radio transmissions, the limited energy resources of the network can quickly become depleted if the raw data from the nodes has to be transmitted to a single location. Therefore, a major challenge is how to detect interesting or abnormal measurements in the large volume of temporally and spatially correlated data. This research has developed efficient anomaly detection algorithms by modeling the normal behavior of the measurements in wireless sensor networks. These algorithms are able to identify events and faults in the monitored environment while reducing communication between the nodes, thus saving the limited energy of the nodes. We first introduce a framework for anomaly detection with efficient local processing using hyperellipsoidal summaries of the data at the nodes, and global processing of these local summaries. The global processing provides us with an understanding of the network as a whole and helps us model different characteristics of the data in non-homogeneous networks. In contrast, the local processing helps to identify interesting measurements or patterns locally at each node. We show that this framework can significantly reduce the communication overhead of a centralized scheme where all the data is transmitted to the sink. The rest of this research focuses on improving the global and local data processing aspects of this framework. We propose an efficient clustering algorithm that can be executed with the limited computational capabilities of sensor nodes. 
This algorithm allows the nodes to build multiple hyperellipsoidal summaries of their local data, which can then be forwarded to the base station. This local data processing method can be used when multiple distributions may appear in the data of a single node. The base station compares and clusters the elliptical summaries from the nodes to find a global model for the network. Therefore, the accuracy of the global model largely relies on the definition of (dis)similarity between the hyperellipsoids. We introduce three similarity measures for pairs of ellipsoids that take the shape, orientation and location of the hyperellipsoids into consideration. We first provide the underlying theory and a proof for each of these measures, and then compare and evaluate them on a range of synthetic and real-life datasets. We then present an adaptive method that allows the model to change after the training period. It starts with a small batch of data for initialization, and then incrementally updates the parameters of the global hyperellipsoidal decision boundaries using the data available at the base station. The sink uses the anomaly messages sent by the nodes to adapt the model. We finally propose two incremental data modeling approaches, which are designed to suit the streaming nature of WSN data. The first model, called Incremental Data Capture Anomaly Detection (IDCAD), iteratively calculates an elliptical boundary for anomaly detection at a node. This model is able to detect changes and anomalies in the long-term characteristics of the data. The second model is a predictive dynamic model called the Iterative Fuzzy Regression Model (iFRM). This model benefits from the IDCAD model and can detect long-term anomalies, while its prediction capability gives it the ability to detect dynamic anomalies as well. These two approaches provide real-time decision making at the node level. 
In summary, we have proposed efficient data modeling approaches for anomaly detection in WSNs, and a framework for distributed decision making that lowers the communication overhead in the network compared to a communication-intensive centralized scheme. We have shown that the proposed methods achieve higher or comparable accuracy in detecting anomalies compared to existing state-of-the-art techniques.
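A hyperellipsoidal summary of the kind this abstract describes can be sketched as a sample mean plus inverse covariance: together they define a Mahalanobis-distance boundary, and points beyond a chosen threshold are flagged as anomalous. This is a minimal generic sketch of the concept, with arbitrary toy data and threshold, not the thesis's distributed algorithm.

```python
import numpy as np

def ellipsoid_summary(X):
    """Hyperellipsoidal summary of a node's data: the sample mean and
    inverse covariance, which define a Mahalanobis-distance boundary."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    return mean, np.linalg.inv(cov)

def mahalanobis(x, mean, inv_cov):
    """Mahalanobis distance of x from the summary's centre."""
    d = x - mean
    return float(np.sqrt(d @ inv_cov @ d))

rng = np.random.default_rng(4)
# Correlated 2-D sensor readings at one node.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=500)
mean, inv_cov = ellipsoid_summary(X)

# Points within ~t Mahalanobis units are normal; beyond, anomalous.
# A reading that breaks the correlation pattern scores high even
# though each coordinate alone looks plausible.
t = 3.0
normal_d = mahalanobis(np.array([0.5, 0.5]), mean, inv_cov)
anomaly_d = mahalanobis(np.array([3.0, -3.0]), mean, inv_cov)
print(normal_d, anomaly_d)
```

Only the mean and covariance (a handful of numbers per node) need to be transmitted to the base station, rather than the raw readings, which is the communication saving the framework exploits.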