Computing and Information Systems - Theses

    Anomaly detection in data streams: challenges and techniques
    Salehi, Mahsa (2015)
    Anomaly detection in data streams plays a vital role in on-line data mining applications, such as network intrusion detection, environmental monitoring and road traffic analysis. However, there are significant challenges with anomaly detection in streaming environments, and in this thesis we propose effective and efficient techniques to address these challenges. A major challenge for anomaly detection in these applications is the dynamically changing nature of the monitoring environments. This causes a problem for traditional anomaly detection techniques, which assume a relatively static monitoring environment and hence construct a static model of normal behaviour as the basis for anomaly detection. In an environment that is intermittently changing (known as a switching data stream), such an approach can yield a high false positive rate. To cope with the challenge of dynamic environments, we require an approach that can learn from the history of normal behaviour in a data stream, while accounting for the fact that not all time periods in the past are equally relevant. To address this problem, we first propose a relevance-weighted ensemble model for learning normal behaviour, which forms the basis of our anomaly detection scheme. The second challenge for anomaly detection in data streams is the high rate of incoming observations. Since traditional approaches in anomaly detection require multiple passes over datasets, they are not applicable to data streams: processing each observation multiple times is not feasible due to the unbounded amount of data that is generated at a high rate. The advantage of our proposed relevance-weighted ensemble model is that it can improve the accuracy of detection by making use of relevant history while remaining computationally efficient, and thus addresses both of these challenges.
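As a rough illustration of the relevance-weighting idea described above, the following Python sketch scores a new observation against several models of past windows, weighting each model by how well it explains the most recent data. This is not the thesis's algorithm: the per-window Gaussian summary, the exponential relevance function, and the names `WindowModel`, `relevance` and `ensemble_score` are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


class WindowModel:
    """Hypothetical model of normal behaviour for one past window:
    just the mean and standard deviation of a 1-D signal."""

    def __init__(self, values):
        self.mu = mean(values)
        self.sigma = pstdev(values) or 1e-9  # guard against zero spread

    def score(self, x):
        # Distance from the window mean, in standard deviations.
        return abs(x - self.mu) / self.sigma


def relevance(model, recent_values):
    """Assumed relevance: high when the model explains recent data well."""
    return math.exp(-mean(model.score(v) for v in recent_values))


def ensemble_score(models, recent_values, x):
    """Relevance-weighted average of the per-model anomaly scores for x."""
    weights = [relevance(m, recent_values) for m in models]
    total = sum(weights)
    if total == 0:
        return 0.0  # no past model is relevant; no evidence either way
    return sum(w * m.score(x) for w, m in zip(weights, models)) / total


# A switching stream with two regimes, one near 0 and one near 10.
models = [
    WindowModel([0.0, 0.2, -0.1, 0.1, -0.2]),
    WindowModel([10.0, 10.2, 9.9, 10.1, 9.8]),
]
recent = [10.1, 9.9, 10.0]  # the stream is currently in the second regime

normal_score = ensemble_score(models, recent, 10.1)
anomaly_score = ensemble_score(models, recent, 0.0)
```

In this toy setting the model built from the regime near 10 dominates the weighting, so a value of 0.0 receives a much higher score than 10.1, even though it would look normal under the other past model. Each observation is scored once against fixed-size summaries, which is the single-pass property the abstract emphasises.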
We then propose two ensemble-based approaches for anomaly detection, called Biased SubSampling (BSS) and Diversity-based Biased SubSampling (DBSS), in which we improve the detection accuracy of each ensemble detector on the one hand and induce diversity among the detectors on the other, both in an unsupervised manner. We discuss the effectiveness of our approaches in terms of the bias-variance trade-off. Such an approach is effective in improving the detection accuracy of outliers and can potentially be used on streaming data. With the growing need to analyze high-speed data streams, the task of anomaly detection becomes even more challenging, as traditional anomaly detection techniques can no longer assume that all the data can be stored for processing. This motivates our third major challenge in anomaly detection in data streams, i.e., the unbounded quantity of data points. To address this challenge we propose a memory-efficient incremental local outlier detection algorithm for data streams called MiLOF, and a more flexible version called MiLOF_F, which achieve an accuracy close to that of the incremental local outlier factor (iLOF) algorithm, a well-known density-based outlier detection algorithm, but within a fixed memory bound and with lower time complexity. Hence, our proposed methods are well suited to application environments with limited memory (e.g., wireless sensor networks), and can be applied to high-volume data streams. In addition, MiLOF_F is robust to changes in the number of data points, the number of underlying clusters and the number of dimensions in the data stream. While ensemble techniques and clustering models have been widely used in supervised learning tasks, the approaches in this thesis provide novel contributions by using them for anomaly detection in data streams in an unsupervised manner, which is the setting in many real applications.
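For readers unfamiliar with the local outlier factor on which iLOF and MiLOF build, the following Python sketch computes the standard batch LOF score for a small point set. It is not iLOF or MiLOF: it stores every point and recomputes all neighbourhoods, which is exactly the memory and time cost that the methods above are described as avoiding. The function names and the simplified tie handling (exactly k neighbours are used) are assumptions of this example.

```python
import math


def knn(points, i, k):
    """Return (index, distance) pairs for the k nearest neighbours of points[i]."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return [(j, d) for d, j in dists[:k]]


def lof_scores(points, k):
    """Batch LOF: values near 1 indicate inliers, values well above 1 outliers.

    Simplification: ties at the k-th distance are broken by index, and
    duplicate points (zero reachability sum) are not handled.
    """
    n = len(points)
    neighbours = [knn(points, i, k) for i in range(n)]
    k_dist = [nb[-1][1] for nb in neighbours]  # distance to the k-th neighbour

    def lrd(i):
        # Local reachability density: inverse of the mean reachability
        # distance, where reach-dist(i, j) = max(k_dist[j], d(i, j)).
        reach = [max(k_dist[j], d) for j, d in neighbours[i]]
        return len(reach) / sum(reach)

    lrds = [lrd(i) for i in range(n)]
    # LOF(i): mean density of i's neighbours divided by i's own density.
    return [
        sum(lrds[j] for j, _ in neighbours[i]) / (k * lrds[i])
        for i in range(n)
    ]


# Five clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = lof_scores(points, k=2)
```

Here the isolated point at (10, 10) receives by far the largest score, while the clustered points score close to 1.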
Finally, this thesis concludes with a case study of car racing driver distraction using EEG brain signals of drivers, as an example of a potential application for anomaly detection in data streams.