Computing and Information Systems - Theses

    Data stream clustering and anomaly detection
    Chenaghlou, Milad (2019)
    Data stream clustering and anomaly detection have grown in importance with the advent of hardware and software technologies that capture and generate continuous streams of sensor data. Stream data mining problems are particularly important in application domains such as network intrusion detection, road traffic analysis, social media analysis and military surveillance systems. However, a number of open challenges need to be addressed before stream clustering and anomaly detection can be used effectively in those applications.

    One of the main challenges in data stream clustering and anomaly detection is computational efficiency. In non-stationary data streams, where patterns change over time, algorithms need to identify and adapt to such changes. This requires an efficient way to test whether the current model accurately represents the patterns observed in the stream. To cope with this challenge, the processing time of the algorithm must scale linearly with the number of observed data points, and the memory requirements should be constant. Accordingly, we propose an efficient data stream anomaly detection algorithm that scales linearly with the number of data points.

    A second challenge is that, in many application domains, an online clustering algorithm should be able to both update the model and identify anomalies in real time. Current state-of-the-art online clustering algorithms either do not detect anomalies or detect them in a separate process when triggered by the user. Moreover, they only consider the spatial proximity of data points, rather than analysing the evolution of patterns in the stream. We propose an online clustering algorithm that considers the temporal proximity of observations as well as their spatial proximity to identify anomalies in real time. It identifies the evolution of clusters in noisy streams, incrementally updates the model and calculates the minimum window length over the evolving data stream without jeopardizing performance.

    Another challenge for clustering data streams arises as the number of dimensions increases. In high-dimensional data, conventional distance measures become less meaningful, which limits the effectiveness of distance-based clustering methods. One approach to this challenge is the use of subspace clustering algorithms, which identify a small number of features that can best explain the clusters in the stream. Subspace clustering algorithms for streaming environments address this challenge by reducing the infinite search space of arbitrarily oriented subspaces to a bounded number of axis-parallel (or projected) subspaces. Accordingly, we propose an arbitrarily oriented subspace clustering algorithm for time-series streams of unbounded length. This algorithm is incremental, which makes it suitable for streaming environments, and has lower memory requirements than state-of-the-art subspace clustering techniques. In particular, our algorithm can identify emerging subspace clusters, as well as clusters that overlap but appear in different subspaces and timespans.

    Finally, many data stream clustering algorithms require user-defined threshold parameters to identify and adapt to changes in non-stationary data streams. We propose a novel algorithm in the form of a control structure that can be mounted on other online clustering algorithms to guide them through the changes in the stream. This control structure uses an incremental cluster validity index as a basis for detecting and monitoring changes in the stream.

    In summary, we propose a range of efficient online anomaly detection and online clustering algorithms for streaming data. These algorithms are suitable for unlabeled data streams, which arise in a variety of real-world applications, and have the flexibility to be used in non-stationary environments where patterns emerge and change over time.
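
The efficiency requirement stated in the abstract (processing time linear in the number of observed points, constant memory) can be illustrated with a minimal sketch. This is not the algorithm proposed in the thesis; it is a generic running z-score detector built on Welford's incremental mean and variance update. The class name `RunningZScoreDetector`, the helper `detect`, and the `threshold` and `warmup` parameters are all assumptions made for this illustration.

```python
import math


class RunningZScoreDetector:
    """Illustrative only: flags values far from the running mean of a 1-D stream.

    The state is three numbers (n, mean, m2), so memory is constant and each
    point is processed in O(1) time, giving linear time overall.
    """

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # hypothetical z-score cut-off
        self.warmup = warmup        # points to observe before scoring starts
        self.n = 0                  # number of points seen so far
        self.mean = 0.0             # running mean
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, x):
        """Score x against the current model, then fold x into the model."""
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford's incremental update. As a simplification, every point is
        # folded into the model, including the ones flagged as anomalies.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly


def detect(stream, **kwargs):
    """Return the indices of the points flagged in a single pass."""
    detector = RunningZScoreDetector(**kwargs)
    return [i for i, x in enumerate(stream) if detector.update(x)]
```

Calling `detect(values)` on any iterable of floats makes one pass and stores nothing beyond the three running statistics. Because a single global mean and variance never forget old data, this sketch adapts only slowly to non-stationary streams, which is the situation the abstract targets.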