Computing and Information Systems - Theses

  • Item
    Scalable clustering of high dimensional data in non-disjoint axis-parallel subspaces
    Doan, Minh Tuan (2019)
    Clustering is the task of grouping similar objects together, where each group formed is called a cluster. Clustering is used to discover hidden patterns or underlying structures in data, and has a wide range of applications in areas such as the Internet of Things (IoT), biology, medicine, marketing, business, and computing. Recent developments in sensor and storage technology have led to rapid growth of data, in both volume and dimensionality. This growth raises challenges for existing clustering algorithms and has led to the development of subspace clustering algorithms that cope with the characteristics, volumes, and dimensionality of the datasets that are now available. In this thesis, we address the challenges of finding subspace clusters in high dimensional data to achieve subspace clustering with high quality and scalability. We provide a comprehensive literature review of existing algorithms and identify the open challenges in subspace clustering of high dimensional data that we address in this thesis, namely: devising appropriate similarity measures, finding non-disjoint subspace clusters, and achieving scalability to high dimensional data.
    We further illustrate these challenges in a real-life application. We show that clustering can be used to construct a meaningful model of the pedestrian distribution in the City of Melbourne, Australia, in low dimensional space. However, we demonstrate that the clustering quality deteriorates rapidly as the number of dimensions (pedestrian observation points) increases. This also serves as a motivating example of why subspace clustering is needed and which challenges need to be addressed.
    We first address the challenge of measuring similarity between data points, a key challenge in analyzing high dimensional data that directly impacts the clustering results. We propose a novel method that generates meaningful similarity measures for subspace clustering. Our method considers the similarity between any two points as the union of base similarities that are frequently observed in lower dimensional subspaces. This allows the method to first search for similarity in lower dimensional subspaces and then aggregate these similarity values to determine the overall similarity. We show that this method can be applied to similarity based on distance, density, and grids, which enables our similarity measurements to be used with different types of clustering algorithms, i.e., distance-based, density-based, and grid-based clustering algorithms.
    We then use our similarity measurement to build a subspace clustering algorithm that can find clusters in non-disjoint subspaces. Our proposed algorithm follows a bottom-up strategy: the first phase searches for base clusters in low dimensional subspaces, and the second phase forms clusters in higher dimensional subspaces by aggregating these base clusters. The novelty of our method is reflected in both phases. First, we show that our similarity measurement can be integrated into a subspace clustering algorithm. This not only helps prevent the false formation of clusters, but also significantly reduces the time and space complexity of the algorithm by pruning irrelevant subspaces at an early stage. Second, our algorithm transforms the usual sequential aggregation of base clusters into a frequent pattern mining problem, which enables efficient formation of clusters in high dimensional subspaces using FP-trees.
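    A minimal, hypothetical sketch of this bottom-up idea is given below; it is not the algorithm developed in the thesis. Base clusters are found per low dimensional subspace (here with scikit-learn's DBSCAN), each point is then described by the base clusters it belongs to, and groups of points that share enough base clusters become candidate clusters in the union of the corresponding subspaces. A plain enumeration of base-cluster combinations stands in for the FP-tree mining step, and all function names, parameters, and thresholds are illustrative assumptions.

        # Hypothetical sketch: bottom-up subspace clustering by aggregating base clusters.
        # Not the thesis algorithm; DBSCAN and the thresholds below are assumptions.
        from itertools import combinations
        import numpy as np
        from sklearn.cluster import DBSCAN

        def base_clusters(X, dims_per_subspace=1, eps=0.5, min_pts=5):
            """Cluster each low dimensional subspace separately and return
            {(subspace, label): set of point indices}."""
            n, d = X.shape
            clusters = {}
            for subspace in combinations(range(d), dims_per_subspace):
                labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X[:, subspace])
                for lab in set(labels) - {-1}:                 # -1 marks noise points
                    clusters[(subspace, lab)] = set(np.where(labels == lab)[0])
            return clusters

        def aggregate(clusters, min_shared=20, max_order=3):
            """Combine base clusters whose member sets overlap strongly; the union of
            their subspaces is the subspace of the aggregated, higher dimensional cluster."""
            found = []
            keys = list(clusters)
            for order in range(2, max_order + 1):
                for combo in combinations(keys, order):        # Apriori-style enumeration
                    members = set.intersection(*(clusters[k] for k in combo))
                    if len(members) >= min_shared:
                        dims = sorted({dim for (subspace, _) in combo for dim in subspace})
                        found.append((tuple(dims), members))
            return found

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            X = rng.normal(size=(300, 6))
            X[:100, :3] += 10                                  # a cluster living in dims 0-2
            print([(dims, len(m)) for dims, m in aggregate(base_clusters(X))][:5])

    Because base clusters from different subspaces can be combined freely, the same point may belong to clusters in several overlapping subspaces, which is the non-disjoint behaviour described above.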
    We then demonstrate that our proposed algorithm outperforms traditional bottom-up subspace clustering algorithms, as well as state-of-the-art algorithms based on other clustering strategies, in terms of clustering quality and scalability to large volumes and high dimensionality of data. Subsequently, we evaluate the ability of our algorithm to find clusters in datasets from three different real-life applications. First, we apply it to pedestrian measurements in the City of Melbourne and construct a meaningful model of the pedestrian distribution profiles that correspond to pedestrian activities at different times of the day; we then use the clustering to analyze the impact of a major change in Melbourne's public transport system on pedestrian activity. In the second application, we evaluate the ability of the algorithm to handle very high dimensional data, applying it to ten gene expression datasets comprising up to 10,000 dimensions. Third, we explore the use of our clustering results as an intermediate step in building a more complex model: we use them to construct an ensemble classification model and show that it improves the accuracy of predicting car parking occupancy in the central business district (CBD) of the City of Melbourne. By applying our methods to datasets of different sizes and dimensionalities across this range of applications, we demonstrate that our algorithm can cluster high dimensional datasets with complex structure and high levels of noise and outliers, and produce meaningful clustering results with practical impact.
  • Item
    Scaling learning algorithms using locality sensitive hashing
    Aye, Zay Maung Maung (2018)
    With increasing volumes of data, it is imperative that data analysis can scale appropriately. However, many common machine learning algorithms, e.g., metric learning, manifold landmark learning, and trajectory processing, suffer from poor computational complexity in the size of the training data. In this thesis, we propose generic methods for scaling up learning algorithms by utilizing locality sensitive hashing. First, we propose finding representative samples using locality sensitive hashing, and demonstrate the usefulness of these samples for large-scale supervised metric learning. Our methods achieve a quadratic speed-up with only a minimal decrease in accuracy. Second, we leverage representative samples for adaptive minibatch selection when fitting Gaussian processes for landmarking manifolds. Our methods exploit the compatibility between locality sensitive hashing and the manifold assumption in high-dimensional data, thereby limiting expensive optimization to relevant regions of the data. Training the state-of-the-art learner on our compressed dataset achieves superior accuracy compared to training on randomly selected samples. We also demonstrate that our methods can find manifold landmarks without learning Gaussian processes at all, which leads to an orders-of-magnitude speed-up with only a minimal decrease in accuracy. Finally, we propose feature hashing methods based on locality sensitive hashing that map variable-length trajectories to fixed-length representations for efficient similarity computation in Euclidean space. Our methods accelerate trajectory clustering while achieving accuracy competitive with clustering under more complicated distance functions such as Dynamic Time Warping.
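    As an illustration of the representative-sample step, the sketch below hashes points with a random-hyperplane (SimHash-style) locality sensitive hash and keeps one point per hash bucket; a downstream learner, such as a metric learner, would then be trained only on the retained points. This is a generic construction rather than the method of the thesis, and the number of hash bits, the seed, and the rule of keeping the first point per bucket are illustrative assumptions.

        # Generic LSH-based representative selection (an illustrative sketch, not the
        # thesis method): points sharing a random-hyperplane signature are treated as
        # near-duplicates and only one of them is kept.
        import numpy as np

        def lsh_signatures(X, n_bits=16, seed=0):
            """Signs of random projections: nearby points tend to share signatures."""
            rng = np.random.default_rng(seed)
            planes = rng.normal(size=(X.shape[1], n_bits))
            return X @ planes > 0                    # boolean matrix, one row per point

        def representatives(X, n_bits=16, seed=0):
            """Return the index of one representative point per LSH bucket."""
            sigs = lsh_signatures(X, n_bits, seed)
            buckets = {}
            for i, sig in enumerate(map(tuple, sigs)):
                buckets.setdefault(sig, i)           # keep the first point seen per bucket
            return np.array(sorted(buckets.values()))

        if __name__ == "__main__":
            rng = np.random.default_rng(1)
            X = np.vstack([rng.normal(loc=c, scale=0.2, size=(500, 32)) for c in (-3, 2, 5)])
            reps = representatives(X, n_bits=12)
            print(f"kept {len(reps)} of {len(X)} points as representatives")

    In this sketch, more hash bits yield smaller buckets and therefore more representatives, while fewer bits compress the dataset more aggressively.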
  • Item
    Semi-supervised community detection and clustering
    Ganji, Mohadeseh (2017)
    Data clustering and community detection in networks are two important tasks in machine learning which aim to group data into similar objects or densely connected sub-graphs. However, choosing an appropriate similarity measure to obtain the highest accuracy is always a challenge. Furthermore, in some real-world applications, background knowledge exists about the true or desired assignments or properties of clusters and communities. This side-information may come from experiments, domain knowledge, or user preferences, and additional constraints may be imposed on the system by the natural complexity of the problem or by resource limitations. Community detection (clustering) in the presence of side-information represented as supervision constraints is called semi-supervised community detection (clustering). Finding efficient approaches that take full advantage of this pre-existing information to improve the quality of the solutions is still a challenge. In this thesis, we study community detection and clustering problems with and without incorporating domain knowledge, and propose a similarity measure as well as exact and approximate optimization techniques to improve the accuracy of the results.
    We first address limitations of a popular community detection measure called modularity and propose an improved measure, generalized modularity, which quantifies the similarity of network vertices more realistically and comprehensively by incorporating vertex similarity concepts. The proposed generalized modularity outperforms the state-of-the-art modularity optimization approach in community detection.
    To incorporate background knowledge and user preferences into the community detection process, two semi-supervised approaches are proposed in this thesis. First, we address the modelling flexibility issue in the semi-supervised community detection literature by simultaneously modelling instance-level and community-level constraints. We propose a generic framework based on constraint programming technology which can incorporate a variety of instance-level, community-level, and complex supervision constraints at the same time. The framework also allows local community definitions to be modelled as constraints within a global community detection scheme, and can accommodate a variety of similarity measures and community detection objective functions. Using a high-level modelling language gives the framework the flexibility to apply both complete and incomplete techniques, with the added advantage of proving optimality of the solutions found when complete techniques are used.
    Second, a new algorithm for semi-supervised community detection is proposed, based on discrete Lagrange multipliers, which incorporates pairwise constraints. Unlike most existing semi-supervised community detection schemes, which modify the graph adjacency based on the supervision constraints, the proposed algorithm works with quality measures such as modularity or generalized modularity and guides the community detection process by systematically modifying the similarity matrix only for hard-to-satisfy constraints. The algorithm commits to satisfying (almost) all of the constraints so as to take full advantage of the available supervision, and it outperforms existing semi-supervised community detection algorithms in terms of constraint satisfaction and noise resistance.
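    For reference, the sketch below computes the classic Newman modularity, Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j), that generalized modularity builds on; the vertex-similarity weighting of the generalized measure is not reproduced here, since its exact form is not given in this abstract.

        # Classic modularity of a community assignment (the generalized measure adds a
        # vertex-similarity term that is not reproduced in this sketch).
        import numpy as np

        def modularity(A, communities):
            """A: symmetric adjacency matrix; communities: array of community labels."""
            A = np.asarray(A, dtype=float)
            k = A.sum(axis=1)                        # vertex degrees
            two_m = A.sum()                          # 2m = total degree
            same = np.equal.outer(communities, communities)
            return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

        if __name__ == "__main__":
            # Two triangles joined by a single edge; splitting them scores Q = 5/14 ~ 0.357.
            A = np.zeros((6, 6))
            for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
                A[i, j] = A[j, i] = 1
            print(round(modularity(A, np.array([0, 0, 0, 1, 1, 1])), 3))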
    Another contribution of this thesis is the incorporation of instance-level supervision constraints into the clustering problem. In this regard, a k-means-type semi-supervised clustering algorithm is proposed which takes full advantage of the pre-existing information to achieve high-quality solutions that satisfy the constraints. The proposed algorithm is based on continuous Lagrange multipliers and penalizes constraint violations in a systematic manner, guiding the cluster centroids and cluster assignments towards satisfying all of the constraints. The achievements of this thesis are supported by several experiments and publications.
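    The sketch below illustrates the general penalty idea in a k-means setting: each must-link or cannot-link pair carries a multiplier that grows while the pair remains violated, so assignments are gradually pushed towards satisfying the constraints. It is a generic Lagrangian-style penalty scheme with an assumed update rule and assumed parameter names, not the algorithm proposed in the thesis.

        # Generic penalty-based constrained k-means sketch (illustrative assumptions only).
        import numpy as np

        def constrained_kmeans(X, k, must_link=(), cannot_link=(), iters=20, step=1.0, seed=0):
            rng = np.random.default_rng(seed)
            centers = X[rng.choice(len(X), k, replace=False)]
            lam = {pair: 0.0 for pair in list(must_link) + list(cannot_link)}  # one multiplier per pair
            labels = np.zeros(len(X), dtype=int)
            for _ in range(iters):
                # Assignment step: squared distance plus penalties w.r.t. the partner's label.
                d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
                for i in range(len(X)):
                    cost = d2[i].copy()
                    for (a, b) in must_link:
                        if i in (a, b):
                            j = b if i == a else a
                            cost += lam[(a, b)] * (np.arange(k) != labels[j])
                    for (a, b) in cannot_link:
                        if i in (a, b):
                            j = b if i == a else a
                            cost += lam[(a, b)] * (np.arange(k) == labels[j])
                    labels[i] = int(np.argmin(cost))
                # Update step: recompute centroids, raise multipliers of still-violated constraints.
                for c in range(k):
                    if np.any(labels == c):
                        centers[c] = X[labels == c].mean(axis=0)
                for (a, b) in must_link:
                    lam[(a, b)] += step * (labels[a] != labels[b])
                for (a, b) in cannot_link:
                    lam[(a, b)] += step * (labels[a] == labels[b])
            return labels, centers

        if __name__ == "__main__":
            rng = np.random.default_rng(2)
            X = np.vstack([rng.normal(-1, 0.3, (50, 2)), rng.normal(1, 0.3, (50, 2))])
            labels, _ = constrained_kmeans(X, 2, must_link=[(0, 1)], cannot_link=[(0, 50)])
            print(labels[:5], labels[50:55])

    In this sketch, only constraints that the plain k-means objective would violate accumulate penalty weight, so consistent constraints on well-separated data leave the standard algorithm effectively unchanged.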