Computing and Information Systems

With increasing volumes of data, it is imperative that data analysis can appropriately scale. However, many common machine learning algorithms, e.g., metric learning, manifold landmark learning, and processing trajectories, suffer poor computational complexity in the size of training data. In this thesis, we propose generic methods for scaling up learning algorithms by utilizing locality sensitive hashing. First, finding representative samples utilizing locality sensitive hashing is proposed. The usefulness of these samples is demonstrated on large-scale supervised metric learning. Our methods achieve quadratic speed up with only minimal decrease in accuracy. Second, representative samples are leveraged for adaptive minibatch selection for fitting Gaussian processes for landmarking manifolds. Our methods exploit the compatibility of locality sensitive hashing and the manifold assumption in high-dimensional data, thereby limiting expensive optimization to relevant regions of the data. Training the state-of-the-art learner with our compressed dataset achieves superior accuracy compared to training with randomly selected samples. We also demonstrate that our methods can be used to find manifold landmarks without learning Gaussian processes at all, which leads to orders-of-magnitude speed up with only minimal decrease in accuracy. And finally, we propose locality sensitive hashing based feature hashing methods which map variable length trajectories to constant length trajectories for efficient similarity computation in Euclidean space. Our methods can accelerate trajectory clustering while achieving competitive accuracy in comparison to clustering using more complicated distance function, such as Dynamic Time Warping.

Data clustering and community detection in networks are two important tasks in machine learning which aim to group the data into similar objects or densely connected sub-graphs. However, applying an appropriate similarity measure to obtain the highest accuracy is always a challenge. Furthermore, in some real- world applications, some background knowledge and information exists about the true or desired assignments or properties of clusters and communities. The side-information could be obtained by experiments, domain knowledge or user preferences in different applications. Some constraints may also be imposed to the system due to natural complexity of the problem or resource limitations. Community detection (clustering) in the presence of side-information represented as supervision constraints is called semi-supervised community detection (clustering). However, finding efficient approaches to take the most advantage of this pre-existing information to improve quality of the solutions is still a challenge. In this thesis, we study community detection and clustering problems with and without incorporating domain knowledge for which we propose a similarity measure and exact and approximate optimization techniques to improve the accuracy of the results. In this thesis, we address limitations of a popular community detection measure called modularity and propose an improved measure called generalized modularity which quantifies similarity of network vertices more realistically and comprehensively by incorporating vertex similarity concepts. The pro- posed generalized modularity outperforms the state of the art modularity optimization approach in community detection. In addition, to incorporate background knowledge and user preferences into community detection process, two semi-supervised approaches are proposed in this thesis: First we address the modelling flexibility issue in the literature of semi- supervised community detection to simultaneously model instance level and community level constraints. We propose a generic framework using constraint programming technology which enables incorporating a variety of instance level, community level and complex supervision constraints at the same time. The framework also enables modelling local community definitions as constraints to a global community detection scheme and is able to incorporate a variety of similarity measures and community detection objective functions. Using a high level modelling language enables the proposed semi-supervised community detection framework to have the flexibility of applying both complete and incomplete techniques and has the advantage of proving optimality of the solutions found when using complete techniques. Second, a new algorithm for semi-supervised community detection is pro- posed based on discrete Lagrange multipliers which incorporates pairwise constraints. Unlike most of the existing semi-supervised community detection schemes that modify the graph adjacency based on the supervision constraints, the pro- posed algorithm works with quality measures such as modularity or generalized modularity and guides the community detection process by systematically modifying the similarity matrix only for hard-to satisfy constraints. The pro- posed algorithm commits to satisfy (almost) all of the constraints to take the most advantage of the existing supervision. It outperforms the existing semi- supervised community detection algorithms in terms of satisfying the supervision constraints and noise resistance. Another contribution of this thesis is to incorporate instance level supervision constraints into clustering problem. In this regard, a k-means type semi- supervised clustering algorithm is proposed which can take the most advantage of the pre-existing information to achieve high quality solutions satisfying the constraints. The proposed algorithm is based on continuous Lagrange multipliers and penalizes the constraint violations in a systematic manner which guides the cluster centroids and cluster assignments towards satisfying all of the constraints. The achievements of this thesis are supported by several experiments and publications.

Computing and Information Systems - Theses

Permanent URI for this collection

Filters

Date

Author

Subject

Type

Settings

Sort By

Results per page

Statistics

Citations

Search Results