Computing and Information Systems - Theses

Search Results

Now showing 1 - 2 of 2
  • Item
    A suite of greedy methods for set cover computation
    Lim, Ching Lih (2015)
    The Set Cover problem is simple to describe: given a collection of sets, each containing a subset of a universe of elements, the challenge is to find a smallest (by cardinality) sub-collection whose union equals the universe. Since this problem is NP-hard, no efficient exact algorithm is known for non-trivial problem instances, so an approximation algorithm is employed to construct a nearly optimal solution within reasonable resource bounds. One such algorithm is the greedy approach, which repeatedly adds the set with the greatest number of uncovered elements to the solution until every element in the universe is covered. In this thesis we make several contributions to Set Cover and the greedy approach.

    First, the greedy approach can be implemented with a priority queue that tracks set sizes, so that a set with the most uncovered elements can always be found quickly. This implementation is costly in one respect, however: after each set is added, some or all sets in the queue must be updated to their correct sizes. We therefore propose a variant, Lazy-Greedy, that evaluates set sizes lazily, reducing that cost. We show that although the worst-case time complexity is worse, on typical input instances the lazy implementation is faster than the standard eager greedy implementation, while retaining the same approximation ratio as the greedy method.

    Second, to process a large instance on secondary storage efficiently, Lazy-Greedy is re-engineered to break an input into a logarithmic number of buckets of related set sizes and to process each bucket lazily. This version has two innovations: automated management of a large number of buckets, and a fast search for a particular bucket. Although the algorithm handles a number of large instances efficiently, it runs more slowly than one benchmark disk-friendly algorithm that has a weaker approximation ratio.

    Finally, we propose two pre-processing algorithms for the same task, exploiting the fact that some sets in a typical problem instance contain hapax legomena (singleton elements, i.e., elements that occur in exactly one set). These sets must appear in every valid solution, including the optimal one, because their hapax legomena would otherwise remain uncovered. A partial solution built from such sets can be constructed in advance, with the remaining sets and elements passed to a Set Cover algorithm to complete the solution. The in-memory algorithm, which counts element occurrences, produces complete solutions faster than the direct approach of solving without the pre-processing. The disk-oriented algorithm, which transforms an instance into a list of set-element pairs, is slower than a semi-external covering algorithm due to the relatively expensive cost of sorting the list. A minimal sketch of the lazy evaluation and the singleton pre-pass follows.
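    The following is a minimal Python sketch, not the thesis's implementation, of the two ideas described above: a pre-pass that forces sets containing singleton elements into the solution, and lazy re-evaluation of set sizes via a max-heap. The function name greedy_set_cover and the input representation are illustrative assumptions.

        import heapq

        def greedy_set_cover(universe, sets):
            """Greedy set cover with lazy size re-evaluation (Lazy-Greedy style).

            `universe` is a set of elements; `sets` maps a set id to a
            frozenset of elements. Returns a list of chosen set ids.
            """
            # Pre-pass: a set containing a hapax legomenon (an element that
            # occurs in exactly one set) must be in every valid solution.
            occurrences = {}
            for sid, s in sets.items():
                for e in s:
                    occurrences.setdefault(e, []).append(sid)
            solution = {sids[0] for sids in occurrences.values() if len(sids) == 1}
            covered = set().union(*(sets[sid] for sid in solution)) if solution else set()

            # Lazy evaluation: heap keys are stale (negated) set sizes; a set's
            # true gain is recomputed only when it reaches the top of the heap.
            heap = [(-len(s), sid) for sid, s in sets.items() if sid not in solution]
            heapq.heapify(heap)
            while covered != universe and heap:
                _, sid = heapq.heappop(heap)
                gain = len(sets[sid] - covered)      # up-to-date uncovered count
                if gain == 0:
                    continue
                if heap and gain < -heap[0][0]:      # stale entry: refresh and retry
                    heapq.heappush(heap, (-gain, sid))
                    continue
                solution.add(sid)                    # genuinely the best set now
                covered |= sets[sid]
            return list(solution)

    A stale heap key is always an upper bound on a set's true gain, since set sizes only shrink as coverage grows; a popped set whose recomputed gain still beats the next key can therefore be taken safely, which is what lets most queue updates be skipped.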
  • Item
    Highly scalable and effective algorithms for training support vector machines
    WEN, ZEYI (2015)
    Support Vector Machines (SVMs) are among the most popular machine learning models in data mining. When the available labelled data is insufficient, semi-supervised SVM (S^3VM) training is an important technique for making use of unlabelled data in the learning process. However, existing S^3VM training algorithms have two major limitations. First, they are inefficient and do not scale to large datasets, because they perform the expensive supervised SVM training many times. Second, they have no mechanism to control precision/recall individually in the classification problem, or to control overestimated/underestimated errors individually in the regression problem. Yet many applications have a precision/recall or overestimation/underestimation preference. As mining large datasets becomes more common, there is a compelling need for an efficient and scalable S^3VM training algorithm that supports such a preference. In this thesis, we aim to overcome both limitations of the existing S^3VM training algorithms.

    To address the first limitation (efficiency and scalability), we identify two main operations that limit the efficiency and scalability of supervised SVM training: kernel value computation and the search for the minimum/maximum value. We explore the high performance of Graphics Processing Units (GPUs) and Solid-State Drives (SSDs), and find that both are well suited to accelerating these two operations. The key challenge of our work is to handle large datasets efficiently using GPUs with relatively small memory. We first propose a highly efficient and scalable supervised SVM training algorithm for classification, built on the following key ideas. (i) To avoid holding the whole dataset in GPU memory and repeating kernel value computations, we precompute the kernel values and reuse them. (ii) We store the precomputed kernel values in a high-speed storage framework, consisting of the CPU memory extended by SSDs with the GPU memory as a cache, so that reusing (i.e., reading) kernel values takes much less time than computing them on the fly. (iii) To further improve the efficiency of the SVM training, we apply a number of techniques to the search algorithm to make efficient use of GPU shared memory and registers, propose a caching strategy well suited to the storage framework, and parallelise the tasks on the GPU and the CPU.

    We also develop an efficient and scalable supervised SVM training algorithm for regression. In addition to the techniques above, we propose the following. (i) We exploit the fact that the training examples are doubled in the SVM regression problem, and use a smaller matrix to represent the kernel matrix, saving computation and storage. (ii) We store the kernel values so that they can be read in parallel, exploiting the multi-channel architecture of SSDs. (iii) We design a search algorithm that minimises the search space and avoids branch divergence among GPU threads. A rough sketch of the two-tier kernel storage idea (ii) follows.
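    As a rough illustration of idea (ii) above, the Python sketch below mimics the two-tier storage framework: a memory-mapped file stands in for the SSD-extended CPU memory, and a small least-recently-used dictionary stands in for the GPU-memory cache. The class name KernelRowCache, the file name kernel.dat, and the cache size are illustrative assumptions, not the thesis's implementation.

        from collections import OrderedDict
        import numpy as np

        class KernelRowCache:
            """Two-tier store for precomputed RBF kernel rows (sketch only)."""

            def __init__(self, X, gamma=1.0, path="kernel.dat", cache_rows=64):
                n = X.shape[0]
                self.store = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, n))
                # Precompute every kernel row once, so training reuses
                # (reads) values instead of recomputing them on the fly.
                sq = (X ** 2).sum(axis=1)
                for i in range(n):
                    d2 = sq[i] - 2.0 * (X @ X[i]) + sq   # squared distances to row i
                    self.store[i] = np.exp(-gamma * d2)
                self.store.flush()
                self.cache = OrderedDict()               # row index -> in-memory copy
                self.cache_rows = cache_rows

            def row(self, i):
                if i in self.cache:                      # cache hit: refresh LRU order
                    self.cache.move_to_end(i)
                    return self.cache[i]
                row = np.array(self.store[i])            # "slow" read from backing store
                self.cache[i] = row
                if len(self.cache) > self.cache_rows:
                    self.cache.popitem(last=False)       # evict least recently used row
                return row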
    To overcome the second limitation (no mechanism to specify a preference), we propose an effective S^3VM training algorithm that enables the control of precision/recall. We discuss extending our algorithm to allow the control of overestimated/underestimated errors, and focus our research and implementation on the S^3VM training algorithm that allows a precision/recall preference to be specified. Our key idea is to divide the S^3VM training process into multiple rounds of supervised SVM training, where the classifier trained at each round is calibrated on a subset of the labelled dataset before it is applied to the unlabelled dataset to enlarge the training dataset. Our experimental results show that our S^3VM training algorithm can train SVMs with a precision/recall preference, and that it consistently outperforms the existing algorithm by orders of magnitude. A simplified sketch of the round-based calibration follows.
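    As a simplified stand-in for this round-based training, the Python sketch below calibrates a decision threshold on held-out labelled data to meet a precision target before pseudo-labelling unlabelled points. It uses scikit-learn's SVC and precision_recall_curve; the function name, the calibration split, and all parameters are illustrative assumptions, with labels taken to be 0/1.

        import numpy as np
        from sklearn.svm import SVC
        from sklearn.metrics import precision_recall_curve

        def self_train_with_precision(X_lab, y_lab, X_unlab,
                                      target_precision=0.9, rounds=5, batch=100):
            """Round-based self-training with a precision preference (sketch)."""
            # Hold out a quarter of the labelled data to calibrate the
            # decision threshold at each round.
            n_cal = max(1, len(y_lab) // 4)
            X_cal, y_cal = X_lab[:n_cal], y_lab[:n_cal]
            X_tr, y_tr = X_lab[n_cal:], y_lab[n_cal:]
            X_pool = X_unlab.copy()

            clf = SVC(kernel="rbf")
            for _ in range(rounds):
                clf.fit(X_tr, y_tr)
                # Calibrate: smallest score threshold whose precision on the
                # held-out labelled subset meets the target.
                prec, _, thr = precision_recall_curve(y_cal, clf.decision_function(X_cal))
                ok = np.where(prec[:-1] >= target_precision)[0]
                if len(ok) == 0 or len(X_pool) == 0:
                    break
                t = thr[ok[0]]
                # Pseudo-label the most confident unlabelled points and move
                # them into the training set, enlarging it for the next round.
                scores = clf.decision_function(X_pool)
                conf = np.argsort(-np.abs(scores))[:batch]
                X_tr = np.vstack([X_tr, X_pool[conf]])
                y_tr = np.concatenate([y_tr, (scores[conf] >= t).astype(int)])
                X_pool = np.delete(X_pool, conf, axis=0)
            return clf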