Computing and Information Systems - Theses
Permanent URI for this collection
Now showing 1 - 2 of 2
ItemSimilarity analysis with advanced relationships on big dataHuang, Jin ( 2015)Similarity analytic techniques such as distance based joins and regularized learning models are critical tools employed in numerous data mining and machine learning tasks. We focus on two typical such techniques in the context of large scale data and distributed clusters. Advanced distance metrics such as the Earth Mover's Distance (EMD) are usually employed to capture the similarity between data dimensions. The high computational cost of EMD calls for a distributed solution, yet it is difficult to achieve a balanced workloads given the skewed distribution of the EMDs. We propose efficient bounding techniques and effective workload scheduling strategies on the Hadoop platform to design a scalable solution, named HEADS-Join. We investigate both the range joins and the top-k joins, and explore different computation paradigms including MapReduce, BSP, and Spark. We conduct comprehensive experiments and confirm that the proposed techniques achieve an order of magnitude speedup over the state-of-the-art MapReduce join algorithms. The hypergraph model is demonstrated to achieve excellent effectiveness in a wide range of applications where high-order relationships are of interest. When processing a large scale hypergraph, the straightforward approach is to convert it to a graph and reuse the distributed graph frameworks. However, such an approach significantly increases the problem size, incurs excessive replicas due to partitioning, and renders it extremely difficult to achieve a balanced workloads. We propose a novel scalable framework, named HyperX, to directly operate on a distributed hypergraph representation and minimize the numbers of replicas while still maintain a great workload balance among the distributed machines. We closely investigate an optimization problem of partitioning a hypergraph in the context of distributed computation. With extensive experiments, we confirm that HyperX achieve an order of magnitude improvement over the graph conversion approach in terms of the execution time, network communication, and memory consumption.
ItemTowards high-dimensional classification using Michigan style generic based machine learningABEDINI, MANI ( 2013)High-dimensional classification problems arise frequently in many areas of science and engineering. Examples include: disease classification based on gene expression profiles, document classification, image recognition and fraud detection. Machine learning algorithms are widely used to tackle such problems. Typically, statistical techniques or artificial intelligence methods are used to build models that can learn from the data and subsequently new observations can be classified as belonging to a particular category of the data set. Genetic-based machine learning (GBML), which combines evolutionary algorithms and reinforcement learning, has been successful applied in a wide range of complex problem solving and classification tasks. The eXtended Classifier System (XCS), a well-known GMBL technique, evolves a population of classifiers in the form of condition-action rules that can be used for classification problems. A key step in XCS’s iterative learning process is the rule discovery component which creates new classifiers to be added to the bounded population. Unfortunately, population-based approaches are inherently slow when faced with large-scale problems. This may in part be attributed to the additional cost associated with maintaining relatively large populations of classifiers, often encapsulating multiple features. Consequently, few studies have examined the application of XCS in high-dimensional cases. The over-arching aim of this thesis is to develop new GBML classification models, based on an underlying XCS architecture, for high-dimensional classification problems. The objective it to improve the performance measured in terms of accuracy, population diversity and execution costs. To do this, three alternative approaches have been proposed and evaluated: In the first approach, we use “feature quality” to guide the XCS rule discovery process. Here, a pre-processing phase is used to extract feature quality information. Subsequently, the feature quality information is used to bias the evolutionary operators via a “gene mask”. Detailed numerical simulation experiments encapsulating a number of alternative feature ranking methods, parameter setting and benchmark data sets suggest that our proposed model can improve the classification performance of data sets with large numbers of features. In the second approach, a hybrid multi-population ensemble classifier inspired by co-evolutionary algorithms and ensemble learning is used to improve the baseline-XCS performance. In this model, separate sub-populations are evolved in parallel using a binary gene mask to guide the evolutionary operators. Sub-population outputs are weighted in a typical ensemble style to produce an output. As is typical in parallel evolutionary models of this form, we investigate alternative “migration” methods between sub-populations as well as specific mechanisms to adapt the gene mask within a sub-population. Numerical simulation results have shown that the multi-population model has superior performance to the single population model. Significantly, by combining feature quality information and the hybrid co-evolutionary XCS, superior performance can be achieved when measured across the benchmark data sets. The focus of the third approach is different from the previous approaches, in that emphasis is placed more on investigating techniques to reduce execution time of the proposed GMBL models for high-dimensional complex classification tasks using alternative parallel and distributed architectures. We start by comparing deployment techniques and the mapping of the multi-population GBML models using parallel or distributed programming libraries, before listing advantages and disadvantages of each approach. This section concludes with a preliminary study investigating the use of cloud computing infrastructures to run our proposed model. Large-scale classification problems pose many challenges for traditional machine learning algorithms. In this thesis, we have proposed novel extensions to XCS to meet such challenges. Our contributions are based on the development of informed rule discovering mechanisms that employ gene masks, co-evolving sub-populations and parallel and distributed architectures. The experimental results have demonstrated that the proposed models have better performance in terms of accuracy, execution costs and promoting population diversity across the data sets considered – benchmark low dimensional data sets; synthetic data sets and high-dimensional gene expression profile data sets. There are further opportunities to extend the proposed models in interesting ways, including: examining feature interactions more closely and exploring the effectiveness of alternative evolutionary and statistical models.