Computing and Information Systems - Theses
Permanent URI for this collection
Now showing 1 - 1 of 1
ItemTowards high-dimensional classification using Michigan style generic based machine learningABEDINI, MANI ( 2013)High-dimensional classification problems arise frequently in many areas of science and engineering. Examples include: disease classification based on gene expression profiles, document classification, image recognition and fraud detection. Machine learning algorithms are widely used to tackle such problems. Typically, statistical techniques or artificial intelligence methods are used to build models that can learn from the data and subsequently new observations can be classified as belonging to a particular category of the data set. Genetic-based machine learning (GBML), which combines evolutionary algorithms and reinforcement learning, has been successful applied in a wide range of complex problem solving and classification tasks. The eXtended Classifier System (XCS), a well-known GMBL technique, evolves a population of classifiers in the form of condition-action rules that can be used for classification problems. A key step in XCS’s iterative learning process is the rule discovery component which creates new classifiers to be added to the bounded population. Unfortunately, population-based approaches are inherently slow when faced with large-scale problems. This may in part be attributed to the additional cost associated with maintaining relatively large populations of classifiers, often encapsulating multiple features. Consequently, few studies have examined the application of XCS in high-dimensional cases. The over-arching aim of this thesis is to develop new GBML classification models, based on an underlying XCS architecture, for high-dimensional classification problems. The objective it to improve the performance measured in terms of accuracy, population diversity and execution costs. To do this, three alternative approaches have been proposed and evaluated: In the first approach, we use “feature quality” to guide the XCS rule discovery process. Here, a pre-processing phase is used to extract feature quality information. Subsequently, the feature quality information is used to bias the evolutionary operators via a “gene mask”. Detailed numerical simulation experiments encapsulating a number of alternative feature ranking methods, parameter setting and benchmark data sets suggest that our proposed model can improve the classification performance of data sets with large numbers of features. In the second approach, a hybrid multi-population ensemble classifier inspired by co-evolutionary algorithms and ensemble learning is used to improve the baseline-XCS performance. In this model, separate sub-populations are evolved in parallel using a binary gene mask to guide the evolutionary operators. Sub-population outputs are weighted in a typical ensemble style to produce an output. As is typical in parallel evolutionary models of this form, we investigate alternative “migration” methods between sub-populations as well as specific mechanisms to adapt the gene mask within a sub-population. Numerical simulation results have shown that the multi-population model has superior performance to the single population model. Significantly, by combining feature quality information and the hybrid co-evolutionary XCS, superior performance can be achieved when measured across the benchmark data sets. The focus of the third approach is different from the previous approaches, in that emphasis is placed more on investigating techniques to reduce execution time of the proposed GMBL models for high-dimensional complex classification tasks using alternative parallel and distributed architectures. We start by comparing deployment techniques and the mapping of the multi-population GBML models using parallel or distributed programming libraries, before listing advantages and disadvantages of each approach. This section concludes with a preliminary study investigating the use of cloud computing infrastructures to run our proposed model. Large-scale classification problems pose many challenges for traditional machine learning algorithms. In this thesis, we have proposed novel extensions to XCS to meet such challenges. Our contributions are based on the development of informed rule discovering mechanisms that employ gene masks, co-evolving sub-populations and parallel and distributed architectures. The experimental results have demonstrated that the proposed models have better performance in terms of accuracy, execution costs and promoting population diversity across the data sets considered – benchmark low dimensional data sets; synthetic data sets and high-dimensional gene expression profile data sets. There are further opportunities to extend the proposed models in interesting ways, including: examining feature interactions more closely and exploring the effectiveness of alternative evolutionary and statistical models.