Computing and Information Systems - Theses

Permanent URI for this collection

Search Results

Now showing 1 - 8 of 8
  • Item
    Thumbnail Image
    Cluster validation and discovery of multiple clusterings
    Lei, Yang ( 2016)
    Cluster analysis is an important unsupervised learning process in data analysis. It aims to group data objects into clusters, so that the data objects in the same group are more similar and the data objects in different groups are more dissimilar. There are many open challenges in this area. In this thesis, we focus on two: discovery of multiple clusterings and cluster validation. Many clustering methods focus on discovering one single ‘best’ solution from the data. However, data can be multi-faceted in nature. Particularly when datasets are large and complex, there may be several useful clusterings existing in the data. In addition, users may be seeking different perspectives on the same dataset, requiring multiple clustering solutions. Multiple clustering analysis has attracted considerable attention in recent years and aims to discover multiple reasonable and distinctive clustering solutions from the data. Many methods have been proposed on this topic and one popular technique is meta-clustering. Meta-clustering explores multiple reasonable and distinctive clusterings by analyzing a large set of base clusterings. However, there may exist poor quality and redundant base clustering which will affect the generation of high quality and diverse clustering views. In addition, the generated clustering views may not all be relevant. It will be time and energy consuming for users to check all the returned solutions. To tackle these problems, we propose a filtering method and a ranking method to achieve higher quality and more distinctive clustering solutions. Cluster validation refers to the procedure of evaluating the quality of clusterings, which is critical for clustering applications. Cluster validity indices (CVIs) are often used to quantify the quality of clusterings. They can be generally classified into two categories: external measures and internal measures, which are distinguished in terms of whether or not external information is used during the validation procedure. In this thesis, we focus on external cluster validity indices. There are many open challenges in this area. We focus two of them: (a) CVIs for fuzzy clusterings and, (b) Bias issues for CVIs. External CVIs are often used to quantify the quality of a clustering by comparing it against the ground truth. Most external CVIs are designed for crisp clusterings (one data object only belongs to one single cluster). How to evaluate the quality of soft clusterings (one data object can belong to more than one cluster) is a challenging problem. One common way to achieve this is by hardening a soft clustering to a crisp clustering and then evaluating it using a crisp CVI. However, hardening may cause information loss. To address this problem, we generalize a class of popular information-theoretic based crisp external CVIs to directly evaluate the quality of soft clusterings, without the need for a hardening step. There is an implicit assumption when using external CVIs for evaluating the quality of a clustering, that is, they work correctly. However, if this assumption does not hold, then misleading results might occur. Thus, identifying and understanding the bias behaviors of external CVIs is crucial. Along these lines, we identify novel bias behaviors of external CVIs and analyze the type of bias both theoretically and empirically.
  • Item
    Thumbnail Image
    Design and adjustment of dependency measures
    Romano, Simone ( 2015)
    Dependency measures are fundamental for a number of important applications in data mining and machine learning. They are ubiquitously used: for feature selection, for clustering comparisons and validation, as splitting criteria in random forest, and to infer biological networks, to list a few. More generally, there are three important applications of dependency measures: detection, quantification, and ranking of dependencies. Dependency measures are estimated on finite data sets and because of this the tasks above become challenging. This thesis proposes a series of contributions to improve performances on each of these three goals. When differentiating between strong and weak relationships using information theoretic measures, the variance plays an important role: the higher the variance, the lower the chance to correctly rank the relationships. In this thesis, we discuss the design of a dependency measure based on the normalized mutual information whose estimation is based on many random discretization grids. This approach allows us to reduce its estimation variance. We show that a small estimation variance for the grid estimator of mutual information if beneficial to achieve higher power when the task is detection of dependencies between variables and when ranking different noisy dependencies. Dependency measure estimates can be high because of chance when the sample size is small, e.g. because of missing values, or when the dependency is estimated between categorical variables with many categories. These biases cause problems when the dependency must have an interpretable quantification and when ranking dependencies for feature selection. In this thesis, we formalize a framework to adjust dependency measures in order to correct for these biases. We apply our adjustments to existing dependency measures between variables and show how to achieve better interpretability in quantification. For example, when a dependency measure is used to quantify the amount of noise on functional dependencies between variables, we experimentally demonstrate that adjusted measures have more interpretable range of variation. Moreover, we demonstrate that our approach is also effective to rank attributes during the splitting procedure in random forests where a dependency measure between categorical variables is employed. Finally, we apply our framework of adjustments to dependency measures between clusterings. In this scenario, we are able to analytically compute our adjustments. We propose a number of adjusted clustering comparison measures which reduce to well known adjusted measures as special cases. This allows us to propose guidelines for the best applications of our measures as well as for existing ones for which guidelines are missing in literature, e.g. for the Adjusted Rand Index (ARI).
  • Item
    Thumbnail Image
    Recommendation systems for travel destination and departure time
    Xue, Yuan ( 2015)
    People travel on a daily basis to various local destinations such as office, home, restaurant, appointment venue, and sightseeing spot. It is vital to most people that we have a positive experience and high efficiency of daily travel. With this observation, my research strives to provide daily-travel related recommendations by solving two optimisation problems, driving destination prediction and departure time recommendation for appointments. Our “SubSyn” destination prediction algorithm, by definition, predicts potential destinations at real-time for drivers on the road. Its applications include recommending sightseeing places, pushing targeted advertisement, and providing early warnings for road congestion. It employs the Bayesian inference framework and second-order Markov model to compute a list of high-probability destinations. The key contributions include real-time processing and the ability to predict destinations with very limited amount of training data. We also look into the problem of privacy protection against such prediction. The “iTIME” departure time recommendation system is a smart calendar that reminds users to depart in order to arrive at appointment venues on time. It also suggests the best transport mode based on users’ travel history and preferences. Currently, it is very inefficient for people to manually and repeatedly check the departure time and compare all transport modes using, for instance, Google Maps. The functionalities of iTIME were realised by machine learning algorithms that learn users’ habits, analyse the importance of appointments and optimal mode of transport, and estimate the start location and travel time. Our field study showed that we can save up to 40% of time by using iTIME. The system can also be extended easily to provide additional functionalities such as clashing appointments detection and appointment scheduling, both taking into account the predicted start location and travel time of future appointments. Both problems can be categorised as recommender systems (or recommendation systems) that provide insightful suggestions in order to improve daily-travel experiences and efficiency.
  • Item
    Thumbnail Image
    Real-time feedback for surgical simulation using data mining
    Zhou, Yun ( 2014)
    Surgical trainees devote years to master the surgical skills required to safely perform surgery. Traditionally, they refine their psychomotor skills by practising on plastic bones or cadavers under the supervision of expert surgeons. Experts guide trainees through surgical procedures while providing feedback on the quality of their procedure. However, there are limitations to this approach, which include a shortage of cadaver bones, limited availability of expert supervision, and the subjective manner of surgical skill assessment. To address these limitations, the introduction of new techniques such as 3D illusion, haptic feedback and augmented reality have significantly improved the realism of surgical simulators. Such simulators have the potential to provide a cost-effective platform, which allows trainees to practice many surgical cases of varying difficulty, and provides the flexibility of practising repeatedly at their own convenience. However, most simulators lack the automated performance assessment and feedback, which limits the applicability of simulators as self-guided training systems. In thesis, we aim to deliver automated performance assessment and feedback in a virtual simulation environment. The automated performance assessment provides information on the quality of surgical result, which is a critical component for a self-guided training platform. A large number of recent studies have focused on scoring the outcome of surgical tasks. However, this score typically based on the result of a surgical task (such as the shape of a surgical end-product) and ignores the rich information provided by real-time performance attributes, such as motion records. Furthermore, since this assessment is delivered at the end of each task, it does not allow any opportunity to identify and address mistakes as they occur. We propose an event-based framework that provides online assessment with different temporal granularities. The evaluations show that the proposed framework provides accurate performance assessment using both motion records and end-product information. Although automated performance assessment provides expertise score to illustrate the surgical perform, a single score has limited utility in improving surgical technique, which can be equally important. Trainees need constructive human understandable feedback to refine their psychomotor skills. To this end, we propose a Random Forest based approach to generate meaningful automated real-time performance feedback. Our evaluation demonstrates it can significantly improve the surgical techniques. However, this random forest based method makes specific assumptions that all drilling movements made by experts are of "expert quality'' and all operations made by trainees are suboptimal. This hampers the model training process and leads to lower accuracy rates. Therefore we propose a pattern-based approach to capture the differences in technique between experts and trainees and deliver real-time feedback to improve performance, while avoiding the assumption of "polarising'' the quality of drill strokes based on expertise. Our evaluation results show that the proposed approach identifies the stage of the surgical procedure correctly and provides constructive feedback to assist surgical trainees in improving their technique. Another challenge for automated performance assessment is hard to extend existing evaluation models to new specimens. In order to train reliable assessment models for new specimens, the classical machine learning approaches require a new set of human expert examples collected from each new specimen. To eliminate this need, we propose a transfer learning framework to adapt a classifier built on a single specimen to multiple specimens. Once a classifier is trained, we translate the new specimens' features to the original feature space, which allows us to carry out performance evaluation on different specimens using the same classifier. In summary, the major contributions of this thesis involve the development of self-guided training platform for delivering automatic assessment and feedback using data mining techniques.
  • Item
    Thumbnail Image
    Distribution enhanced data mining methods for time series forecasting
    Ristanoski, Goce ( 2014)
    Time series forecasting is an exciting research area whose challenges are to discover patterns from data that has been observed over time. Whether it is stock market variables concerning price behaviour, or weather measurements that can be used to issue a timely warning of an approaching cyclone, or radiation level measurements that can prevent a harmful disaster and save hundreds of lives in a power plant, time series find useful applications in many diverse sciences and disciplines. The behaviour of time series is susceptible to changes, often reflected through distribution features such as mean and variance. Though changes in the series are to be expected, they are of mostly a continuous nature, meaning that predictions to a certain point in the future can be made with a high degree of accuracy. Understanding how these changes affect the prediction accuracy is crucial to minimizing the forecasted error. This thesis investigates utilising information about changes in the series, and presents novel modifications in the learning process based on the knowledge gained about these changes. There are four main contributions presented in this thesis, which deliver innovative techniques for time series analysis by incorporating distribution characteristics. In the first part of the thesis we develop a pre-processing algorithm that uses distribution information which can easily accompany any prediction model we choose to work with. Our algorithm performs a fast and efficient reduction of the training samples depending on the change of the mean of the distribution, leaving a set of samples with a larger concentration of useful information and reduced noise. In the next part of our work, we introduce an intelligent group-based error minimization algorithm, which simultaneously achieves reduction of both mean and variance of the forecasted errors, associated with groups of observations with similar distribution. We demonstrate how this sensitive grouping of samples reduces both the error and variance of the error per group, embodied in a modified linear regression algorithm. We then introduce a modified form of Support Vector Regression that detects potential large-error producing samples, and which penalizes the loss for these samples by using a time-sensitive loss, for directly targeting a reduction of the variance in the forecasted errors. This new approach achieves competitive reduction in the error variance and produces more accurate predictions, with performance better than several state of the art methods. Finally, we apply our technical methodologies for the purpose of discrimination aware learning, where we demonstrate how they can be modified and used in another research context.
  • Item
    Thumbnail Image
    Efficient mining of interesting emerging patterns and their effective use in classification
    FAN, HONGJIAN ( 2004-07)
    Knowledge Discovery in Databases (KDD), or Data Mining is used to discover interesting or useful patterns and relationships in data, with an emphasis on large volume of observational databases. Among many other types of information (knowledge) that can be discovered in data, patterns that are expressed in terms of features are popular because they can be understood and used directly by people. The recently proposed Emerging Pattern (EP) is one type of such knowledge patterns. Emerging Patterns are sets of items (conjunctions of attribute values) whose frequency change significantly from one dataset to another. They are useful as a means of discovering distinctions inherently present amongst a collection of datasets and have been shown to be a powerful method for constructing accurate classifiers. (For complete abstract open document)
  • Item
    Thumbnail Image
    Automated analysis of time lapse microscopy images
    KAN, ANDREY ( 2012)
    Cells are the building blocks of life, and time lapse microscopy is a powerful way to study cells. Automated video acquisition and analysis of cells opens unprecedented opportunities, ranging from building novel mathematical models supported by rich data to automated drug screening. Unfortunately, accurate and completely automated analysis of cell images is a difficult task. Therefore human intervention is often required, for example, for tuning of segmentation and tracking algorithms or correcting the results of automated analysis. In this thesis, we aim to reduce the amount of manual work required, while preserving the accuracy of analysis. Two key tasks in automated analysis are cell segmentation and tracking. Segmentation is the process of locating cell outlines in cell images, while tracking refers to establishing cell identities across subsequent video frames. One of the main challenges of automated analysis is the substantial variability in cell appearance and dynamics across different videos and even within a single video. For example, there can be a few rapidly moving cells in the beginning of a video and a large number of cells stuck in a clump by the end of the video. Such variation has resulted in a large variety of cell segmentation and tracking algorithms. There has been a large body of work on automated cell segmentation and tracking. However, many methods make specific assumptions about cell morphology or dynamics, or involve a number of parameters that a user needs to set manually. This hampers the applicability of such methods across different videos. We first develop portable cell semi-segmentation and segmentation algorithms, where portability is achieved by using a flexible cell descriptor function. We then develop a novel cell tracking algorithm that has only one parameter, and hence can be easily adopted to different videos. Furthermore, we present a parameter-free variation of the algorithm. Our evaluation on real cell videos demonstrates that our algorithms are capable of achieving accurate results and outperforming other existing methods. Even the most sophisticated cell tracking algorithms make errors. A user can be required to manually review the tracking results and correct errors. To this end, we propose a semi-automated tracking framework that is capable of identifying video frames that are likely to contain errors. The user can then look only into these frames and not into all video frames. We find that our framework can significantly reduce the amount of manual work required to review and correct tracking results. Furthermore, in different videos, the most accurate results can be obtained by different methods and different parameter settings. It is often not clear which method should be chosen for a particular video. We address this problem with a novel method for ranking cell tracking systems without manual validation. Our method is capable of ranking cell trackers according to their fitness to a particular video, without the need for manual collection of the ground truth tracks. We simulate practical tracking scenarios and confirm the feasibility of our method. Finally, as an example of a biological assay, we consider evaluating the locomotion of Plasmodium parasites (that cause malaria) with application to automated anti-malaria drug screening. We track live parasites in a matrigel medium and develop a numerical description of parasite tracks. Our experiments show that this description captures changes in the locomotion in response to treatment with the toxin Cytochalasin D. Therefore our description can form a basis for automated drug screening, where various treatments are applied to different cell populations by a robot, and the resulting tracks are evaluated quantitatively. In summary, our thesis makes six major contributions highlighted above. These contributions can reduce the amount of manual work in cell image analysis, while achieving highly accurate results.
  • Item
    Thumbnail Image
    Scalable approaches for analysing high density single nucleotide polymorphism array data
    Wong, Gerard Kum Peng ( 2012)
    Prior to making inferences from the raw data produced by these microarrays, several challenges need to be addressed. First, it is important to limit the impact of noise on microarray measurements while maintaining data integrity. An unexplored aspect of noise is the extent of probeset sequence identity in SNP microarrays. Second, microarray-based datasets often have at least two orders of magnitude more probesets than the number of samples they describe. This poses a challenge for traditional statistical tests when used in this context. Third, the number of features in each dataset is large even when sample sizes are small, thus computationally efficient approaches are required to analyse these datasets. Finally, with improving resolution of SNP arrays, there is a need to exploit this improvement in resolution to identify finer-scaled mutations in the human genome. Most existing approaches deal with these challenges at an individual sample level and do not look for consensus change across the population to identify sites of DNA mutation. Other approaches artificially smooth or segment the data to obtain uniform segments of copy number change, and lose possible fine-scaled copy number changes in the process. Others are based on computationally expensive approaches that do not scale well to array resolution and sample size. Our first contribution is a comprehensive survey of the sequence identity of all probesets for all variants of the Affymetrix Genome-Wide Human SNP array. This survey assesses the target uniqueness of every probeset and provides a basis for the development of a set of gold standard labels of copy number change between genders. The derived sets of gold standard labels are a benchmark for assessing the performance of various algorithms in detecting recurrent copy number change. This benchmark is utilised in the evaluation of our second and third contribution. Our second contribution is a statistical approach called Detecting Recurrent Copy Number Changes Using Rank Order Statistics (DRECS), which is designed to identify regions of consensus copy number change across multiple samples in SNP array datasets. Through the use of rank-based statistics, DRECS efficiently draws on the statistical power of using multiple samples to identify fine-scaled copy number changes down to the width of a single probe in a computationally efficient way. Our third contribution is called Sum of Ranks Exact Test (SoRET), a non-parametric extension of DRECS. SoRET addresses SNP datasets with small sample sizes and makes no assumptions about the distribution from which the data was sampled. Its performance in terms of Type I and Type II errors is superior to competitive parametric and non-parametric statistical tests for small sample sizes. Our fourth contribution is a feature set reduction approach called FSR. FSR enables existing filter-based feature selection approaches to handle large dimensional microarray-type datasets by pruning irrelevant and redundant features. A novel scoring measure is developed to assess the strength of each feature in terms of sample class discrimination. FSR uses measures of entropy to efficiently gauge the contribution of higher order feature patterns to avoid combinatorial explosions in assessing the utility of features. In our tested datasets, classifiers trained on features selected from FSR-reduced feature sets have shown notably better predictive accuracy than classifiers trained on features selected from complete feature sets.