Computing and Information Systems - Theses

  • Item
    Building better predictive models for health-related outcomes
    Kankanige, Yamuna (2017)
    Predicting health-related outcomes is important for developing decision support systems that assist clinicians and other healthcare workers who are regularly faced with critical decisions. Such models save time, help to manage healthcare resources and ultimately provide better quality of care for patients. Such systems are now made possible by the complex medical data routinely generated at hospitals and laboratories, and by developments in data mining methods. This thesis focusses on the development of such decision support systems, as well as on techniques for improving the data, such as feature selection and acquisition, that are generically useful for building better prognostic models for predicting health-related outcomes.
    Data mining in healthcare is an interesting and unique domain. The available data is heterogeneous, including demographic and diagnostic information about patients, clinical notes, medical imaging results and whole genome sequence data. Since most data is not collected for research purposes, there can be data quality issues such as missing, ambiguous or erroneous information. In addition, some data might not be available in electronic format, which makes it time consuming to collect. Missing values are a major problem in this domain, and arise not only from data entry or collection issues: some information is simply not available for some records. For example, the pathology test results available for a patient depend on the laboratory tests the clinician ordered for that patient. Another aspect of data mining in healthcare is that models need to be sufficiently transparent for users to trust and use them; which techniques and algorithms can be used therefore depends on how much trust users place in those methods. In particular, it is imperative that analyses of healthcare data generalize.
    The topic of this thesis, building better predictive models for health-related data, can be divided roughly into two parts. The first part investigates data mining techniques for improving the performance of prediction models, especially on healthcare data, which helps to build better prognostic models for health-related outcomes. The second part concerns applications of data mining models to clinical and biomedical data, to provide better health-related outcomes.
    A common situation in classification at test time is that some features of a test case are missing. Since obtaining all missing features is rarely cost effective, or even feasible, identifying and acquiring the features that are most likely to improve prediction accuracy can have significant impact. This challenge arises frequently in health data, where clinicians order only a subset of test panels for a patient at any one time. In this thesis, we propose a confidence-based solution to this generic scenario using random forests: we sequentially suggest the features that are most likely to improve the prediction accuracy of each test instance, using a set of existing training instances which may themselves have missing values (an illustrative sketch follows this abstract).
    Density-based logistic regression is a recently introduced classification technique, successful in real clinical settings, that performs a one-to-one non-linear transformation of the original feature space into another feature space based on density estimations. The new feature space is particularly well suited to learning a logistic regression model, a popular technique for predicting health-related outcomes. Whilst performance gains, good interpretability and time efficiency make density-based logistic regression attractive, its formulation has limitations. As another technique for improving features, we address these limitations of the feature transformation method and propose several new extensions in this thesis (the basic transformation is also sketched below).
    Liver transplants are a common type of organ transplantation, second only to kidney transplants in frequency. The ability to predict organ failure, or primary non-function, at liver transplant decision time facilitates utilization of the scarce resource of donor livers, while ensuring that patients who urgently need a liver transplant are prioritized. An index derived to predict organ failure using donor as well as recipient characteristics, based on local datasets, is of benefit in the Australian context. In a study using real liver transplant data, we propose that by using donor, transplant and recipient characteristics known at the decision time of a transplantation, together with data mining techniques, high accuracy in matching donors and recipients can be achieved, potentially providing better organ survival outcomes.
    Serotyping is a common bacterial typing process in which isolated microorganism samples are grouped according to their distinctive surface structures, called antigens; it is important for public health and epidemiological surveillance. In a study using whole genome sequencing data from four publicly available Streptococcus pneumoniae datasets, we demonstrate that data mining approaches can predict the serotypes of isolates faster and more accurately than the traditional approaches.
    In summary, this thesis focusses on techniques for improving data, such as feature selection, transformation and acquisition, that are generically useful for building better prognostic models for predicting health-related outcomes, as well as on applications of data mining techniques to clinical and biomedical data for improving health-related outcomes.
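    A minimal sketch of the confidence-based feature acquisition idea described in this abstract, assuming scikit-learn's RandomForestClassifier, a simple mean-imputation baseline and hypothetical helper names (confidence, rank_features_to_acquire); this illustrates the general scenario, not the thesis's actual algorithm.

        # Sketch: rank the missing features of one test instance by the expected gain
        # in random-forest prediction confidence if that feature were acquired.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        def confidence(model, x):
            """Highest class probability the forest assigns to a single instance."""
            return model.predict_proba(x.reshape(1, -1)).max()

        def rank_features_to_acquire(model, X_train, x_test, missing_idx, fill_values):
            """Return the indices in missing_idx, best first, by expected confidence gain.

            x_test holds np.nan at the missing positions; fill_values (e.g. training
            means) are used as a provisional imputation so the forest can be applied.
            """
            x_base = np.where(np.isnan(x_test), fill_values, x_test)
            base_conf = confidence(model, x_base)
            gains = {}
            for j in missing_idx:
                observed = X_train[~np.isnan(X_train[:, j]), j]   # plausible values of feature j
                confs = []
                for v in np.unique(observed):
                    x_try = x_base.copy()
                    x_try[j] = v
                    confs.append(confidence(model, x_try))
                gains[j] = np.mean(confs) - base_conf             # expected improvement
            return sorted(gains, key=gains.get, reverse=True)

    In a sequential setting one would acquire the true value of the top-ranked feature, update the test instance, and repeat until confidence stops improving or no affordable features remain.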
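    A rough sketch of a density-based feature transformation in the spirit described above, assuming each feature is mapped one-to-one to a log-ratio of class-conditional kernel density estimates (scipy's gaussian_kde); the thesis's exact formulation and its proposed extensions are not reproduced here.

        # Sketch: transform each feature via class-conditional density estimates,
        # then fit an ordinary logistic regression in the transformed space.
        import numpy as np
        from scipy.stats import gaussian_kde
        from sklearn.linear_model import LogisticRegression

        def density_transform(X_train, y_train, X):
            """Replace each feature value x with log f(x | y=1) - log f(x | y=0)."""
            Z = np.empty_like(X, dtype=float)
            eps = 1e-12                               # guard against zero density estimates
            for j in range(X.shape[1]):
                kde_pos = gaussian_kde(X_train[y_train == 1, j])
                kde_neg = gaussian_kde(X_train[y_train == 0, j])
                Z[:, j] = np.log(kde_pos(X[:, j]) + eps) - np.log(kde_neg(X[:, j]) + eps)
            return Z

        # Usage: transform training and test data the same way, then learn in the new space.
        # Z_train = density_transform(X_train, y_train, X_train)
        # clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)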
  • Item
    Scalable approaches for analysis of human genome-wide expression and genetic variation data
    Abraham, Gad (2012)
    One of the major tasks in bioinformatics and computational biology is the prediction of phenotype from molecular data. Predicting phenotypes such as disease opens the way to better diagnostic tools, potentially identifying disease earlier than would be possible using other methods. Examining molecular signatures rather than clinical phenotypes may also help refine disease classification and prediction, since many diseases are known to have multiple molecular subtypes with differing etiology, prognosis, and treatment options. Beyond prediction itself, identifying predictive markers aids our understanding of the biological mechanisms underlying phenotypes such as disease, generating hypotheses that can be tested in the lab.
    The aims of this thesis are to develop effective and efficient computational and statistical tools for analysing large-scale gene expression and genetic datasets, with an emphasis on predictive models. Key challenges include the high dimensionality of the data, which has important statistical and computational implications; noisy data, due to measurement error and the stochasticity of the underlying biology; and maintaining biological interpretability without sacrificing predictive performance.
    We begin by examining the problem of predicting breast cancer metastasis and relapse from gene expression data, and present an alternative approach based on gene set statistics (a brief sketch of set-level features follows this abstract). Second, we address the problem of analysing large human case/control genetic (single nucleotide polymorphism) data, and present an efficient and scalable algorithm for fitting sparse models to large datasets. Third, we apply sparse models to genetic case/control datasets from eight complex human diseases, evaluating how well each one can be predicted from genotype (an illustrative lasso sketch is also given below). Fourth, we apply sparse lasso methods to a multi-omic dataset consisting of genetic variation, gene expression, and serum metabolites, for reconstruction of genetic regulatory networks. Finally, we propose a novel multi-task statistical approach, intended for modelling multiple correlated phenotypes.
    In summary, this thesis discusses a range of predictive models and applies them to a wide range of problems, spanning gene expression, genetic, and multi-omic datasets. We demonstrate that such models, and particularly sparse models, are computationally feasible and can scale to large datasets, provide increased insight into the biological causes of disease, and for some diseases have high predictive performance, allowing high-confidence disease diagnoses to be made from genetic data.
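    As a loose illustration of using gene set statistics as predictive features, the sketch below summarises each predefined gene set by its per-sample mean expression and feeds the set-level features to a classifier; the set definitions and the choice of summary statistic are assumptions, not the thesis's exact method.

        # Sketch: collapse a samples-by-genes expression matrix to samples-by-gene-sets,
        # using the mean expression of each set as a simple set-level statistic.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def gene_set_features(expr, gene_sets):
            """expr: (n_samples, n_genes) array; gene_sets: list of gene-index lists."""
            return np.column_stack([expr[:, idx].mean(axis=1) for idx in gene_sets])

        # Usage with stand-in names:
        # F_train = gene_set_features(expr_train, gene_sets)
        # clf = LogisticRegression(max_iter=1000).fit(F_train, relapse_labels)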
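    The sparse case/control models can likewise be illustrated with an off-the-shelf L1-penalised ("lasso") logistic regression; the thesis develops its own scalable solver, so the scikit-learn call, the penalty strength and the simulated genotype matrix below are purely illustrative.

        # Sketch: L1-penalised logistic regression on a case/control genotype matrix,
        # driving most SNP coefficients to exactly zero (a sparse model).
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        n_samples, n_snps = 200, 5000                  # stand-in sizes; real data is far larger
        X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)   # 0/1/2 minor-allele counts
        y = rng.integers(0, 2, size=n_samples)                           # case/control labels

        clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
        clf.fit(X, y)

        selected = np.flatnonzero(clf.coef_[0])        # SNPs retained by the sparse model
        print(f"{selected.size} of {n_snps} SNPs have non-zero coefficients")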
  • Item
    Online time series approximation and prediction
    Xu, Zhenghua (2012)
    In recent years there has been rapidly increasing research interest in the management of time series data, due to its importance in a variety of applications such as network traffic management, telecommunications, finance, sensor networks and location-based services. In this thesis, we focus on two important problems in the management of time series data: online time series approximation and online time series prediction.
    Time series approximation can reduce the space and the computational cost of storing and transmitting time series data, and also reduce the workload of data processing. Segmentation is one of the most commonly used methods to meet this requirement. However, while most current segmentation methods aim to minimize the holistic error between the approximation and the original time series, few works try to represent time series as compactly as possible with an error bound guaranteed on each data point. Moreover, in many real-world situations the patterns of a time series do not follow a constant rule, so using only one type of function may not yield the best compaction. Motivated by these observations, we propose an online segmentation algorithm which approximates time series by a set of different types of candidate functions (polynomials of different orders, exponential functions, etc.) and adaptively chooses the most compact one as the pattern of the time series changes (a simplified sketch of the idea follows this abstract). A challenge in this approach is determining the approximation function on the fly ("online"). We therefore further present a novel method to efficiently generate the compact approximation of a time series in an online fashion for several types of candidate functions. This method incrementally narrows the feasible coefficient spaces of the candidate functions in coefficient coordinate systems, so that it can make each segment as long as possible given an error bound on each data point. Extensive experimental results show that our algorithm generates more compact approximations of the time series, with lower average errors, than the state-of-the-art algorithm.
    Time series prediction aims to predict future values of a time series from its previously observed values. In this thesis, we focus on a specific branch of time series prediction: the online locational time series prediction problem, i.e., predicting the final locations of locational time series from their currently observed values, on the fly. This problem is also called the destination prediction problem, and the systems used to solve it are called destination prediction systems. However, most current destination prediction systems are based on the travel histories of specific users, so they are too ad hoc to accurately predict other users' destinations. In our work, we propose the first generic destination prediction solution, which provides accurate destination prediction services to all users based on the given user queries (i.e., the users' current partial travel trajectories) without knowing any travel histories of the users. Extensive experiments validate that the prediction results of our method are more accurate than those of the competing method.
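    A simplified sketch of the error-bounded segmentation idea (a greedy stand-in, not the thesis's coefficient-space narrowing method): each segment is extended point by point while some candidate polynomial still fits every point of the segment within the error bound, and the lowest-order candidate that fits is kept.

        # Sketch: greedy online segmentation with a per-point error bound eps,
        # choosing the most compact polynomial candidate that still fits the segment.
        import numpy as np

        def segment_online(t, x, eps, max_order=3):
            """Return (start, stop, coeffs) triples; stop is exclusive."""
            segments, start = [], 0
            best = np.array([x[0]])                    # constant through the first point
            for end in range(2, len(x) + 1):
                fit = None
                for order in range(max_order + 1):     # try the most compact candidates first
                    if end - start < order + 1:        # fewer points than coefficients
                        continue
                    coeffs = np.polyfit(t[start:end], x[start:end], order)
                    err = np.max(np.abs(np.polyval(coeffs, t[start:end]) - x[start:end]))
                    if err <= eps:
                        fit = coeffs
                        break
                if fit is None:                        # bound violated: close the segment
                    segments.append((start, end - 1, best))
                    start = end - 1
                    best = np.array([x[start]])
                else:
                    best = fit
            segments.append((start, len(x), best))
            return segments

        # Usage: segs = segment_online(np.arange(len(signal), dtype=float), signal, eps=0.5)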
    In recent years, there are rapidly increasing research interests in the management of time series data due to its importance in a variety of applications such as network traffic management, telecommunications, finance, sensor network and location based services. In this thesis, we focus on two important problems of the management of time series data: the online time series approximation problem and the online time series prediction problem. The time series approximation can reduce the space and the computational cost of storing and transmitting time series data, and also reduce the workload of the data processing. Segmentation is one of the most commonly used methods to meet this requirement. However, while most of the current segmentation methods aim to minimize the holistic error between the approximation and the original time series, few works try to represent time series as compact as possible with an error bound guarantee on each data point. Moreover, in many real world situations, the patterns of the time series do not follow a constant rule such that using only one type of functions may not yield the best compaction. Motivated by these observations, we propose an online segmentation algorithm which approximates time series by a set of different types of candidate functions (poly nomials of different orders, exponential functions, etc.) and adaptively chooses the most compact one as the pattern of the time series changes. A challenge in this approach is to determine the approximation function on the fly (“online”). Thereby, we further present a novel method to efficiently generate the compact approximation of a time series in an online fashion for several types of candidate functions. This method incrementally narrows the feasible coefficient spaces of candidate functions in coefficient coordinate systems such that it can make each segment as long as possible given an error bound on each data point. Extensive experimental results show that our algorithm generates more compact approximations of the time series with lower average errors than the state-of-the-art algorithm. The time series prediction aims to predict future values of the time series according to their previously observed values. In this thesis, we focus our work on a specific branch of the time series prediction: the online locational time series prediction problem, i.e, predicting the final locations of the locational time series according to their current observed locational time series values on the fly. This problem is also called the destination prediction problem and the systems used to solve this problem are the called destination prediction systems. However, most of current destination prediction systems are based on the traveling histories of specific users so they are too ad hoc to accurately predict other users’ destination. In our work, we propose the first generic destination prediction solution to provide the accurate destination prediction services to all users based on the given user queries (i.e., the users’ current partial traveling trajectories) without knowing any traveling histories of the users. Extensive experiments validate that the prediction result of our method is more accurate than that of the competing method.