Computing and Information Systems - Theses

    Scalable approaches for analysis of human genome-wide expression and genetic variation data
    ABRAHAM, GAD (2012)
    One of the major tasks in bioinformatics and computational biology is the prediction of phenotype from molecular data. Predicting phenotypes such as disease opens the way to better diagnostic tools, potentially identifying disease earlier than would be detectable using other methods. Examining molecular signatures rather than clinical phenotypes may help refine disease classification and prediction procedures, since many diseases are known to have multiple molecular subtypes with differing etiology, prognosis, and treatment options. Beyond prediction itself, identifying predictive markers aids our understanding of the biological mechanisms underlying phenotypes such as disease, generating hypotheses that can be tested in the lab. The aims of this thesis are to develop effective and efficient computational and statistical tools for analysing large-scale gene expression and genetic datasets, with an emphasis on predictive models. Key challenges include the high dimensionality of the data, which has important statistical and computational implications; noise in the data due to measurement error and the stochasticity of the underlying biology; and maintaining biological interpretability without sacrificing predictive performance. First, we examine the problem of predicting breast cancer metastasis and relapse from gene expression data, and present an alternative approach based on gene set statistics. Second, we address the problem of analysing large human case/control genetic (single nucleotide polymorphism) datasets, and present an efficient and scalable algorithm for fitting sparse models to them. Third, we apply sparse models to genetic case/control datasets from eight complex human diseases, evaluating how well each disease can be predicted from genotype. Fourth, we apply sparse (lasso) models to a multi-omic dataset consisting of genetic variation, gene expression, and serum metabolites, for reconstruction of genetic regulatory networks.
Finally, we propose a novel multi-task statistical approach, intended for modelling multiple correlated phenotypes. In summary, this thesis discusses a range of predictive models and applies them to a wide range of problems, including gene expression, genetic, and multi-omic datasets. We demonstrate that such models, and particularly sparse models, are computationally feasible and can scale to large datasets, provide increased insight into the biological causes of disease, and for some diseases have high predictive performance, allowing high-confidence disease diagnosis to be made based on genetic data.