Computing and Information Systems - Theses

Search Results

  • Item
    Digital curation of genetic variation data: defining and improving a methodology
    Smith, Timothy David (2016)
    Databases of genetic variation are important tools for both medical research and clinical practice. They provide an insight into the frequency of variations occurring within a population and can inform the diagnosis, prognosis and clinical management of a patient. We are at a point where it is easier to generate information than it is to consume it. Genetic variation databases provide an easy means of entry into a highly specific set of information that is of tremendous use. The curation of these databases, that is, the management and transformation of data by a human expert, plays a vital role in this utility. However, there is no standard definition of the role of a curator or the tasks they perform. Descriptions of the curation activity in the literature are fragmented and incomplete, and fail to specify in sufficient detail what curation is or what it should achieve. Contrast this with the field of theory and practice that focusses on the management of digital information: digital curation. Here, much work has been done to define not only the purpose of curation, but also general frameworks for its practice. This research therefore seeks to answer the question of whether a standardised methodology for genetic variation database curation can be developed, and, once that methodology has been developed, whether the curation process and its outputs can be improved through the addition of processes, techniques and theories from the field of digital curation. In the course of this thesis, we undertake a structured examination of how the curation of genetic variation databases is described, perceived and practised, and, drawing on the practice of Soft Systems Methodology, build a set of purposeful activity models to model the activity at various stages.
    From these models, we develop a methodology for curating genetic variation databases for the purpose of providing information that is fit for use by users seeking research-grade data for diagnostic purposes. Building on this methodology, we are able to use the results of our brief investigation of the field of digital curation to propose a number of improvements to our methodology that introduce the notion of care for data that is missing from current genetic variation database curation practices.
  • Item
    Scalable approaches for analysis of human genome-wide expression and genetic variation data
    Abraham, Gad (2012)
    One of the major tasks in bioinformatics and computational biology is prediction of phenotype from molecular data. Predicting phenotypes such as disease opens the way to better diagnostic tools, potentially identifying disease earlier than would be detectable using other methods. Examining molecular signatures rather than clinical phenotypes may help refine disease classification and prediction procedures, since many diseases are known to have multiple molecular subtypes with differing etiology, prognosis, and treatment options. Beyond prediction itself, identifying predictive markers aids our understanding of the biological mechanisms underlying phenotypes such as disease, generating hypotheses that can be tested in the lab. The aims of this thesis are to develop effective and efficient computational and statistical tools for analysing large-scale gene expression and genetic datasets, with an emphasis on predictive models. Key challenges include the high dimensionality of the data, which has important statistical and computational implications; noisy data due to measurement error and stochasticity of the underlying biology; and maintaining biological interpretability without sacrificing predictive performance. We begin by examining the problem of predicting breast cancer metastasis and relapse from gene expression data, and present an alternative approach based on gene set statistics. Second, we address the problem of analysing large human case/control genetic (single nucleotide polymorphism) data, and present an efficient and scalable algorithm for fitting sparse models to large datasets. Third, we apply sparse models to genetic case/control datasets from eight complex human diseases, evaluating how well each one can be predicted from genotype. Fourth, we apply sparse (lasso) models to a multi-omic dataset consisting of genetic variation, gene expression, and serum metabolites, for reconstruction of genetic regulatory networks.
    Finally, we propose a novel multi-task statistical approach, intended for modelling multiple correlated phenotypes. In summary, this thesis discusses a range of predictive models and applies them to a wide range of problems, including gene expression, genetic, and multi-omic datasets. We demonstrate that such models, and particularly sparse models, are computationally feasible and can scale to large datasets, provide increased insight into the biological causes of disease, and for some diseases have high predictive performance, allowing high-confidence disease diagnosis to be made based on genetic data.