School of Mathematics and Statistics - Theses

  • Item
    Resolving some high dimensional problems in bioinformatics
    Rodrigues, Sabrina (2016)
    "It has been claimed and demonstrated that many of the conclusions drawn from biomedical research are probably false" (Button et al., 2013). "Toth et al. (2010) make a common mistake of seeing structure where none exists" (Chowdhury et al., 2015). "...the performance of their particular set of 150 probe-sets does not stand out compared to that of randomly sampled sets of 150 probe-sets from the same array" (Jacob et al., 2016). These are statements that have emerged over the last few years in the biomedical and bioinformatics literature. Many of the studies in bioinformatics fall into the "small n, large p" category, where numerous statistical difficulties emerge. Rather than identifying studies that have false results, this thesis provides a unifying theoretical and computational solution to two high dimensional problems that arise in these studies: the problem of variable selection and dimension reduction, and that of sample size determination. We first consider these problems separately and from a theoretical viewpoint. Next, we show how our methods can be used in different bioinformatics applications. For the first problem, we revisit and explain the concepts of central subspaces and conditional independence to develop a novel matrix decomposition, the ST matrix decomposition. This decomposition forms the basis of our two new computational methods, the stMVN and the stKDR methods, which involve covariance matrices and covariance operators, respectively. For the second problem, we review the Kullback-Leibler divergence and use independent mixture models to develop a new formula for sample size determination for the high dimensional setting where p >> n. We then combine the solutions to these two problems to demonstrate, through simple simulations, that when sample sizes are determined according to our formula, it is possible to detect the true predictors that are associated with a response variable.
Finally, we show how our unifying solution can be applied in practice to several bioinformatics studies.
  • Item
    The statistical analysis of high-throughput assays for studying DNA methylation
    Hickey, Peter (2015)
    DNA methylation is an epigenetic modification that plays an important role in X-chromosome inactivation, genomic imprinting and the repression of repetitive elements in the genome. It must be tightly regulated for normal mammalian development, and aberrant DNA methylation is strongly associated with many forms of cancer. This thesis examines the statistical and computational challenges raised by high-throughput assays of DNA methylation, particularly the current gold standard assay of whole-genome bisulfite-sequencing. Using whole-genome bisulfite-sequencing, we can now measure DNA methylation at individual nucleotides across entire genomes. These experiments produce vast amounts of data that require new methods and software to analyse. The first half of the thesis outlines the biological questions of interest in studying DNA methylation, the bioinformatics analysis of these data, and the statistical questions we seek to address. In discussing these bioinformatics challenges, we develop software to facilitate novel analyses of these data. We pay particular attention to analyses of methylation patterns along individual DNA fragments, a novel feature of sequencing-based assays. The second half of the thesis focuses on co-methylation, the spatial dependence of DNA methylation along the genome. We demonstrate that previous analyses of co-methylation have been limited by inadequate data and deficiencies in the applied statistical methods. This motivates a study of co-methylation from 40 whole-genome bisulfite-sequencing samples. These 40 samples represent a diverse range of tissues, from embryonic and induced pluripotent stem cells, through to somatic cells and tumours. Making use of software developed in the first half of the thesis, we explore different measures of co-methylation and relate these to one another. We identify genomic features that influence co-methylation and describe how it varies between different tissues.
In the final chapter, we develop a framework for simulating whole-genome bisulfite-sequencing data. Simulation software is valuable when developing new analysis methods since it can generate data on which to assess the performance of the method and benchmark it against competing methods. Our simulation model is informed by our analyses of the 40 whole-genome bisulfite-sequencing samples and our study of co-methylation.