School of Mathematics and Statistics - Theses

  • Item
    Resolving some high dimensional problems in bioinformatics
    Rodrigues, Sabrina (2016)
    "It has been claimed and demonstrated that many of the conclusions drawn from biomedical research are probably false" (Button et al., 2013). "Toth et al. (2010) make a common mistake of seeing structure where none exists" (Chowdhury et al., 2015). "...the performance of their particular set of 150 probe-sets does not stand out compared to that of randomly sampled sets of 150 probe-sets from the same array" (Jacob et al., 2016). These are statements that have emerged over the last few years in the biomedical and bioinformatics literature. Many of the studies in bioinformatics fall into the "small n, large p" category, where numerous statistical difficulties emerge. Rather than identifying studies that have false results, this thesis provides a unifying theoretical and computational solution to two high dimensional problems that arise in these studies: the problem of variable selection and dimension reduction, and that of sample size determination. We first consider these problems separately and from a theoretical viewpoint. Next, we show how our methods can be applied to different problems in bioinformatics. For the first problem, we revisit and explain the concepts of central subspaces and conditional independence to develop a novel matrix decomposition - the ST matrix decomposition. This decomposition forms the basis of our two new computational methods, the stMVN and the stKDR methods, which involve covariance matrices and covariance operators, respectively. For the second problem, we review the Kullback-Leibler divergence and use independent mixture models to develop a new formula for sample size determination in the high dimensional setting where p >> n. We then combine the solutions to these two problems to demonstrate, through simple simulations, that when sample sizes are determined according to our formula, it is possible to detect the true predictors that are associated with a response variable.
Finally, we show how our unifying solution can be applied in practice to several bioinformatics studies.
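For readers unfamiliar with the Kullback-Leibler divergence reviewed in the abstract above, its standard textbook definition is given here. The symbols P, Q, p and q are notation chosen for this note, not taken from the thesis, and this is not the thesis's sample size formula, which the abstract does not state.

```latex
% Kullback-Leibler divergence of Q from P, for distributions with
% densities p and q (assumed notation; P must be absolutely
% continuous with respect to Q).
D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \, \log \frac{p(x)}{q(x)} \, dx
% Properties: D_KL(P || Q) >= 0, with equality if and only if P = Q
% almost everywhere; in general D_KL(P || Q) != D_KL(Q || P).
```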
  • Item
    The statistical analysis of high-throughput assays for studying DNA methylation
    Hickey, Peter (2015)
    DNA methylation is an epigenetic modification that plays an important role in X-chromosome inactivation, genomic imprinting and the repression of repetitive elements in the genome. It must be tightly regulated for normal mammalian development, and aberrant DNA methylation is strongly associated with many forms of cancer. This thesis examines the statistical and computational challenges raised by high-throughput assays of DNA methylation, particularly the current gold standard assay of whole-genome bisulfite-sequencing. Using whole-genome bisulfite-sequencing, we can now measure DNA methylation at individual nucleotides across entire genomes. These experiments produce vast amounts of data that require new methods and software to analyse. The first half of the thesis outlines the biological questions of interest in studying DNA methylation, the bioinformatics analysis of these data, and the statistical questions we seek to address. In discussing these bioinformatics challenges, we develop software to facilitate novel analyses of these data. We pay particular attention to analyses of methylation patterns along individual DNA fragments, a novel feature of sequencing-based assays. The second half of the thesis focuses on co-methylation, the spatial dependence of DNA methylation along the genome. We demonstrate that previous analyses of co-methylation have been limited by inadequate data and deficiencies in the applied statistical methods. This motivates a study of co-methylation from 40 whole-genome bisulfite-sequencing samples. These 40 samples represent a diverse range of tissues, from embryonic and induced pluripotent stem cells, through to somatic cells and tumours. Making use of software developed in the first half of the thesis, we explore different measures of co-methylation and relate these to one another. We identify genomic features that influence co-methylation and examine how co-methylation varies between different tissues.
In the final chapter, we develop a framework for simulating whole-genome bisulfite-sequencing data. Simulation software is valuable when developing new analysis methods since it can generate data on which to assess the performance of the method and benchmark it against competing methods. Our simulation model is informed by our analyses of the 40 whole-genome bisulfite-sequencing samples and our study of co-methylation.
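As a rough illustration of what a measure of co-methylation can look like, the sketch below computes the Pearson correlation between the methylation levels of CpGs and their neighbours a fixed number of sites further along the genome. This is only one simple possibility, assumed here for illustration; the abstract does not say which measures the thesis uses, and the function names `pearson` and `lagged_comethylation` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def lagged_comethylation(betas, lag=1):
    """Correlate each CpG's methylation level with that of the CpG
    `lag` sites downstream.

    `betas` is assumed to hold per-CpG methylation proportions in
    [0, 1] for one sample, ordered by genomic position (hypothetical
    input format, not taken from the thesis).
    """
    if lag < 1 or len(betas) - lag < 2:
        raise ValueError("lag must leave at least two CpG pairs")
    return pearson(betas[:-lag], betas[lag:])


# Toy example: blocks of high and low methylation along a region.
toy_betas = [0.90, 0.85, 0.80, 0.10, 0.15, 0.05, 0.90, 0.95]
r = lagged_comethylation(toy_betas, lag=1)
```

A lag counted in CpG sites ignores the physical distance between sites; a measure based on base-pair distance would need the genomic positions as well.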
  • Item
    Empirical Bayes modelling of expression profiles and their associations
    Phipson, Belinda (2013)
    New biotechnology developments such as the microarray and, more recently, next generation sequencing have necessitated the development of new statistical methodologies. These methods are designed to combat unique issues present in the data generated by these technologies. The data provide the perfect environment for information sharing strategies, such as empirical Bayes methods, due to the large numbers of simultaneous tests performed. We explore different estimators of the proportion of true null hypotheses and develop a fast and accurate estimator which is valid for any number of p-values. This estimator is based on local false discovery rates and is used in several of the subsequent sections. Another interest is in developing robust hyper-parameter estimators in an empirical Bayes hierarchical model setting. An estimator for the prior degrees of freedom which is robust to outliers is developed using two different approaches. This makes highly variable genes unlikely to be called significantly differentially expressed, while increasing power to detect differential expression. The second half of the thesis focuses on gaining more information from the log fold changes obtained from microarray and sequencing experiments. More accurate log fold changes are developed for microarray and RNA sequencing data, which provide additional information for ranking top differentially expressed genes. The new measure, called predictive log fold change, arises from the posterior distribution of the log fold changes. The relationship between two gene expression profiles is quantified when the p-values obtained from testing two hypotheses are not independent. This arises when two genotypes are compared to a common control group. The method is based on separating the true biological correlation from the technical correlation of the log fold changes.
The hyperparameters of the prior distribution for the log fold changes need to be estimated in order to get an estimate of the biological correlation. This is possible since we show that the two dependent moderated t statistics have a scaled multivariate t distribution. The methods developed in this thesis are tested using simulations and applied to data sets collected in collaboration with biologists at The Walter and Eliza Hall Institute of Medical Research.
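To make the idea of estimating the proportion of true null hypotheses concrete, here is a minimal sketch of a simple, well-known threshold-based estimator (often attributed to Storey). It is not the local false discovery rate estimator developed in the thesis, which the abstract does not specify; the function name `storey_pi0` and the default threshold `lam=0.5` are assumptions made for this illustration.

```python
def storey_pi0(p_values, lam=0.5):
    """Threshold-based estimate of the proportion of true nulls.

    Under the null, p-values are uniform on [0, 1], so the fraction
    of p-values above `lam`, rescaled by 1 / (1 - lam), approximates
    the null proportion. The result is capped at 1.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    n_above = sum(1 for p in p_values if p > lam)
    return min(1.0, n_above / (m * (1.0 - lam)))


# Toy example: six small p-values and two large ones.
toy_p = [0.001, 0.002, 0.01, 0.02, 0.03, 0.04, 0.6, 0.9]
pi0_hat = storey_pi0(toy_p)  # 2 / (8 * 0.5) = 0.5
```

The estimate depends on the choice of `lam`, and with few p-values it can be very noisy, which is one reason alternative estimators are of interest.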