School of Mathematics and Statistics - Theses

Search Results

Now showing 1 - 3 of 3
  • Item
    Resolving some high dimensional problems in bioinformatics
    Rodrigues, Sabrina (2016)
    "It has been claimed and demonstrated that many of the conclusions drawn from biomedical research are probably false" (Button et al, 2013). ``Toth et al. (2010) make a common mistake of seeing structure where none exists" Chowdhury et al. (2015). "...the performance of their particular set of 150 probe-sets does not stand out compared to that of randomly sampled sets of 150 probe-sets from the same array” Jacob et al. (2016). These are statements that have emerged over the last few years in the biomedical and bioinformatics literature. Many of the studies in bioinformatics fall into the “small n, large p” category, where numerous statistical difficulties emerge. Rather than identifying studies that have false results, this thesis provides a unifying, theoretical and computational solution to two high dimensional problems that arise in these studies: the problem of variable selection-dimension reduction; and, that of sample size determination. We first consider these problems separately and from a theoretical viewpoint. Next we show how our methods can be applied to different applications in bioinformatics. For the first problem, we revisit and explain the concepts of central subspaces and conditional independence to develop a novel matrix decomposition - the ST matrix decomposition. This decomposition forms the basis of our two new computational methods that involve covariance matrices and covariance operators, the stMVN and the stKDR methods, respectively. For the second problem, we review the Kullback-Leibler divergence and use independent mixture models to develop a new formula for sample size determination for the high dimensional setting where p>>n. We then combine the solutions to these two problems to demonstrate through simple simulations, that when sample sizes are determined according to our formula, it is possible to detect the true predictors that are associated with a response variable. Finally, we show how our unifying solution can be applied in practice to several bioinformatics studies.
  • Item
    Transforms and truncations of time series
    Beaumont, Adrian N. (2015)
    A time series is a collection of random variables indexed by the order in which they are observed in time; examples include monthly Australian retail sales and quarterly GDP data. Forecasting of time series is generally considered much more important than fitting. Models based on exponential smoothing methods have been found to perform well on time series. Chapter 2 describes the estimation and forecasting procedures for additive forms of time series models, including the local level model, local trend model, damped trend model, and their seasonal equivalents. This chapter also briefly discusses some other time series methods, and introduces the M3-competition data that is used extensively in this thesis. Models that include multiplicative components are considered in Chapter 3, increasing the total number of possible models from 6 to 30. While multiplicative models are often better than purely additive models, model selection over all combinations of multiplicative and additive models is found to be statistically no better than selecting among the purely additive models alone; the selection methods are hampered by the large number of possible models. In this thesis, transforms and truncations are used with exponential smoothing in the quest for better forecasts of time series. Two types of transforms are explored: those applied directly to a time series, and those applied indirectly, to the prediction errors. The various transforms are tested on a large number of time series from the M3-competition data, and analysis of variance (ANOVA) is applied to the results. We find that forecasts from the non-transformed series are significantly worse than those from some transforms on the monthly data, and on a distribution-based performance measure for both the annual and quarterly data. To understand why the transforms perform as they do, a simulation study was carried out, using simulations from a paper on outliers. Three types of simulations were used: a Level Shift permanently shifts the series to a new level; an Additive Outlier increases the series for only one time period; and a Transitory Change gradually reverts the series to the old level after the jump point. The non-transformed series were significantly worse than some transforms on some simulation types. Truncations are applied so that it is impossible to obtain an observation below zero in a time series that must remain positive. There are two types of truncations: those applied only to the forecasts, and those applied to both the fits and the forecasts. Using the same methods as for the transforms, we found that the truncations worked better when applied only to the forecasts, but that the non-truncated model was never significantly worse than any truncation. Chapter 7 combines transforms with truncations. We find that applying the heteroscedastic state space transform with a truncated normal distribution significantly improves forecasts over the non-transformed results. The final chapter of this thesis investigates how various properties of a time series affect forecasting performance. Of particular interest is the finding that a measure commonly used to assess prediction performance is flawed.
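    For readers unfamiliar with the additive local level model mentioned above, the sketch below implements simple exponential smoothing from scratch. The smoothing parameter, the toy series and the zero-truncation of the forecasts are hypothetical illustrations, not the estimation procedure or the truncated-normal truncations studied in the thesis.

      # Minimal sketch of the additive local level model (simple exponential smoothing)
      import numpy as np

      def ses_forecast(y, alpha, h=1):
          """level_t = alpha * y_t + (1 - alpha) * level_{t-1}; flat h-step forecast."""
          level = y[0]                     # initialise the level at the first observation
          one_step = []                    # one-step-ahead forecasts of y[1], y[2], ...
          for obs in y[1:]:
              one_step.append(level)       # forecast made before seeing obs
              level = alpha * obs + (1 - alpha) * level
          return np.array(one_step), np.full(h, level)

      y = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0, 136.0, 119.0])
      fitted, fc = ses_forecast(y, alpha=0.3, h=4)
      fc = np.maximum(fc, 0.0)             # crude forecast-only truncation at zero (illustrative)
      print("one-step-ahead forecasts:", np.round(fitted, 1))
      print("4-step-ahead forecasts:  ", np.round(fc, 1))

    The local trend, damped trend and seasonal models extend this recursion with additional state components, which is what raises the count of candidate models once multiplicative variants are also allowed.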
  • Item
    The statistical analysis of high-throughput assays for studying DNA methylation
    Hickey, Peter (2015)
    DNA methylation is an epigenetic modification that plays an important role in X-chromosome inactivation, genomic imprinting and the repression of repetitive elements in the genome. It must be tightly regulated for normal mammalian development and aberrant DNA methylation is strongly associated with many forms of cancer. This thesis examines the statistical and computational challenges raised by high-throughput assays of DNA methylation, particularly the current gold standard assay of whole-genome bisulfite-sequencing. Using whole-genome bisulfite-sequencing, we can now measure DNA methylation at individual nucleotides across entire genomes. These experiments produce vast amounts of data that require new methods and software to analyse. The first half of the thesis outlines the biological questions of interest in studying DNA methylation, the bioinformatics analysis of these data, and the statistical questions we seek to address. In discussing these bioinformatics challenges, we develop software to facilitate novel analyses of these data. We pay particular attention to analyses of methylation patterns along individual DNA fragments, a novel feature of sequencing-based assays. The second half of the thesis focuses on co-methylation, the spatial dependence of DNA methylation along the genome. We demonstrate that previous analyses of co-methylation have been limited by inadequate data and deficiencies in the applied statistical methods. This motivates a study of co-methylation from 40 whole-genome bisulfite-sequencing samples. These 40 samples represent a diverse range of tissues, from embryonic and induced pluripotent stem cells, through to somatic cells and tumours. Making use of software developed in the first half of the thesis, we explore different measures of co-methylation and relate these to one another. We identify genomic features that influence co-methylation and how it varies between different tissues. In the final chapter, we develop a framework for simulating whole-genome bisulfite-sequencing data. Simulation software is valuable when developing new analysis methods since it can generate data on which to assess the performance of the method and benchmark it against competing methods. Our simulation model is informed by our analyses of the 40 whole-genome bisulfite-sequencing samples and our study of co-methylation.
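    As a concrete illustration of the quantities discussed above, the sketch below computes per-CpG methylation levels from read counts and a simple within-fragment agreement measure between adjacent CpGs. The counts and the agreement and correlation summaries are made-up illustrations of the general idea, not the thesis's software or its co-methylation measures.

      # Illustrative bisulfite-sequencing summaries (hypothetical counts and calls)
      import numpy as np

      # Hypothetical read counts at 6 CpG sites: methylated (M) and unmethylated (U)
      M = np.array([18, 20,  3,  2, 15, 17])
      U = np.array([ 2,  1, 17, 19,  5,  3])
      beta = M / (M + U)                   # methylation level per CpG
      print("methylation levels:", np.round(beta, 2))

      # Hypothetical per-read calls at two adjacent CpGs on the same fragments
      # (1 = methylated, 0 = unmethylated); rows are reads, columns are the two CpGs
      pairs = np.array([[1, 1], [1, 1], [0, 0], [1, 1], [0, 1], [1, 1], [0, 0]])
      concordance = np.mean(pairs[:, 0] == pairs[:, 1])
      corr = np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1]
      print("within-fragment concordance:", round(concordance, 2))
      print("correlation of adjacent CpG calls:", round(float(corr), 2))

    Measures of this general kind, computed genome-wide and across many samples, are what a study of co-methylation compares across tissues and genomic features.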