School of Mathematics and Statistics - Theses

    Resolving some high dimensional problems in bioinformatics
    Rodrigues, Sabrina (2016)
    "It has been claimed and demonstrated that many of the conclusions drawn from biomedical research are probably false" (Button et al., 2013). "Toth et al. (2010) make a common mistake of seeing structure where none exists" (Chowdhury et al., 2015). "...the performance of their particular set of 150 probe-sets does not stand out compared to that of randomly sampled sets of 150 probe-sets from the same array" (Jacob et al., 2016). These statements have emerged over the last few years in the biomedical and bioinformatics literature. Many studies in bioinformatics fall into the "small n, large p" category, where numerous statistical difficulties emerge. Rather than identifying studies that have false results, this thesis provides a unifying theoretical and computational solution to two high dimensional problems that arise in these studies: the problem of variable selection and dimension reduction, and that of sample size determination. We first consider these problems separately and from a theoretical viewpoint. Next, we show how our methods can be applied to different applications in bioinformatics. For the first problem, we revisit and explain the concepts of central subspaces and conditional independence to develop a novel matrix decomposition, the ST matrix decomposition. This decomposition forms the basis of our two new computational methods, stMVN and stKDR, which involve covariance matrices and covariance operators, respectively. For the second problem, we review the Kullback-Leibler divergence and use independent mixture models to develop a new formula for sample size determination in the high dimensional setting where p >> n. We then combine the solutions to these two problems to demonstrate, through simple simulations, that when sample sizes are determined according to our formula, it is possible to detect the true predictors that are associated with a response variable.
Finally, we show how our unifying solution can be applied in practice to several bioinformatics studies.