|dc.description.abstract||A genome-wide association study (GWAS) aims to find genetic variants that are associated with a trait of interest. GWAS is a typical “great m, small n” problem, in which the number (m) of genetic variants observed in the genome is much larger than the number (n) of individuals in the sample. Performing a successful GWAS is challenging both statistically and computationally. This is due to at least the following complications: confounding caused by stratification in population structure; strong correlations among genetic variants; and the analytical difficulty of model selection in ultra-high-dimensional settings. In this thesis we develop three statistical methods to tackle these complications.
The first method is rank stability selection (RSS), which is developed in Chapter 2. RSS essentially gives a sub-sampling-based, distribution-free implementation of multiple testing, a method widely used in detecting phenotype-associated genetic variants. Classical multiple testing calculates and tests the association with the phenotype for each genetic variant (covariate), mostly based on the asymptotic distribution of the test statistic involved and ignoring possible confounding from stratification in population structure. This may result in inaccurate family-wise control of both type I and type II errors. The RSS method assesses the association significance of each genetic variant based on the stability of its ranking with respect to an association test statistic that is computed for all genetic variants over a number of sub-samples of the data. It has been shown that, under weak regularity assumptions, RSS gives accurate detection of genetic variants significantly associated with the phenotype, and that this detection is robust against population stratification. The association test statistic can be formed using a variety of association measures, such as correlation, mutual information, and measures based on regression models. The sub-sampling involved is similar to data permutation but has better family-wise error control and is computationally more efficient. An important regularity assumption for RSS is that the test statistics for non-associated genetic covariates are exchangeable. This is weaker than the i.i.d. assumption used in classical multiple testing or permutation testing.
The second method, developed and detailed in Chapter 3, is a dimension reduction method that maximally reduces the number of tests in multiple testing without compromising test power, while keeping the family-wise type I error rate (FWER) well controlled. Note that Bonferroni correction is the main method currently used to adjust P-values in multiple testing, and it tends to control the FWER conservatively. Our dimension reduction method takes a very different approach. The method adopts the rationale of independent component analysis (ICA) and assumes that the association effects of all SNPs on the phenotype are realized through the effects of independent components. Each independent component is statistically determined from a haplotype block, i.e. a set of SNPs deemed to fall into a linkage-disequilibrium (LD) region in the genome. In this way, we are able to remove a large amount of redundant information from the data, so that multiple testing based on independent components can achieve the desired power and FWER control. We applied our dimension reduction method to a subset of the iCOGS breast cancer data, in which the individuals all have European ancestry and have observations on 210,935 SNPs. We were able to extract 57,403 independent components (ICs) from the SNPs. We then applied multiple testing to these ICs, which identified 26 additional loci associated with breast cancer beyond those found by standard GWAS and those reported in the literature.
Our third method, detailed in Chapter 4, is motivated by the problem of predicting disease risk from the SNPs that affect it. This requires the development and use of an advanced variable or model selection criterion that is scalable to high-dimensional data. In the presence of millions of predictive genetic variants in GWAS, classical model selection criteria such as AIC and BIC would most likely fail to select the correct model and would tend to overfit. We propose a novel model selection criterion, called the Stochastic Complexity Criterion correction (SCCc), which is derived by applying the minimum description length (MDL) principle. Under ultra-high-dimensional settings, SCCc uses a sparse formulation for the non-zero coefficients of the variables in the model and adds a penalty term to the criterion that equals the code length of the sparse formulation. We have shown that SCCc is model-selection consistent and, by simulation, that it has a stable FWER in finite samples.||en_US