School of Mathematics and Statistics - Research Publications

  • Item
    Population Structure and Cryptic Relatedness in Genetic Association Studies
    Astle, W ; Balding, DJ (INST MATHEMATICAL STATISTICS, 2009-11)
    We review the problem of confounding in genetic association studies, which arises principally because of population structure and cryptic relatedness. Many treatments of the problem consider only a simple "island" model of population structure. We take a broader approach, which views population structure and cryptic relatedness as different aspects of a single confounder: the unobserved pedigree defining the (often distant) relationships among the study subjects. Kinship is therefore a central concept, and we review methods of defining and estimating kinship coefficients, both pedigree-based and marker-based. In this unified framework we review solutions to the problem of population structure, including family-based study designs, genomic control, structured association, regression control, principal components adjustment and linear mixed models. The last solution makes the most explicit use of the kinships among the study subjects, and has an established role in the analysis of animal and plant breeding studies. Recent computational developments mean that analyses of human genetic association data are beginning to benefit from its powerful tests for association, which protect against population structure and cryptic kinship, as well as intermediate levels of confounding by the pedigree.
  • Item
    Limit theorems for sequences of random trees
    Balding, D ; Ferrari, PA ; Fraiman, R ; Sued, M (SPRINGER, 2009-08)
    We consider a random tree and introduce a metric in the space of trees to define the "mean tree" as the tree minimizing the average distance to the random tree. When the resulting metric space is compact we have laws of large numbers and central limit theorems for sequences of independent identically distributed random trees. As an application we propose tests to check if two samples of random trees have the same law.
  • Item
    EFFICIENT POOLING DESIGNS FOR LIBRARY SCREENING
    BRUNO, WJ ; KNILL, E ; BALDING, DJ ; BRUCE, DC ; DOGGETT, NA ; SAWHILL, WW ; STALLINGS, RL ; WHITTAKER, CC ; TORNEY, DC (ACADEMIC PRESS INC ELSEVIER SCIENCE, 1995-03-01)
    We describe efficient methods for screening clone libraries, based on pooling schemes that we call "random k-sets designs." In these designs, the pools in which any clone occurs are equally likely to be any possible selection of k from the v pools. The values of k and v can be chosen to optimize desirable properties. Random k-sets designs have substantial advantages over alternative pooling schemes: they are efficient, flexible, and easy to specify, require fewer pools, and have error-correcting and error-detecting capabilities. In addition, screening can often be achieved in only one pass, thus facilitating automation. For design comparison, we assume a binomial distribution for the number of "positive" clones, with parameters n, the number of clones, and c, the coverage. We propose the expected number of resolved positive clones--clones that are definitely positive based upon the pool assays--as a criterion for the efficiency of a pooling design. We determine the value of k that is optimal, with respect to this criterion, as a function of v, n, and c. We also describe superior k-sets designs called k-sets packing designs. As an illustration, we discuss a robotically implemented design for a 2.5-fold-coverage, human chromosome 16 YAC library of n = 1298 clones. We also estimate the probability that each clone is positive, given the pool-assay data and a model for experimental errors.
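    The pooling idea in this abstract can be illustrated with a minimal sketch, assuming error-free assays. Each clone is placed in k pools chosen uniformly at random from v pools; a clone is then counted as a resolved positive when some positive pool contains it and no other clone that could still be positive. The function names (`random_k_sets_design`, `assay`, `decode`) and the small parameter values are hypothetical and are not taken from the paper; the sketch omits the paper's error model, the optimization of k, and k-sets packing designs.

```python
import random


def random_k_sets_design(n_clones, v_pools, k, seed=0):
    """Hypothetical helper: give each clone k pools drawn uniformly from v pools."""
    rng = random.Random(seed)
    return [frozenset(rng.sample(range(v_pools), k)) for _ in range(n_clones)]


def assay(design, positives):
    """Error-free assay: a pool is positive if it contains any positive clone."""
    positive_pools = set()
    for clone in positives:
        positive_pools |= design[clone]
    return positive_pools


def decode(design, positive_pools):
    """Return (candidates, resolved).

    candidates: clones whose pools are all positive (cannot be ruled out).
    resolved:   candidates that are the only candidate in some positive pool,
                so they must be positive (an assumption-level reading of
                'resolved positive' for the error-free case).
    """
    candidates = {c for c, pools in enumerate(design) if pools <= positive_pools}
    resolved = set()
    for c in candidates:
        others = candidates - {c}
        if any(all(p not in design[o] for o in others) for p in design[c]):
            resolved.add(c)
    return candidates, resolved


# Illustrative values only (assumptions, not the paper's design).
design = random_k_sets_design(n_clones=200, v_pools=40, k=5, seed=1)
true_positives = {3, 77, 150}
candidates, resolved = decode(design, assay(design, true_positives))
print(len(candidates), len(resolved))
```

    Counting the resolved set over many simulated draws of the positive clones would give a Monte Carlo estimate of the efficiency criterion the abstract describes, under the same simplifying assumptions.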
  • Item
    Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation
    Cornuet, J-M ; Santos, F ; Beaumont, MA ; Robert, CP ; Marin, J-M ; Balding, DJ ; Guillemaud, T ; Estoup, A (OXFORD UNIV PRESS, 2008-12-01)
    Genetic data obtained on population samples convey information about their evolutionary history. Inference methods can extract part of this information but they require sophisticated statistical techniques that have been made available to the biologist community (through computer programs) only for simple and standard situations typically involving a small number of samples. We propose here a computer program (DIY ABC) for inference based on approximate Bayesian computation (ABC), in which scenarios can be customized by the user to fit many complex situations involving any number of populations and samples. Such scenarios involve any combination of population divergences, admixtures and population size changes. DIY ABC can be used to compare competing scenarios, estimate parameters for one or more scenarios and compute bias and precision measures for a given scenario and known values of parameters (the current version applies to unlinked microsatellite data). This article describes key methods used in the program and provides its main features. The analysis of one simulated and one real dataset, both with complex evolutionary scenarios, illustrates the main possibilities of DIY ABC. AVAILABILITY: The software DIY ABC is freely available at http://www.montpellier.inra.fr/CBGP/diyabc.
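    For readers unfamiliar with ABC, the following is a minimal sketch of the basic rejection form of the technique on a toy problem. It is not the DIY ABC program or its algorithm: the model (a single allele frequency with a uniform prior, observed as a count among 100 sampled chromosomes), the tolerance, and all function names are assumptions chosen only for illustration.

```python
import random


def abc_rejection(observed, simulate, sample_prior, distance, eps, n_draws, seed=0):
    """Generic rejection ABC: keep prior draws whose simulated data fall within eps."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)           # draw a parameter from the prior
        simulated = simulate(theta, rng)    # simulate a summary statistic
        if distance(simulated, observed) <= eps:
            accepted.append(theta)          # approximate posterior sample
    return accepted


# Toy model (hypothetical): allele frequency theta ~ Uniform(0, 1);
# the summary statistic is the allele count among 100 chromosomes.
def sample_prior(rng):
    return rng.random()


def simulate(theta, rng):
    return sum(rng.random() < theta for _ in range(100))


def distance(a, b):
    return abs(a - b)


posterior = abc_rejection(
    observed=30, simulate=simulate, sample_prior=sample_prior,
    distance=distance, eps=2, n_draws=20000,
)
posterior_mean = sum(posterior) / len(posterior)
print(len(posterior), round(posterior_mean, 3))
```

    Shrinking `eps` makes the accepted sample closer to the exact posterior at the cost of fewer accepted draws, which is the basic trade-off in rejection ABC.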
  • Item
    Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions
    Su, S-Y ; White, J ; Balding, DJ ; Coin, LJM (BMC, 2008-12-01)
    BACKGROUND: The power of haplotype-based methods for association studies, identification of regions under selection, and ancestral inference, is well-established for diploid organisms. For polyploids, however, the difficulty of determining phase has limited such approaches. Polyploidy is common in plants and is also observed in animals. Partial polyploidy is sometimes observed in humans (e.g. trisomy 21; Down's syndrome), and it arises more frequently in some human tissues. Local changes in ploidy, known as copy number variations (CNV), arise throughout the genome. Here we present a method, implemented in the software polyHap, for the inference of haplotype phase and missing observations from polyploid genotypes. PolyHap allows each individual to have a different ploidy, but ploidy cannot vary over the genomic region analysed. It employs a hidden Markov model (HMM) and a sampling algorithm to infer haplotypes jointly in multiple individuals and to obtain a measure of uncertainty in its inferences. RESULTS: In the simulation study, we combine real haplotype data to create artificial diploid, triploid, and tetraploid genotypes, and use these to demonstrate that polyHap performs well, in terms of both switch error rate in recovering phase and imputation error rate for missing genotypes. To our knowledge, there is no comparable software for phasing a large, densely genotyped region of chromosome from triploids and tetraploids, while for diploids we found polyHap to be more accurate than fastPhase. We also compare the results of polyHap to SATlotyper on an experimentally haplotyped tetraploid dataset of 12 SNPs, and show that polyHap is more accurate. CONCLUSION: With the availability of large SNP data in polyploids and CNV regions, we believe that polyHap, our proposed method for inferring haplotypic phase from genotype data, will be useful in enabling researchers analysing such data to exploit the power of haplotype-based analyses.
  • Item
    Optimal pooling designs with error detection
    Balding, DJ ; Torney, DC (ACADEMIC PRESS INC JNL-COMP SUBSCRIPTIONS, 1996-04)
    Consider a collection of objects, some of which may be 'bad', and a test which determines whether or not a given sub-collection contains no bad objects. The non-adaptive pooling (or group testing) problem involves identifying the bad objects using the least number of tests applied in parallel. The 'hypergeometric' case occurs when an upper bound on the number of bad objects is known a priori. Here, practical considerations lead us to impose the additional requirement of a posteriori confirmation that the bound is satisfied. A generalization of the problem in which occasional errors in the test outcomes can occur is also considered. Optimal solutions to the general problem are shown to be equivalent to maximum-size collections of subsets of a finite set satisfying a union condition which generalizes that considered by Erdős et al. Lower bounds on the number of tests required are derived when the number of bad objects is believed to be either 1 or 2. Steiner systems are shown to be optimal solutions in some cases.
  • Item
    Common Genetic Variation Near Melatonin Receptor MTNR1B Contributes to Raised Plasma Glucose and Increased Risk of Type 2 Diabetes Among Indian Asians and European Caucasians
    Chambers, JC ; Zhang, W ; Zabaneh, D ; Sehmi, J ; Jain, P ; McCarthy, MI ; Froguel, P ; Ruokonen, A ; Balding, D ; Jarvelin, M-R ; Scott, J ; Elliott, P ; Kooner, JS (AMER DIABETES ASSOC, 2009-11)
    OBJECTIVE: Fasting plasma glucose and risk of type 2 diabetes are higher among Indian Asians than among European and North American Caucasians. Few studies have investigated genetic factors influencing glucose metabolism among Indian Asians. RESEARCH DESIGN AND METHODS: We carried out genome-wide association studies for fasting glucose in 5,089 nondiabetic Indian Asians genotyped with the Illumina Hap610 BeadChip and 2,385 Indian Asians (698 with type 2 diabetes) genotyped with the Illumina 300 BeadChip. Results were compared with findings in 4,462 European Caucasians. RESULTS: We identified three single nucleotide polymorphisms (SNPs) associated with glucose among Indian Asians at P < 5 × 10⁻⁸, all near melatonin receptor MTNR1B. The most closely associated was rs2166706 (combined P = 2.1 × 10⁻⁹), which is in moderate linkage disequilibrium with rs1387153 (r² = 0.60) and rs10830963 (r² = 0.45), both previously associated with glucose in European Caucasians. Risk allele frequency and effect sizes for rs2166706 were similar among Indian Asians and European Caucasians: frequency 46.2 versus 45.0%, respectively (P = 0.44); effect 0.05 (95% CI 0.01-0.08) versus 0.05 (0.03-0.07 mmol/l), respectively, higher glucose per allele copy (P = 0.84). SNP rs2166706 was associated with type 2 diabetes in Indian Asians (odds ratio 1.21 [95% CI 1.06-1.38] per copy of risk allele; P = 0.006). SNPs at the GCK, GCKR, and G6PC2 loci were also associated with glucose among Indian Asians. Risk allele frequencies of rs1260326 (GCKR) and rs560887 (G6PC2) were higher among Indian Asians compared with European Caucasians. CONCLUSIONS: Common genetic variation near MTNR1B influences blood glucose and risk of type 2 diabetes in Indian Asians. Genetic variation at the MTNR1B, GCK, GCKR, and G6PC2 loci may contribute to abnormal glucose metabolism and related metabolic disturbances among Indian Asians.
  • Item
    Pathway Analysis of GWAS Provides New Insights into Genetic Susceptibility to 3 Inflammatory Diseases
    Eleftherohorinou, H ; Wright, V ; Hoggart, C ; Hartikainen, A-L ; Jarvelin, M-R ; Balding, D ; Coin, L ; Levin, M ; Weedon, MN (PUBLIC LIBRARY SCIENCE, 2009-11-30)
    Although the introduction of genome-wide association studies (GWAS) has greatly increased the number of genes associated with common diseases, only a small proportion of the predicted genetic contribution has so far been elucidated. Studying the cumulative variation of polymorphisms in multiple genes acting in functional pathways may provide a complementary approach to the more common single SNP association approach in understanding genetic determinants of common disease. We developed a novel pathway-based method to assess the combined contribution of multiple genetic variants acting within canonical biological pathways and applied it to data from 14,000 UK individuals with 7 common diseases. We tested inflammatory pathways for association with Crohn's disease (CD), rheumatoid arthritis (RA) and type 1 diabetes (T1D) with 4 non-inflammatory diseases as controls. Using a variable selection algorithm, we identified variants responsible for the pathway association and evaluated their use for disease prediction using a 10-fold cross-validation framework in order to calculate the out-of-sample area under the receiver operating characteristic curve (AUC). The generalisability of these predictive models was tested on an independent birth cohort from Northern Finland. Multiple canonical inflammatory pathways showed highly significant associations (p from 10⁻³ to 10⁻²⁰) with CD, T1D and RA. Variable selection identified on average a set of 205 SNPs (149 genes) for T1D, 350 SNPs (189 genes) for RA and 493 SNPs (277 genes) for CD. The pattern of polymorphisms at these SNPs was found to be highly predictive of T1D (91% AUC) and RA (85% AUC), and weakly predictive of CD (60% AUC). The T1D model (without any parameter refitting) had good predictive ability (79% AUC) in the Finnish cohort. Our analysis suggests that genetic contribution to common inflammatory diseases operates through multiple genes interacting in functional pathways.
  • Item
    Gametic phase estimation over large genomic regions using an adaptive window approach.
    Excoffier, L ; Laval, G ; Balding, D (Springer Science and Business Media LLC, 2003-11)
    The authors present ELB, an easy to programme and computationally fast algorithm for inferring gametic phase in population samples of multilocus genotypes. Phase updates are made on the basis of a window of neighbouring loci, and the window size varies according to the local level of linkage disequilibrium. Thus, ELB is particularly well suited to problems involving many loci and/or relatively large genomic regions, including those with variable recombination rate. The authors have simulated population samples of single nucleotide polymorphism genotypes with varying levels of recombination and marker density, and find that ELB provides better local estimation of gametic phase than the PHASE or HTYPER programs, while its global accuracy is broadly similar. The relative improvement in local accuracy increases both with increasing recombination and with increasing marker density. Short tandem repeat (STR, or microsatellite) simulation studies demonstrate ELB's superiority over PHASE both globally and locally. Missing data are handled by ELB; simulations show that phase recovery is virtually unaffected by up to 2 per cent of missing data, but that phase estimation is noticeably impaired beyond this amount. The authors also applied ELB to datasets obtained from random pairings of 42 human X chromosomes typed at 97 diallelic markers in a 200 kb low-recombination region. Once again, they found ELB to have consistently better local accuracy than PHASE or HTYPER, while its global accuracy was close to the best.
  • Item
    Fregene: Simulation of realistic sequence-level data in populations and ascertained samples
    Chadeau-Hyam, M ; Hoggart, CJ ; O'Reilly, PF ; Whittaker, JC ; De Iorio, M ; Balding, DJ (BIOMED CENTRAL LTD, 2008-09-08)
    BACKGROUND: FREGENE simulates sequence-level data over large genomic regions in large populations. Because, unlike coalescent simulators, it works forwards through time, it allows complex scenarios of selection, demography, and recombination to be modelled simultaneously. Detailed tracking of sites under selection is implemented in FREGENE and provides the opportunity to test theoretical predictions and gain new insights into mechanisms of selection. We describe here the main functionalities of both FREGENE and SAMPLE, a companion program that can replicate association study datasets. RESULTS: We report detailed analyses of six large simulated datasets that we have made publicly available. Three demographic scenarios are modelled: one panmictic, one substructured with migration, and one complex scenario that mimics the principal features of genetic variation in major worldwide human populations. For each scenario there is one neutral simulation, and one with a complex pattern of selection. CONCLUSION: FREGENE and the simulated datasets will be valuable for assessing the validity of models for selection, demography and population genetic parameters, as well as the efficacy of association studies. Its principal advantages are modelling flexibility and computational efficiency. It is open source and object-oriented. As such, it can be customised and the range of models extended.