School of Mathematics and Statistics - Research Publications

  • Item
    A founder event causing a dominant childhood epilepsy survives 800 years through weak selective pressure
    Grinton, BE ; Robertson, E ; Fearnley, LG ; Scheffer, IE ; Marson, AG ; O'Brien, TJ ; Pickrell, WO ; Rees, M ; Sisodiya, SM ; Balding, DJ ; Bennett, MF ; Bahlo, M ; Berkovic, SF ; Oliver, KL (CELL PRESS, 2022-11-03)
    Genetic epilepsy with febrile seizures plus (GEFS+) is an autosomal dominant familial epilepsy syndrome characterized by distinctive phenotypic heterogeneity within families. The SCN1B c.363C>G (p.Cys121Trp) variant has been identified in independent, multi-generational families with GEFS+. Although the variant is present in population databases (at very low frequency), there is strong clinical, genetic, and functional evidence to support pathogenicity. Recurrent variants may be due to a founder event in which the variant has been inherited from a common ancestor. Here, we report evidence of a single founder event giving rise to the SCN1B c.363C>G variant in 14 independent families with epilepsy. A common haplotype was observed in all families, and the age of the most recent common ancestor was estimated to be approximately 800 years ago. Analysis of UK Biobank whole-exome-sequencing data identified 74 individuals with the same variant. All individuals carried haplotypes matching the epilepsy-affected families, suggesting all instances of the variant derive from a single mutational event. This unusual finding of a variant causing an autosomal dominant, early-onset disease in an outbred population that has persisted over many generations can be attributed to the relatively mild phenotype in most carriers and incomplete penetrance. Founder events are well established in autosomal recessive and late-onset disorders but are rarely observed in early-onset, autosomal dominant diseases. These findings suggest variants present in the population at low frequencies should be considered potentially pathogenic in mild phenotypes with incomplete penetrance and may be more important contributors to the genetic landscape than previously thought.
  • Item
    LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics
    Berrandou, T-E ; Balding, D ; Speed, D (CELL PRESS, 2023-01-05)
    We present LDAK-GBAT, a tool for gene-based association testing using summary statistics from genome-wide association studies that is computationally efficient, produces well-calibrated p values, and is significantly more powerful than existing tools. LDAK-GBAT takes approximately 30 min to analyze imputed data (2.9M common, genic SNPs), requiring less than 10 Gb memory. It shows good control of type 1 error given an appropriate reference panel. Across 109 phenotypes (82 from the UK Biobank, 18 from the Million Veteran Program, and nine from the Psychiatric Genetics Consortium), LDAK-GBAT finds on average 19% (SE: 1%) more significant genes than the existing tool sumFREGAT-ACAT, with even greater gains in comparison with MAGMA, GCTA-fastBAT, sumFREGAT-SKAT-O, and sumFREGAT-PCA.
  • Item
    Fast and accurate joint inference of coancestry parameters for populations and/or individuals
    Mary-Huard, TM ; Balding, D ; Weir, BS (PUBLIC LIBRARY SCIENCE, 2023-01)
    We introduce a fast, new algorithm for inferring from allele count data the FST parameters describing genetic distances among a set of populations and/or unrelated diploid individuals, and a tree with branch lengths corresponding to FST values. The tree can reflect historical processes of splitting and divergence, but seeks to represent the actual genetic variance as accurately as possible with a tree structure. We generalise two major approaches to defining FST, via correlations and mismatch probabilities of sampled allele pairs, which measure shared and non-shared components of genetic variance. A diploid individual can be treated as a population of two gametes, which allows inference of coancestry coefficients for individuals as well as for populations, or a combination of the two. A simulation study illustrates that our fast method-of-moments estimation of FST values, simultaneously for multiple populations/individuals, gains statistical efficiency over pairwise approaches when the population structure is close to tree-like. We apply our approach to genome-wide genotypes from the 26 worldwide human populations of the 1000 Genomes Project. We first analyse at the population level, then a subset of individuals and in a final analysis we pool individuals from the more homogeneous populations. This flexible analysis approach gives advantages over traditional approaches to population structure/coancestry, including visual and quantitative assessments of long-standing questions about the relative magnitudes of within- and between-population genetic differences.
  • Item
    Including diverse and admixed populations in genetic epidemiology research
    Caliebe, A ; Tekola-Ayele, F ; Darst, BF ; Wang, X ; Song, YE ; Gui, J ; Sebro, RA ; Balding, DJ ; Saad, M ; Dube, M-P (WILEY, 2022-10)
    The inclusion of ancestrally diverse participants in genetic studies can lead to new discoveries and is important to ensure equitable health care benefit from research advances. Here, members of the Ethical, Legal, Social, Implications (ELSI) committee of the International Genetic Epidemiology Society (IGES) offer perspectives on methods and analysis tools for the conduct of inclusive genetic epidemiology research, with a focus on admixed and ancestrally diverse populations in support of reproducible research practices. We emphasize the importance of distinguishing socially defined population categorizations from genetic ancestry in the design, analysis, reporting, and interpretation of genetic epidemiology research findings. Finally, we discuss the current state of genomic resources used in genetic association studies, functional interpretation, and clinical and public health translation of genomic findings with respect to diverse populations.
  • Item
    Prediction of eye, hair and skin colour in Latin Americans
    Palmal, S ; Adhikari, K ; Mendoza-Revilla, J ; Fuentes-Guajardo, M ; de Cerqueira, CCS ; Bonfante, B ; Chacon-Duque, JC ; Sohail, A ; Hurtado, M ; Villegas, V ; Granja, V ; Jaramillo, C ; Arias, W ; Barquera Lozano, R ; Everardo-Martinez, P ; Gomez-Valdes, J ; Villamil-Ramirez, H ; Hunemeier, T ; Ramallo, V ; Parolin, M-L ; Gonzalez-Jose, R ; Schuler-Faccini, L ; Bortolini, M-C ; Acuna-Alonzo, V ; Canizales-Quinteros, S ; Gallo, C ; Poletti, G ; Bedoya, G ; Rothhammer, F ; Balding, D ; Faux, P ; Ruiz-Linares, A (ELSEVIER IRELAND LTD, 2021-07)
    Here we evaluate the accuracy of prediction for eye, hair and skin pigmentation in a dataset of >6500 individuals from Mexico, Colombia, Peru, Chile and Brazil (including genome-wide SNP data and quantitative/categorical pigmentation phenotypes - the CANDELA dataset, CAN). We evaluated accuracy in relation to different analytical methods and various phenotypic predictors. As expected from statistical principles, we observe that quantitative traits are more sensitive to changes in the prediction models than categorical traits. We find that Random Forest or Linear Regression are generally the best performing methods. We also compare the prediction accuracy of SNP sets defined in the CAN dataset (including 56, 101 and 120 SNPs for eye, hair and skin colour prediction, respectively) to the well-established HIrisPlex-S SNP set (including 6, 22 and 36 SNPs for eye, hair and skin colour prediction, respectively). When training prediction models on the CAN data, we observe remarkably similar performances for HIrisPlex-S and the larger CAN SNP sets for the prediction of hair (categorical) and eye (both categorical and quantitative), while the CAN sets outperform HIrisPlex-S for quantitative, but not for categorical skin pigmentation prediction. The performance of HIrisPlex-S, when models are trained in a world-wide sample (although consisting of 80% Europeans, https://hirisplex.erasmusmc.nl), is lower relative to training in the CAN data (particularly for hair and skin colour). Altogether, our observations are consistent with common variation of eye and hair colour having a relatively simple genetic architecture, which is well captured by HIrisPlex-S, even in admixed Latin Americans (with partial European ancestry). By contrast, since skin pigmentation is a more polygenic trait, accuracy is more sensitive to prediction SNP set size, although here this effect was only apparent for a quantitative measure of skin pigmentation. Our results support the use of HIrisPlex-S in the prediction of categorical pigmentation traits for forensic purposes in Latin America, while illustrating the impact of training datasets on its accuracy.
  • Item
    SNP-based heritability and selection analyses: Improved models and new results
    Speed, D ; Kaphle, A ; Balding, DJ (WILEY, 2022-05)
    Complex-trait genetics has advanced dramatically through methods to estimate the heritability tagged by SNPs, both genome-wide and in genomic regions of interest such as those defined by functional annotations. The models underlying many of these analyses are inadequate, and consequently many SNP-heritability results published to date are inaccurate. Here, we review the modelling issues, both for analyses based on individual genotype data and association test statistics, highlighting the role of a low-dimensional model for the heritability of each SNP. We use state-of-the-art models to present updated results about how heritability is distributed with respect to functional annotations in the human genome, and how it varies with allele frequency, which can reflect purifying selection. Our results give finer detail to the picture that has emerged in recent years of complex trait heritability widely dispersed across the genome. Confounding due to population structure remains a problem that summary statistic analyses cannot reliably overcome. Also see the video abstract here: https://youtu.be/WC2u03V65MQ.
  • Item
    Genome-wide association mapping of Hagberg falling number, protein content, test weight, and grain yield in UK wheat
    White, J ; Sharma, R ; Balding, D ; Cockram, J ; Mackay, IJ (WILEY, 2022-05)
    Association mapping using crop cultivars allows identification of genetic loci of direct relevance to breeding. Here, 150 U.K. wheat (Triticum aestivum L.) cultivars genotyped with 23,288 single nucleotide polymorphisms (SNPs) were used for genome-wide association studies (GWAS) using historical phenotypic data for grain protein content, Hagberg falling number (HFN), test weight, and grain yield. Power calculations indicated experimental design would enable detection of quantitative trait loci (QTL) explaining ≥20% of the variation (PVE) at a relatively high power of >80%, falling to 40% for detection of a SNP with an R² ≥ 0.5 with the same QTL. Genome-wide association studies identified marker-trait associations for all four traits. For HFN (h² = 0.89), six QTL were identified, including a major locus on chromosome 7B explaining 49% PVE and reducing HFN by 44 s. For protein content (h² = 0.86), 10 QTL were found on chromosomes 1A, 2A, 2B, 3A, 3B, and 6B, together explaining 48.9% PVE. For test weight, five QTL were identified (one on 1B and four on 3B; 26.3% PVE). Finally, 14 loci were identified for grain yield (h² = 0.95) on eight chromosomes (1A, 2A, 2B, 2D, 3A, 5B, 6A, 6B; 68.1% PVE), of which five were located within 16 Mbp of genetic regions previously identified as under breeder selection in European wheat. Our study demonstrates the utility of exploiting historical crop datasets, identifying genomic targets for independent validation, and ultimately for wheat genetic improvement.
  • Item
    Disentangling Signatures of Selection Before and After European Colonization in Latin Americans
    Mendoza-Revilla, J ; Chacon-Duque, JC ; Fuentes-Guajardo, M ; Ormond, L ; Wang, K ; Hurtado, M ; Villegas, V ; Granja, V ; Acuna-Alonzo, V ; Jaramillo, C ; Arias, W ; Barquera, R ; Gomez-Valdes, J ; Villamil-Ramirez, H ; de Cerqueira, CCS ; Rivera, KMB ; Nieves-Colon, MA ; Gignoux, CR ; Wojcik, GL ; Moreno-Estrada, A ; Hunemeier, T ; Ramallo, V ; Schuler-Faccini, L ; Gonzalez-Jose, R ; Bortolini, M-C ; Canizales-Quinteros, S ; Gallo, C ; Poletti, G ; Bedoya, G ; Rothhammer, F ; Balding, D ; Fumagalli, M ; Adhikari, K ; Ruiz-Linares, A ; Hellenthal, G ; Kim, Y (OXFORD UNIV PRESS, 2022-04-11)
    Throughout human evolutionary history, large-scale migrations have led to intermixing (i.e., admixture) between previously separated human groups. Although classical and recent work have shown that studying admixture can yield novel historical insights, the extent to which this process contributed to adaptation remains underexplored. Here, we introduce a novel statistical model, specific to admixed populations, that identifies loci under selection while determining whether the selection likely occurred post-admixture or prior to admixture in one of the ancestral source populations. Through extensive simulations, we show that this method is able to detect selection, even in recently formed admixed populations, and to accurately differentiate between selection occurring in the ancestral or admixed population. We apply this method to genome-wide SNP data of ∼4,000 individuals in five admixed Latin American cohorts from Brazil, Chile, Colombia, Mexico, and Peru. Our approach replicates previous reports of selection in the human leukocyte antigen region that are consistent with selection post-admixture. We also report novel signals of selection in genomic regions spanning 47 genes, reinforcing many of these signals with an alternative, commonly used local-ancestry-inference approach. These signals include several genes involved in immunity, which may reflect responses to endemic pathogens of the Americas and to the challenge of infectious disease brought by European contact. In addition, some of the strongest signals inferred to be under selection in the Native American ancestral groups of modern Latin Americans overlap with genes implicated in energy metabolism phenotypes, plausibly reflecting adaptations to novel dietary sources available in the Americas.
  • Item
    Bayesian inference of ancestral recombination graphs
    Mahmoudi, A ; Koskela, J ; Kelleher, J ; Chan, Y-B ; Balding, D ; Kosakovsky Pond, SL (PUBLIC LIBRARY SCIENCE, 2022-03)
    We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters. We show using simulations that ARGinfer can accurately estimate many properties of the evolutionary history of the sample, including the topology and branch lengths of the genealogical tree at each sequence site, and the times and locations of mutation and recombination events. ARGinfer approximates posterior probability distributions for these and other quantities, providing interpretable assessments of uncertainty that we show to be well calibrated. ARGinfer is currently limited to tens of DNA sequences of several hundreds of kilobases, but has scope for further computational improvements to increase its applicability.
  • Item
    Genome-wide association, prediction and heritability in bacteria with application to Streptococcus pneumoniae
    Mallawaarachchi, S ; Tonkin-Hill, G ; Croucher, NJ ; Turner, P ; Speed, D ; Corander, J ; Balding, D (OXFORD UNIV PRESS, 2022-01-13)
    Whole-genome sequencing has facilitated genome-wide analyses of association, prediction and heritability in many organisms. However, such analyses in bacteria are still in their infancy, being limited by difficulties including genome plasticity and strong population structure. Here we propose a suite of methods including linear mixed models, elastic net and LD-score regression, adapted to bacterial traits using innovations such as frequency-based allele coding, both insertion/deletion and nucleotide testing and heritability partitioning. We compare and validate our methods against the current state-of-the-art using simulations, and analyse three phenotypes of the major human pathogen Streptococcus pneumoniae, including the first analyses of minimum inhibitory concentrations (MIC) for penicillin and ceftriaxone. We show that the MIC traits are highly heritable with high prediction accuracy, explained by many genetic associations under good population structure control. In ceftriaxone MIC, this is surprising because none of the isolates are resistant as per the inhibition zone criteria. We estimate that half of the heritability of penicillin MIC is explained by a known drug-resistance region, which also contributes a quarter of the ceftriaxone MIC heritability. For the within-host carriage duration phenotype, no associations were observed, but the moderate heritability and prediction accuracy indicate a moderately polygenic trait.