School of Mathematics and Statistics - Research Publications

  • Item
    Prediction of eye, hair and skin colour in Latin Americans
    Palmal, S ; Adhikari, K ; Mendoza-Revilla, J ; Fuentes-Guajardo, M ; de Cerqueira, CCS ; Bonfante, B ; Chacon-Duque, JC ; Sohail, A ; Hurtado, M ; Villegas, V ; Granja, V ; Jaramillo, C ; Arias, W ; Barquera Lozano, R ; Everardo-Martinez, P ; Gomez-Valdes, J ; Villamil-Ramirez, H ; Hunemeier, T ; Ramallo, V ; Parolin, M-L ; Gonzalez-Jose, R ; Schuler-Faccini, L ; Bortolini, M-C ; Acuna-Alonzo, V ; Canizales-Quinteros, S ; Gallo, C ; Poletti, G ; Bedoya, G ; Rothhammer, F ; Balding, D ; Faux, P ; Ruiz-Linares, A (ELSEVIER IRELAND LTD, 2021-04-14)
    Here we evaluate the accuracy of prediction for eye, hair and skin pigmentation in a dataset of >6500 individuals from Mexico, Colombia, Peru, Chile and Brazil (including genome-wide SNP data and quantitative/categorical pigmentation phenotypes - the CANDELA dataset, CAN). We evaluated accuracy in relation to different analytical methods and various phenotypic predictors. As expected from statistical principles, we observe that quantitative traits are more sensitive to changes in the prediction models than categorical traits. We find that Random Forest or Linear Regression are generally the best performing methods. We also compare the prediction accuracy of SNP sets defined in the CAN dataset (including 56, 101 and 120 SNPs for eye, hair and skin colour prediction, respectively) to the well-established HIrisPlex-S SNP set (including 6, 22 and 36 SNPs for eye, hair and skin colour prediction, respectively). When training prediction models on the CAN data, we observe remarkably similar performances for HIrisPlex-S and the larger CAN SNP sets for the prediction of hair (categorical) and eye (both categorical and quantitative), while the CAN sets outperform HIrisPlex-S for quantitative, but not for categorical skin pigmentation prediction. The performance of HIrisPlex-S, when models are trained in a world-wide sample (although consisting of 80% Europeans, https://hirisplex.erasmusmc.nl), is lower relative to training in the CAN data (particularly for hair and skin colour). Altogether, our observations are consistent with common variation of eye and hair colour having a relatively simple genetic architecture, which is well captured by HIrisPlex-S, even in admixed Latin Americans (with partial European ancestry). By contrast, since skin pigmentation is a more polygenic trait, accuracy is more sensitive to prediction SNP set size, although here this effect was only apparent for a quantitative measure of skin pigmentation. Our results support the use of HIrisPlex-S in the prediction of categorical pigmentation traits for forensic purposes in Latin America, while illustrating the impact of training datasets on its accuracy.
  • Item
    SNP-based heritability and selection analyses: Improved models and new results
    Speed, D ; Kaphle, A ; Balding, DJ (WILEY, 2022-03-13)
    Complex-trait genetics has advanced dramatically through methods to estimate the heritability tagged by SNPs, both genome-wide and in genomic regions of interest such as those defined by functional annotations. The models underlying many of these analyses are inadequate, and consequently many SNP-heritability results published to date are inaccurate. Here, we review the modelling issues, both for analyses based on individual genotype data and association test statistics, highlighting the role of a low-dimensional model for the heritability of each SNP. We use state-of-art models to present updated results about how heritability is distributed with respect to functional annotations in the human genome, and how it varies with allele frequency, which can reflect purifying selection. Our results give finer detail to the picture that has emerged in recent years of complex trait heritability widely dispersed across the genome. Confounding due to population structure remains a problem that summary statistic analyses cannot reliably overcome. Also see the video abstract here: https://youtu.be/WC2u03V65MQ.
  • Item
    Genome-wide association mapping of Hagberg falling number, protein content, test weight, and grain yield in UK wheat
    White, J ; Sharma, R ; Balding, D ; Cockram, J ; Mackay, IJ (WILEY, 2022-03-04)
    Association mapping using crop cultivars allows identification of genetic loci of direct relevance to breeding. Here, 150 U.K. wheat (Triticum aestivum L.) cultivars genotyped with 23,288 single nucleotide polymorphisms (SNPs) were used for genome-wide association studies (GWAS) using historical phenotypic data for grain protein content, Hagberg falling number (HFN), test weight, and grain yield. Power calculations indicated experimental design would enable detection of quantitative trait loci (QTL) explaining ≥20% of the variation (PVE) at a relatively high power of >80%, falling to 40% for detection of a SNP with an R² ≥ 0.5 with the same QTL. Genome-wide association studies identified marker-trait associations for all four traits. For HFN (h² = 0.89), six QTL were identified, including a major locus on chromosome 7B explaining 49% PVE and reducing HFN by 44 s. For protein content (h² = 0.86), 10 QTL were found on chromosomes 1A, 2A, 2B, 3A, 3B, and 6B, together explaining 48.9% PVE. For test weight, five QTL were identified (one on 1B and four on 3B; 26.3% PVE). Finally, 14 loci were identified for grain yield (h² = 0.95) on eight chromosomes (1A, 2A, 2B, 2D, 3A, 5B, 6A, 6B; 68.1% PVE), of which five were located within 16 Mbp of genetic regions previously identified as under breeder selection in European wheat. Our study demonstrates the utility of exploiting historical crop datasets, identifying genomic targets for independent validation, and ultimately for wheat genetic improvement.
  • Item
    Disentangling Signatures of Selection Before and After European Colonization in Latin Americans
    Mendoza-Revilla, J ; Chacon-Duque, JC ; Fuentes-Guajardo, M ; Ormond, L ; Wang, K ; Hurtado, M ; Villegas, V ; Granja, V ; Acuna-Alonzo, V ; Jaramillo, C ; Arias, W ; Barquera, R ; Gomez-Valdes, J ; Villamil-Ramirez, H ; de Cerqueira, CCS ; Rivera, KMB ; Nieves-Colon, MA ; Gignoux, CR ; Wojcik, GL ; Moreno-Estrada, A ; Hunemeier, T ; Ramallo, V ; Schuler-Faccini, L ; Gonzalez-Jose, R ; Bortolini, M-C ; Canizales-Quinteros, S ; Gallo, C ; Poletti, G ; Bedoya, G ; Rothhammer, F ; Balding, D ; Fumagalli, M ; Adhikari, K ; Ruiz-Linares, A ; Hellenthal, G ; Kim, Y (OXFORD UNIV PRESS, 2022-04-11)
    Throughout human evolutionary history, large-scale migrations have led to intermixing (i.e., admixture) between previously separated human groups. Although classical and recent work have shown that studying admixture can yield novel historical insights, the extent to which this process contributed to adaptation remains underexplored. Here, we introduce a novel statistical model, specific to admixed populations, that identifies loci under selection while determining whether the selection likely occurred post-admixture or prior to admixture in one of the ancestral source populations. Through extensive simulations, we show that this method is able to detect selection, even in recently formed admixed populations, and to accurately differentiate between selection occurring in the ancestral or admixed population. We apply this method to genome-wide SNP data of ∼4,000 individuals in five admixed Latin American cohorts from Brazil, Chile, Colombia, Mexico, and Peru. Our approach replicates previous reports of selection in the human leukocyte antigen region that are consistent with selection post-admixture. We also report novel signals of selection in genomic regions spanning 47 genes, reinforcing many of these signals with an alternative, commonly used local-ancestry-inference approach. These signals include several genes involved in immunity, which may reflect responses to endemic pathogens of the Americas and to the challenge of infectious disease brought by European contact. In addition, some of the strongest signals inferred to be under selection in the Native American ancestral groups of modern Latin Americans overlap with genes implicated in energy metabolism phenotypes, plausibly reflecting adaptations to novel dietary sources available in the Americas.
  • Item
    Bayesian inference of ancestral recombination graphs
    Mahmoudi, A ; Koskela, J ; Kelleher, J ; Chan, Y-B ; Balding, D ; Kosakovsky Pond, SL (PUBLIC LIBRARY SCIENCE, 2022-03-01)
    We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters. We show using simulations that ARGinfer can accurately estimate many properties of the evolutionary history of the sample, including the topology and branch lengths of the genealogical tree at each sequence site, and the times and locations of mutation and recombination events. ARGinfer approximates posterior probability distributions for these and other quantities, providing interpretable assessments of uncertainty that we show to be well calibrated. ARGinfer is currently limited to tens of DNA sequences of several hundreds of kilobases, but has scope for further computational improvements to increase its applicability.
  • Item
    Genome-wide association, prediction and heritability in bacteria with application to Streptococcus pneumoniae
    Mallawaarachchi, S ; Tonkin-Hill, G ; Croucher, NJ ; Turner, P ; Speed, D ; Corander, J ; Balding, D (OXFORD UNIV PRESS, 2022-01-13)
    Whole-genome sequencing has facilitated genome-wide analyses of association, prediction and heritability in many organisms. However, such analyses in bacteria are still in their infancy, being limited by difficulties including genome plasticity and strong population structure. Here we propose a suite of methods including linear mixed models, elastic net and LD-score regression, adapted to bacterial traits using innovations such as frequency-based allele coding, both insertion/deletion and nucleotide testing and heritability partitioning. We compare and validate our methods against the current state-of-art using simulations, and analyse three phenotypes of the major human pathogen Streptococcus pneumoniae, including the first analyses of minimum inhibitory concentrations (MIC) for penicillin and ceftriaxone. We show that the MIC traits are highly heritable with high prediction accuracy, explained by many genetic associations under good population structure control. In ceftriaxone MIC, this is surprising because none of the isolates are resistant as per the inhibition zone criteria. We estimate that half of the heritability of penicillin MIC is explained by a known drug-resistance region, which also contributes a quarter of the ceftriaxone MIC heritability. For the within-host carriage duration phenotype, no associations were observed, but the moderate heritability and prediction accuracy indicate a moderately polygenic trait.
  • Item
    Retraction of a peer reviewed article suggests ongoing problems with Australian forensic science.
    Brook, C ; Lynøe, N ; Eriksson, A ; Balding, D (Elsevier BV, 2021)
    We describe events arising from the case of Joby Rowe, convicted of the homicide of his three-month-old daughter, and explore what they illustrate about systemic problems in the forensic science community in Australia. A peer-reviewed journal article that scrutinized the forensic evidence presented in the Rowe case was retracted by a forensic science journal for reasons unrelated to quality or accuracy, under pressure from forensic medical experts criticized in the article. Details of the retraction obtained through freedom of information mechanisms reveal improper pressure and subversion of publishing processes in order to avoid scrutiny. The retraction was supported by the editorial board and two Australian forensic science societies, which is indicative of serious deficiencies in the leadership of forensic science in Australia. We propose paths forward including blind peer review, publication of expert reports, and a criminal cases review authority, that would help stimulate a culture that encourages scrutiny, and relies on evidence-based rather than eminence-based knowledge.
  • Item
    Optimizing genomic medicine in epilepsy through a gene-customized approach to missense variant interpretation
    Traynelis, J ; Silk, M ; Wang, Q ; Berkovic, SF ; Liu, L ; Ascher, DB ; Balding, DJ ; Petrovski, S (COLD SPRING HARBOR LAB PRESS, PUBLICATIONS DEPT, 2017-10-01)
    Gene panel and exome sequencing have revealed a high rate of molecular diagnoses among diseases where the genetic architecture has proven suitable for sequencing approaches, with a large number of distinct and highly penetrant causal variants identified among a growing list of disease genes. The challenge is, given the DNA sequence of a new patient, to distinguish disease-causing from benign variants. Large samples of human standing variation data highlight regional variation in the tolerance to missense variation within the protein-coding sequence of genes. This information is not well captured by existing bioinformatic tools, but is effective in improving variant interpretation. To address this limitation in existing tools, we introduce the missense tolerance ratio (MTR), which summarizes available human standing variation data within genes to encapsulate population level genetic variation. We find that patient-ascertained pathogenic variants preferentially cluster in low MTR regions (P < 0.005) of well-informed genes. By evaluating 20 publicly available predictive tools across genes linked to epilepsy, we also highlight the importance of understanding the empirical null distribution of existing prediction tools, as these vary across genes. Subsequently integrating the MTR with the empirically selected bioinformatic tools in a gene-specific approach demonstrates a clear improvement in the ability to predict pathogenic missense variants from background missense variation in disease genes. Among an independent test sample of case and control missense variants, case variants (0.83 median score) consistently achieve higher pathogenicity prediction probabilities than control variants (0.02 median score; Mann-Whitney U test, P < 1 × 10⁻¹⁶). We focus on the application to epilepsy genes; however, the framework is applicable to disease genes beyond epilepsy.
  • Item
    Assessing the Forensic Value of DNA Evidence from Y Chromosomes and Mitogenomes
    Andersen, MM ; Balding, DJ (MDPI, 2021-08-01)
    Y chromosome and mitochondrial DNA profiles have been used as evidence in courts for decades, yet the problem of evaluating the weight of evidence has not been adequately resolved. Both are lineage markers (inherited from just one parent), which presents different interpretation challenges compared with standard autosomal DNA profiles (inherited from both parents). We review approaches to the evaluation of lineage marker profiles for forensic identification, focussing on the key roles of profile mutation rate and relatedness (extending beyond known relatives). Higher mutation rates imply fewer individuals matching the profile of an alleged contributor, but they will be more closely related. This makes it challenging to evaluate the possibility that one of these matching individuals could be the true source, because relatives may be plausible alternative contributors, and may not be well mixed in the population. These issues reduce the usefulness of profile databases drawn from a broad population: larger populations can have a lower profile relative frequency because of lower relatedness with the alleged contributor. Many evaluation methods do not adequately take account of distant relatedness, but its effects have become more pronounced with the latest generation of high-mutation-rate Y profiles.
  • Item
    Summary statistic analyses can mistake confounding bias for heritability
    Holmes, JB ; Speed, D ; Balding, DJ (WILEY, 2019-09-20)
    Linkage disequilibrium SCore regression (LDSC) has become a popular approach to estimate confounding bias, heritability, and genetic correlation using only genome-wide association study (GWAS) test statistics. SumHer is a newly introduced alternative with similar aims. We show using theory and simulations that both approaches fail to adequately account for confounding bias, even when the assumed heritability model is correct. Consequently, these methods may estimate heritability poorly if there was an inadequate adjustment for confounding in the original GWAS analysis. We also show that the choice of a summary statistic for use in LDSC or SumHer can have a large impact on resulting inferences. Further, covariate adjustments in the original GWAS can alter the target of heritability estimation, which can be problematic for test statistics from a meta-analysis of GWAS with different covariate adjustments.