School of Mathematics and Statistics - Research Publications

Search Results

Now showing 1 - 10 of 62
  • Item
    A hierarchical approach to removal of unwanted variation for large-scale metabolomics data
    Kim, T ; Tang, O ; Vernon, ST ; Kott, KA ; Koay, YC ; Park, J ; James, DE ; Grieve, SM ; Speed, TP ; Yang, P ; Figtree, GA ; O'Sullivan, JF ; Yang, JYH (NATURE PORTFOLIO, 2021-08-17)
    Liquid chromatography-mass spectrometry-based metabolomics studies are increasingly applied to large population cohorts, whose data acquisition runs for several weeks or even years. This inevitably introduces unwanted intra- and inter-batch variations over time that can overshadow true biological signals and thus hinder potential biological discoveries. To date, normalisation approaches have struggled to mitigate the variability introduced by technical factors whilst preserving biological variance, especially for protracted acquisitions. Here, we propose a study design framework with an arrangement for embedding biological sample replicates to quantify variance within and between batches and a workflow that uses these replicates to remove unwanted variation in a hierarchical manner (hRUV). We use this design to produce a dataset of more than 1000 human plasma samples run over an extended period of time. We demonstrate significant improvement of hRUV over existing methods in preserving biological signals whilst removing unwanted variation for large-scale metabolomics studies. Our tools not only provide a strategy for large-scale data normalisation, but also provide guidance on the design strategy for large omics studies.
  • Item
    Strategies to enable large-scale proteomics for reproducible research
    Poulos, RC ; Hains, PG ; Shah, R ; Lucas, N ; Xavier, D ; Manda, SS ; Anees, A ; Koh, JMS ; Mahboob, S ; Wittman, M ; Williams, SG ; Sykes, EK ; Hecker, M ; Dausmann, M ; Wouters, MA ; Ashman, K ; Yang, J ; Wild, PJ ; deFazio, A ; Balleine, RL ; Tully, B ; Aebersold, R ; Speed, TP ; Liu, Y ; Reddel, RR ; Robinson, PJ ; Zhong, Q (NATURE RESEARCH, 2020-07-30)
    Reproducible research is the bedrock of experimental science. To enable the deployment of large-scale proteomics, we assess the reproducibility of mass spectrometry (MS) over time and across instruments and develop computational methods for improving quantitative accuracy. We perform 1560 data independent acquisition (DIA)-MS runs of eight samples containing known proportions of ovarian and prostate cancer tissue and yeast, or control HEK293T cells. Replicates are run on six mass spectrometers operating continuously with varying maintenance schedules over four months, interspersed with ~5000 other runs. We utilise negative controls and replicates to remove unwanted variation and enhance biological signal, outperforming existing methods. We also design a method for reducing missing values. Integrating these computational modules into a pipeline (ProNorM), we mitigate variation among instruments over time and accurately predict tissue proportions. We demonstrate how to improve the quantitative analysis of large-scale DIA-MS data, providing a pathway toward clinical proteomics.
  • Item
    Controlling technical variation amongst 6693 patient microarrays of the randomized MINDACT trial
    Jacob, L ; Witteveen, A ; Beumer, I ; Delahaye, L ; Wehkamp, D ; van den Akker, J ; Snel, M ; Chan, B ; Floore, A ; Bakx, N ; Brink, G ; Poncet, C ; Bogaerts, J ; Delorenzi, M ; Piccart, M ; Rutgers, E ; Cardoso, F ; Speed, T ; van't Veer, L ; Glas, A (NATURE PUBLISHING GROUP, 2020-07-27)
    Gene expression data obtained in large studies hold great promise for discovering disease signatures or subtypes through data analysis. They are also prone to technical variation, whose removal is essential to avoid spurious discoveries. Because this variation is not always known and can be confounded with biological signals, its removal is a challenging task. Here we provide a step-wise procedure and comprehensive analysis of the MINDACT microarray dataset. The MINDACT trial enrolled 6693 breast cancer patients and prospectively validated the gene expression signature MammaPrint for outcome prediction. The study also yielded a full-transcriptome microarray for each tumor. We show for the first time in such a large dataset how technical variation can be removed while retaining expected biological signals. Because of its unprecedented size, we hope the resulting adjusted dataset will be an invaluable tool to discover or test gene expression signatures and to advance our understanding of breast cancer.
  • Item
    Evaluating stably expressed genes in single cells
    Lin, Y ; Ghazanfar, S ; Strbenac, D ; Wang, A ; Patrick, E ; Lin, DM ; Speed, T ; Yang, JYH ; Yang, P (OXFORD UNIV PRESS, 2019-09-01)
    BACKGROUND: Single-cell RNA-seq (scRNA-seq) profiling has revealed remarkable variation in transcription, suggesting that expression of many genes at the single-cell level is intrinsically stochastic and noisy. Yet, on the cell population level, a subset of genes traditionally referred to as housekeeping genes (HKGs) are found to be stably expressed in different cell and tissue types. It is therefore critical to question whether stably expressed genes (SEGs) can be identified on the single-cell level, and if so, how their expression stability can be assessed. We have previously proposed a computational framework for ranking expression stability of genes in single cells for scRNA-seq data normalization and integration. In this study, we perform detailed evaluation and characterization of SEGs derived from this framework. RESULTS: Here, we show that gene expression stability indices derived from the early human and mouse development scRNA-seq datasets and the "Mouse Atlas" dataset are reproducible and conserved across species. We demonstrate that SEGs identified from single cells based on their stability indices are considerably more stable than HKGs defined previously from cell populations across diverse biological systems. Our analyses indicate that SEGs are inherently more stable at the single-cell level and that their characteristics are reminiscent of HKGs, suggesting their potential role in sustaining essential functions in individual cells. CONCLUSIONS: SEGs identified in this study have immediate utility both for understanding variation and stability of single-cell transcriptomes and for practical applications such as scRNA-seq data normalization. Our framework for calculating gene stability index, "scSEGIndex," is incorporated into the scMerge Bioconductor R package (https://sydneybiox.github.io/scMerge/reference/scSEGIndex.html) and can be used for identifying genes with stable expression in scRNA-seq datasets.
  • Item
    Accurate RNA Sequencing From Formalin-Fixed Cancer Tissue to Represent High-Quality Transcriptome From Frozen Tissue
    Li, J ; Fu, C ; Speed, TP ; Wang, W ; Symmans, WF (AMER SOC CLINICAL ONCOLOGY, 2018-01-26)
    PURPOSE: Accurate transcriptional sequencing (RNA-seq) from formalin-fixed, paraffin-embedded (FFPE) tumor samples presents an important challenge for translational research and diagnostic development. In addition, there are now several different protocols to prepare a sequencing library from total RNA. We evaluated the accuracy of RNA-seq data generated from FFPE samples in terms of expression profiling. METHODS: We designed a biospecimen study to directly compare gene expression results from different protocols to prepare libraries for RNA-seq from human breast cancer tissues, with randomization to fresh-frozen (FF) or FFPE conditions. The protocols were compared using multiple computational methods to assess alignment of reads to reference genome, and the uniformity and continuity of coverage; as well as the variance and correlation of overall gene expression and patterns of measuring coding sequence, phenotypic patterns of gene expression, and measurements from representative multigene signatures. RESULTS: The principal determinant of variance in gene expression was use of exon capture probes, followed by the conditions of preservation (FF versus FFPE), and phenotypic differences between breast cancers. One protocol, with RNase H-based rRNA depletion, exhibited the least variability of gene expression measurements and the strongest correlation between FF and FFPE samples, and was generally representative of the transcriptome from standard FF RNA-seq protocols. CONCLUSION: Method of RNA-seq library preparation from FFPE samples had a marked effect on the accuracy of gene expression measurement compared to matched FF samples. Nevertheless, some protocols produced highly concordant expression data from FFPE RNA-seq data, compared to RNA-seq results from matched frozen samples.
  • Item
    A TOOLKIT FOR THE QUANTITATIVE ANALYSIS OF THE SPATIAL DISTRIBUTION OF CELLS OF THE TUMOR IMMUNE MICROENVIRONMENT
    Trigos, A ; Yang, T ; Feng, Y ; Ozcoban, V ; Doyle, M ; Pasam, A ; Kocovski, N ; Pizzolla, A ; Huang, Y-K ; Bass, G ; Keam, S ; Speed, T ; Neeson, P ; Sandhu, S ; Goode, D (BMJ PUBLISHING GROUP, 2020-11-01)
  • Item
    Evaluation of cross-platform and interlaboratory concordance via consensus modelling of genomic measurements
    Peters, TJ ; French, HJ ; Bradford, ST ; Pidsley, R ; Stirzaker, C ; Varinli, H ; Nair, S ; Qu, W ; Song, J ; Giles, KA ; Statham, AL ; Speirs, H ; Speed, TP ; Clark, SJ ; Hancock, J (OXFORD UNIV PRESS, 2019-02-15)
    MOTIVATION: A synoptic view of the human genome benefits chiefly from the application of nucleic acid sequencing and microarray technologies. These platforms allow interrogation of patterns such as gene expression and DNA methylation at the vast majority of canonical loci, allowing granular insights and opportunities for validation of original findings. However, problems arise when validating against a "gold standard" measurement, since this immediately biases all subsequent measurements towards that particular technology or protocol. Since all genomic measurements are estimates, in the absence of a "gold standard" we instead empirically assess the measurement precision and sensitivity of a large suite of genomic technologies via a consensus modelling method called the row-linear model. This method is an application of the American Society for Testing and Materials Standard E691 for assessing interlaboratory precision and sources of variability across multiple testing sites. Both cross-platform and cross-locus comparisons can be made across all common loci, allowing identification of technology- and locus-specific tendencies. RESULTS: We assess technologies including the Infinium MethylationEPIC BeadChip, whole genome bisulfite sequencing (WGBS), two different RNA-Seq protocols (PolyA+ and Ribo-Zero) and five different gene expression array platforms. Each technology thus is characterised herein, relative to the consensus. We showcase a number of applications of the row-linear model, including correlation with known interfering traits. We demonstrate a clear effect of cross-hybridisation on the sensitivity of Infinium methylation arrays. Additionally, we perform a true interlaboratory test on a set of samples interrogated on the same platform across twenty-one separate testing laboratories. AVAILABILITY AND IMPLEMENTATION: A full implementation of the row-linear model, plus extra functions for visualisation, can be found in the R package consensus at https://github.com/timpeters82/consensus. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
  • Item
    The healthy ageing gene expression signature for Alzheimer's disease diagnosis: a random sampling perspective
    Jacob, L ; Speed, TP (BMC, 2018-07-25)
    In a recent publication, Sood et al. (Genome Biol 16:185, 2015) presented a set of 150 probe sets that could be used in the diagnosis of Alzheimer's disease (AD) based on gene expression. We reproduce some of their experiments and show that their signature is indeed able to discriminate between AD and control patients using blood gene expression in two cohorts. We also show that its performance does not stand out compared to randomly sampled sets of 150 probe sets from the same array.
  • Item
    G protein-linked signaling pathways in bipolar and major depressive disorders
    Tomita, H ; Ziegler, ME ; Kim, HB ; Evans, SJ ; Choudary, PV ; Li, JZ ; Meng, F ; Dai, M ; Myers, RM ; Neal, CR ; Speed, TP ; Barchas, JD ; Schatzberg, AF ; Watson, SJ ; Akil, H ; Jones, EG ; Bunney, WE ; Vawter, MP (Frontiers Media SA, 2013)
    The G-protein linked signaling system (GPLS) comprises a large number of G-proteins, G protein-coupled receptors (GPCRs), GPCR ligands, and downstream effector molecules. G-proteins interact with both GPCRs and downstream effectors such as cyclic adenosine monophosphate (cAMP), phosphatidylinositols, and ion channels. The GPLS is implicated in the pathophysiology and pharmacology of both major depressive disorder (MDD) and bipolar disorder (BPD). This study evaluated whether GPLS is altered at the transcript level. Gene expression in the dorsolateral prefrontal cortex (DLPFC) and anterior cingulate cortex (ACC) was compared among MDD, BPD, and control subjects using Affymetrix Gene Chips and real time quantitative PCR. High quality brain tissue was used in the study to control for confounding effects of agonal events, tissue pH, RNA integrity, gender, and age. GPLS signaling transcripts were altered especially in the ACC of BPD and MDD subjects. Transcript levels of molecules which repress cAMP activity were increased in BPD and decreased in MDD. Two orphan GPCRs, GPRC5B and GPR37, showed significantly decreased expression levels in MDD, and significantly increased expression levels in BPD. Our results suggest opposite changes in BPD and MDD in the GPLS: "activated" cAMP signaling activity in BPD and "blunted" cAMP signaling activity in MDD. GPRC5B and GPR37 both appear to have behavioral effects, and are also candidate genes for neurodegenerative disorders. In the context of the opposite changes observed in BPD and MDD, these GPCRs warrant further study of their brain effects.
  • Item
    Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data
    Dai, MH ; Wang, PL ; Boyd, AD ; Kostov, G ; Athey, B ; Jones, EG ; Bunney, WE ; Myers, RM ; Speed, TP ; Akil, H ; Watson, SJ ; Meng, F (OXFORD UNIV PRESS, 2005-01-01)
    Genome-wide expression profiling is a powerful tool for implicating novel gene ensembles in cellular mechanisms of health and disease. The most popular platform for genome-wide expression profiling is the Affymetrix GeneChip. However, its selection of probes relied on earlier genome and transcriptome annotation, which is significantly different from current knowledge. The resultant informatics problems have a profound impact on analysis and interpretation of the data. Here, we address these critical issues and offer a solution. We identified several classes of problems at the individual probe level in the existing annotation, under the assumption that current genome and transcriptome databases are more accurate than those used for GeneChip design. We then reorganized probes on more than a dozen popular GeneChips into gene-, transcript- and exon-specific probe sets in light of up-to-date genome, cDNA/EST clustering and single nucleotide polymorphism information. Comparing analysis results between the original and the redefined probe sets reveals approximately 30-50% discrepancy in the genes previously identified as differentially expressed, regardless of analysis method. Our results demonstrate that the original Affymetrix probe set definitions are inaccurate, and many conclusions derived from past GeneChip analyses may be significantly flawed. It will be beneficial to re-analyze existing GeneChip data with updated probe set definitions.
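
The random-sampling check described in the Jacob and Speed (2018) entry above, in which a published signature is compared against randomly drawn probe sets of the same size, might be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the function names (`centroid_accuracy`, `random_set_comparison`), the leave-one-out nearest-centroid scorer, and all parameter values are assumptions made for the example.

```python
import random


def centroid_accuracy(samples, labels, probes):
    """Leave-one-out nearest-centroid accuracy using only the given probe indices."""
    classes = sorted(set(labels))
    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        dists = {}
        for cls in classes:
            # Class centroid computed from every sample except the held-out one.
            members = [s for j, (s, l) in enumerate(zip(samples, labels))
                       if l == cls and j != i]
            centroid = [sum(m[p] for m in members) / len(members) for p in probes]
            dists[cls] = sum((x[p] - c) ** 2 for p, c in zip(probes, centroid))
        if min(dists, key=dists.get) == y:
            correct += 1
    return correct / len(samples)


def random_set_comparison(samples, labels, signature, n_random=200, seed=0):
    """Score a signature and n_random random probe sets of the same size."""
    rng = random.Random(seed)
    n_probes = len(samples[0])
    observed = centroid_accuracy(samples, labels, signature)
    null = [
        centroid_accuracy(samples, labels,
                          rng.sample(range(n_probes), len(signature)))
        for _ in range(n_random)
    ]
    # Fraction of random probe sets that perform at least as well as the signature.
    frac = sum(a >= observed for a in null) / n_random
    return observed, null, frac


# Synthetic demo (hypothetical data): 20 samples x 60 probes, where every probe
# carries a weak class shift, so random probe sets can discriminate as well.
data_rng = random.Random(1)
labels = [0] * 10 + [1] * 10
samples = [[data_rng.gauss(0.5 * y, 1.0) for _ in range(60)] for y in labels]
signature = list(range(5))  # stand-in for a published probe-set signature

observed, null, frac = random_set_comparison(samples, labels, signature, n_random=50)
print(f"signature accuracy: {observed:.2f}")
print(f"random sets doing at least as well: {frac:.0%}")
```

A large fraction in the last line would suggest that the signature's performance does not stand out from random probe sets of the same size, which is the kind of conclusion that entry reports.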