School of Mathematics and Statistics - Research Publications

  • Item
    Runx3 drives a CD8+ T cell tissue residency program that is absent in CD4+ T cells
    Fonseca, R ; Burn, TN ; Gandolfo, LC ; Devi, S ; Park, SL ; Obers, A ; Evrard, M ; Christo, SN ; Buquicchio, FA ; Lareau, CA ; McDonald, KM ; Sandford, SK ; Zamudio, NM ; Zanluqui, NG ; Zaid, A ; Speed, TP ; Satpathy, AT ; Mueller, SN ; Carbone, FR ; Mackay, LK (NATURE PORTFOLIO, 2022-08)
    Tissue-resident memory T cells (TRM cells) provide rapid and superior control of localized infections. While the transcription factor Runx3 is a critical regulator of CD8+ T cell tissue residency, its expression is repressed in CD4+ T cells. Here, we show that, as a direct consequence of this Runx3-deficiency, CD4+ TRM cells lacked the transforming growth factor (TGF)-β-responsive transcriptional network that underpins the tissue residency of epithelial CD8+ TRM cells. While CD4+ TRM cell formation required Runx1, this, along with the modest expression of Runx3 in CD4+ TRM cells, was insufficient to engage the TGF-β-driven residency program. Ectopic expression of Runx3 in CD4+ T cells incited this TGF-β-transcriptional network to promote prolonged survival, decreased tissue egress, a microanatomical redistribution towards epithelial layers and enhanced effector functionality. Thus, our results reveal distinct programming of tissue residency in CD8+ and CD4+ TRM cell subsets that is attributable to divergent Runx3 activity.
  • Item
    Estimation of tumor cell total mRNA expression in 15 cancer types predicts disease progression
    Cao, S ; Wang, JR ; Ji, S ; Yang, P ; Dai, Y ; Guo, S ; Montierth, MD ; Shen, JP ; Zhao, X ; Chen, J ; Lee, JJ ; Guerrero, PA ; Spetsieris, N ; Engedal, N ; Taavitsainen, S ; Yu, K ; Livingstone, J ; Bhandari, V ; Hubert, SM ; Daw, NC ; Futreal, PA ; Efstathiou, E ; Lim, B ; Viale, A ; Zhang, J ; Nykter, M ; Czerniak, BA ; Brown, PH ; Swanton, C ; Msaouel, P ; Maitra, A ; Kopetz, S ; Campbell, P ; Speed, TP ; Boutros, PC ; Zhu, H ; Urbanucci, A ; Demeulemeester, J ; Van Loo, P ; Wang, W (NATURE PORTFOLIO, 2022-11)
    Single-cell RNA sequencing studies have suggested that total mRNA content correlates with tumor phenotypes. Technical and analytical challenges, however, have so far impeded at-scale pan-cancer examination of total mRNA content. Here we present a method to quantify tumor-specific total mRNA expression (TmS) from bulk sequencing data, taking into account tumor transcript proportion, purity and ploidy, which are estimated through transcriptomic/genomic deconvolution. We estimate and validate TmS in 6,590 patient tumors across 15 cancer types, identifying significant inter-tumor variability. Across cancers, high TmS is associated with increased risk of disease progression and death. TmS is influenced by cancer-specific patterns of gene alteration and intra-tumor genetic heterogeneity as well as by pan-cancer trends in metabolic dysregulation. Taken together, our results indicate that measuring cell-type-specific total mRNA expression in tumor cells predicts tumor phenotypes and clinical outcomes.
  • Item
    RUV-III-NB: normalization of single cell RNA-seq data
    Salim, A ; Molania, R ; Wang, J ; De Livera, A ; Thijssen, R ; Speed, TP (OXFORD UNIV PRESS, 2022-09-09)
    Normalization of single cell RNA-seq data remains a challenging task. The performance of different methods can vary greatly between datasets when unwanted factors and biology are associated. Most normalization methods also only remove the effects of unwanted variation for the cell embedding but not from gene-level data typically used for differential expression (DE) analysis to identify marker genes. We propose RUV-III-NB, a method that can be used to remove unwanted variation from both the cell embedding and gene-level counts. Using pseudo-replicates, RUV-III-NB explicitly takes into account potential association with biology when removing unwanted variation. The method can be used for both UMI and read counts and returns adjusted counts that can be used for downstream analyses such as clustering, DE and pseudotime analyses. Using published datasets with different technological platforms, kinds of biology and levels of association between biology and unwanted variation, we show that RUV-III-NB manages to remove library size and batch effects, strengthen biological signals, improve DE analyses, and lead to results exhibiting greater concordance with independent datasets of the same kind. The performance of RUV-III-NB is consistent and is not sensitive to the number of factors assumed to contribute to the unwanted variation.
  • Item
    A hierarchical approach to removal of unwanted variation for large-scale metabolomics data
    Kim, T ; Tang, O ; Vernon, ST ; Kott, KA ; Koay, YC ; Park, J ; James, DE ; Grieve, SM ; Speed, TP ; Yang, P ; Figtree, GA ; O'Sullivan, JF ; Yang, JYH (NATURE PORTFOLIO, 2021-08-17)
    Liquid chromatography-mass spectrometry-based metabolomics studies are increasingly applied to large population cohorts, which run for several weeks or even years in data acquisition. This inevitably introduces unwanted intra- and inter-batch variations over time that can overshadow true biological signals and thus hinder potential biological discoveries. To date, normalisation approaches have struggled to mitigate the variability introduced by technical factors whilst preserving biological variance, especially for protracted acquisitions. Here, we propose a study design framework with an arrangement for embedding biological sample replicates to quantify variance within and between batches and a workflow that uses these replicates to remove unwanted variation in a hierarchical manner (hRUV). We use this design to produce a dataset of more than 1000 human plasma samples run over an extended period of time. We demonstrate significant improvement of hRUV over existing methods in preserving biological signals whilst removing unwanted variation for large scale metabolomics studies. Our tools not only provide a strategy for large scale data normalisation, but also provide guidance on the design strategy for large omics studies.
  • Item
    Strategies to enable large-scale proteomics for reproducible research
    Poulos, RC ; Hains, PG ; Shah, R ; Lucas, N ; Xavier, D ; Manda, SS ; Anees, A ; Koh, JMS ; Mahboob, S ; Wittman, M ; Williams, SG ; Sykes, EK ; Hecker, M ; Dausmann, M ; Wouters, MA ; Ashman, K ; Yang, J ; Wild, PJ ; deFazio, A ; Balleine, RL ; Tully, B ; Aebersold, R ; Speed, TP ; Liu, Y ; Reddel, RR ; Robinson, PJ ; Zhong, Q (NATURE PORTFOLIO, 2020-07-30)
    Reproducible research is the bedrock of experimental science. To enable the deployment of large-scale proteomics, we assess the reproducibility of mass spectrometry (MS) over time and across instruments and develop computational methods for improving quantitative accuracy. We perform 1560 data independent acquisition (DIA)-MS runs of eight samples containing known proportions of ovarian and prostate cancer tissue and yeast, or control HEK293T cells. Replicates are run on six mass spectrometers operating continuously with varying maintenance schedules over four months, interspersed with ~5000 other runs. We utilise negative controls and replicates to remove unwanted variation and enhance biological signal, outperforming existing methods. We also design a method for reducing missing values. Integrating these computational modules into a pipeline (ProNorM), we mitigate variation among instruments over time and accurately predict tissue proportions. We demonstrate how to improve the quantitative analysis of large-scale DIA-MS data, providing a pathway toward clinical proteomics.
  • Item
    Controlling technical variation amongst 6693 patient microarrays of the randomized MINDACT trial
    Jacob, L ; Witteveen, A ; Beumer, I ; Delahaye, L ; Wehkamp, D ; van den Akker, J ; Snel, M ; Chan, B ; Floore, A ; Bakx, N ; Brink, G ; Poncet, C ; Bogaerts, J ; Delorenzi, M ; Piccart, M ; Rutgers, E ; Cardoso, F ; Speed, T ; van't Veer, L ; Glas, A (NATURE PUBLISHING GROUP, 2020-07-27)
    Gene expression data obtained in large studies hold great promise for discovering disease signatures or subtypes through data analysis. Such data are also prone to technical variation, whose removal is essential to avoid spurious discoveries. Because this variation is not always known and can be confounded with biological signals, its removal is a challenging task. Here we provide a step-wise procedure and comprehensive analysis of the MINDACT microarray dataset. The MINDACT trial enrolled 6693 breast cancer patients and prospectively validated the gene expression signature MammaPrint for outcome prediction. The study also yielded a full-transcriptome microarray for each tumor. We show for the first time in such a large dataset how technical variation can be removed while retaining expected biological signals. Because of its unprecedented size, we hope the resulting adjusted dataset will be an invaluable tool to discover or test gene expression signatures and to advance our understanding of breast cancer.
  • Item
    Evaluating stably expressed genes in single cells
    Lin, Y ; Ghazanfar, S ; Strbenac, D ; Wang, A ; Patrick, E ; Lin, DM ; Speed, T ; Yang, JYH ; Yang, P (OXFORD UNIV PRESS, 2019-09)
    BACKGROUND: Single-cell RNA-seq (scRNA-seq) profiling has revealed remarkable variation in transcription, suggesting that expression of many genes at the single-cell level is intrinsically stochastic and noisy. Yet, on the cell population level, a subset of genes traditionally referred to as housekeeping genes (HKGs) are found to be stably expressed in different cell and tissue types. It is therefore critical to question whether stably expressed genes (SEGs) can be identified on the single-cell level, and if so, how can their expression stability be assessed? We have previously proposed a computational framework for ranking expression stability of genes in single cells for scRNA-seq data normalization and integration. In this study, we perform detailed evaluation and characterization of SEGs derived from this framework. RESULTS: Here, we show that gene expression stability indices derived from the early human and mouse development scRNA-seq datasets and the "Mouse Atlas" dataset are reproducible and conserved across species. We demonstrate that SEGs identified from single cells based on their stability indices are considerably more stable than HKGs defined previously from cell populations across diverse biological systems. Our analyses indicate that SEGs are inherently more stable at the single-cell level and their characteristics are reminiscent of HKGs, suggesting their potential role in sustaining essential functions in individual cells. CONCLUSIONS: SEGs identified in this study have immediate utility both for understanding variation and stability of single-cell transcriptomes and for practical applications such as scRNA-seq data normalization. Our framework for calculating gene stability index, "scSEGIndex," is incorporated into the scMerge Bioconductor R package (https://sydneybiox.github.io/scMerge/reference/scSEGIndex.html) and can be used for identifying genes with stable expression in scRNA-seq datasets.
  • Item
    Accurate RNA Sequencing From Formalin-Fixed Cancer Tissue to Represent High-Quality Transcriptome From Frozen Tissue
    Li, J ; Fu, C ; Speed, TP ; Wang, W ; Symmans, WF (AMER SOC CLINICAL ONCOLOGY, 2018-01-26)
    PURPOSE: Accurate transcriptional sequencing (RNA-seq) from formalin-fixed, paraffin-embedded (FFPE) tumor samples presents an important challenge for translational research and diagnostic development. In addition, there are now several different protocols to prepare a sequencing library from total RNA. We evaluated the accuracy of RNA-seq data generated from FFPE samples in terms of expression profiling. METHODS: We designed a biospecimen study to directly compare gene expression results from different protocols to prepare libraries for RNA-seq from human breast cancer tissues, with randomization to fresh-frozen (FF) or FFPE conditions. The protocols were compared using multiple computational methods to assess alignment of reads to reference genome, and the uniformity and continuity of coverage; as well as the variance and correlation of overall gene expression and patterns of measuring coding sequence, phenotypic patterns of gene expression, and measurements from representative multigene signatures. RESULTS: The principal determinant of variance in gene expression was use of exon capture probes, followed by the conditions of preservation (FF versus FFPE), and phenotypic differences between breast cancers. One protocol, with RNase H-based rRNA depletion, exhibited the least variability of gene expression measurements and the strongest correlation between FF and FFPE samples, and was generally representative of the transcriptome from standard FF RNA-seq protocols. CONCLUSION: Method of RNA-seq library preparation from FFPE samples had a marked effect on the accuracy of gene expression measurement compared to matched FF samples. Nevertheless, some protocols produced highly concordant expression data from FFPE RNA-seq data, compared to RNA-seq results from matched frozen samples.
  • Item
    Evaluation of cross-platform and interlaboratory concordance via consensus modelling of genomic measurements
    Peters, TJ ; French, HJ ; Bradford, ST ; Pidsley, R ; Stirzaker, C ; Varinli, H ; Nair, S ; Qu, W ; Song, J ; Giles, KA ; Statham, AL ; Speirs, H ; Speed, TP ; Clark, SJ ; Hancock, J (OXFORD UNIV PRESS, 2019-02-15)
    MOTIVATION: A synoptic view of the human genome benefits chiefly from the application of nucleic acid sequencing and microarray technologies. These platforms allow interrogation of patterns such as gene expression and DNA methylation at the vast majority of canonical loci, allowing granular insights and opportunities for validation of original findings. However, problems arise when validating against a "gold standard" measurement, since this immediately biases all subsequent measurements towards that particular technology or protocol. Since all genomic measurements are estimates, in the absence of a "gold standard" we instead empirically assess the measurement precision and sensitivity of a large suite of genomic technologies via a consensus modelling method called the row-linear model. This method is an application of the American Society for Testing and Materials Standard E691 for assessing interlaboratory precision and sources of variability across multiple testing sites. Both cross-platform and cross-locus comparisons can be made across all common loci, allowing identification of technology- and locus-specific tendencies. RESULTS: We assess technologies including the Infinium MethylationEPIC BeadChip, whole genome bisulfite sequencing (WGBS), two different RNA-Seq protocols (PolyA+ and Ribo-Zero) and five different gene expression array platforms. Each technology thus is characterised herein, relative to the consensus. We showcase a number of applications of the row-linear model, including correlation with known interfering traits. We demonstrate a clear effect of cross-hybridisation on the sensitivity of Infinium methylation arrays. Additionally, we perform a true interlaboratory test on a set of samples interrogated on the same platform across twenty-one separate testing laboratories. 
    AVAILABILITY AND IMPLEMENTATION: A full implementation of the row-linear model, plus extra functions for visualisation, are found in the R package consensus at https://github.com/timpeters82/consensus. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
  • Item
    The healthy ageing gene expression signature for Alzheimer's disease diagnosis: a random sampling perspective
    Jacob, L ; Speed, TP (BMC, 2018-07-25)
    In a recent publication, Sood et al. (Genome Biol 16:185, 2015) presented a set of 150 probe sets that could be used in the diagnosis of Alzheimer's disease (AD) based on gene expression. We reproduce some of their experiments and show that their signature is indeed able to discriminate between AD and control patients using blood gene expression in two cohorts. We also show that its performance does not stand out compared to randomly sampled sets of 150 probe sets from the same array.