School of Mathematics and Statistics - Research Publications

  • Item
    A guide to creating design matrices for gene expression experiments.
    Law, CW ; Zeglinski, K ; Dong, X ; Alhamdoosh, M ; Smyth, GK ; Ritchie, ME (F1000 Research Ltd, 2020)
    Differential expression analysis of genomic data types, such as RNA-sequencing experiments, uses linear models to determine the size and direction of the changes in gene expression. For RNA-sequencing, there are several established software packages for this purpose, accompanied by analysis pipelines that are well described. However, there are two crucial steps in the analysis process that can be a stumbling block for many: the setup of an appropriate model via design matrices and the setup of comparisons of interest via contrast matrices. These steps are particularly troublesome because an extensive catalogue for design and contrast matrices does not currently exist. One would usually search for example case studies across different platforms and mix and match the advice from those sources to suit the dataset at hand. This article guides the reader through the basics of how to set up design and contrast matrices. We take a practical approach by providing code and graphical representation of each case study, starting with simpler examples (e.g. models with a single explanatory variable) and moving on to more complex ones (e.g. interaction models, mixed effects models, higher order time series and cyclical models). Although our work has been written specifically with a limma-style pipeline in mind, most of it is also applicable to other software packages for differential expression analysis, and the ideas covered can be adapted to data analysis of other high-throughput technologies. Where appropriate, we explain the interpretation and differences between models to aid readers in their own model choices. Unnecessary jargon and theory are omitted where possible so that our work is accessible to a wide audience of readers, from beginners to those with experience in genomics data analysis.
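As a rough illustration of the two steps this abstract names, the sketch below builds a design matrix for a single two-level factor and applies a contrast to the fitted coefficients. It is not taken from the article (which targets a limma-style pipeline in R); the expression values, the group labels and the helper `solve_2x2` are all hypothetical, and the fit is plain least squares for one gene.

```python
# Hypothetical log-expression values for one gene across six samples.
y = [5.1, 4.9, 5.0, 7.2, 6.8, 7.0]
groups = ["ctrl", "ctrl", "ctrl", "treat", "treat", "treat"]

# Design matrix for a means model (no intercept): one indicator column per group.
levels = sorted(set(groups))  # ["ctrl", "treat"]
design = [[1.0 if g == lev else 0.0 for lev in levels] for g in groups]


def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]


# Normal equations (X'X) beta = X'y give the least-squares coefficients.
xtx = [[sum(row[i] * row[j] for row in design) for j in range(2)] for i in range(2)]
xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(2)]
coef = solve_2x2(xtx, xty)  # one fitted mean per group

# Contrast vector: "treat minus ctrl", applied to the coefficients.
contrast = [-1.0, 1.0]
log_fold_change = sum(c * b for c, b in zip(contrast, coef))
```

In this parameterisation each coefficient is a group mean, so the contrast is simply the difference of means; with an intercept-based design the same comparison would be read directly from the second coefficient.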
  • Item
    Empirical array quality weights in the analysis of microarray data
    Ritchie, ME ; Diyagama, D ; Neilson, J ; van Laar, R ; Dobrovic, A ; Holloway, A ; Smyth, GK (BMC, 2006-05-19)
    BACKGROUND: Assessment of array quality is an essential step in the analysis of data from microarray experiments. Once detected, less reliable arrays are typically excluded or "filtered" from further analysis to avoid misleading results. RESULTS: In this article, a graduated approach to array quality is considered based on empirical reproducibility of the gene expression measures from replicate arrays. Weights are assigned to each microarray by fitting a heteroscedastic linear model with shared array variance terms. A novel gene-by-gene update algorithm is used to efficiently estimate the array variances. The inverse variances are used as weights in the linear model analysis to identify differentially expressed genes. The method successfully assigns lower weights to less reproducible arrays from different experiments. Down-weighting the observations from suspect arrays increases the power to detect differential expression. In smaller experiments, this approach outperforms the usual method of filtering the data. The method is available in the limma software package which is implemented in the R software environment. CONCLUSION: This method complements existing normalisation and spot quality procedures, and allows poorer quality arrays, which would otherwise be discarded, to be included in an analysis. It is applicable to microarray data from experiments with some level of replication.
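The weighting idea in this abstract can be sketched for a single gene as follows. This is a simplified illustration, not the limma implementation: the array variances are assumed to be already known (the paper estimates them from a heteroscedastic linear model), and the numbers and the function name `weighted_mean` are hypothetical.

```python
# Hypothetical expression values for one gene on three replicate arrays.
values = [2.0, 2.2, 4.0]

# Assumed relative array variances; the third array is the less reproducible one.
array_variances = [1.0, 1.0, 4.0]

# Inverse variances serve as the array weights.
weights = [1.0 / v for v in array_variances]


def weighted_mean(xs, ws):
    """Weighted least-squares estimate of a common mean."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
```

The noisy array still contributes, but with a quarter of the weight of the others, so the estimate moves toward the two concordant arrays instead of discarding the third outright.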
  • Item
    Integrative analysis of RUNX1 downstream pathways and target genes
    Michaud, J ; Simpson, KM ; Escher, R ; Buchet-Poyau, K ; Beissbarth, T ; Carmichael, C ; Ritchie, ME ; Schuetz, F ; Cannon, P ; Liu, M ; Shen, X ; Ito, Y ; Raskind, WH ; Horwitz, MS ; Osato, M ; Turner, DR ; Speed, TP ; Kavallaris, M ; Smyth, GK ; Scott, HS (BMC, 2008-07-31)
    BACKGROUND: The RUNX1 transcription factor gene is frequently mutated in sporadic myeloid and lymphoid leukemia through translocation, point mutation or amplification. It is also responsible for a familial platelet disorder with predisposition to acute myeloid leukemia (FPD-AML). The disruption of the largely unknown biological pathways controlled by RUNX1 is likely to be responsible for the development of leukemia. We have used multiple microarray platforms and bioinformatic techniques to help identify these biological pathways to aid in the understanding of why RUNX1 mutations lead to leukemia. RESULTS: Here we report genes regulated either directly or indirectly by RUNX1 based on the study of gene expression profiles generated from 3 different human and mouse platforms. The platforms used were global gene expression profiling of: 1) cell lines with RUNX1 mutations from FPD-AML patients, 2) over-expression of RUNX1 and CBFbeta, and 3) Runx1 knockout mouse embryos using either cDNA or Affymetrix microarrays. We observe that our datasets (lists of differentially expressed genes) significantly correlate with published microarray data from sporadic AML patients with mutations in either RUNX1 or its cofactor, CBFbeta. A number of biological processes were identified among the differentially expressed genes and functional assays suggest that heterozygous RUNX1 point mutations in patients with FPD-AML impair cell proliferation, microtubule dynamics and possibly genetic stability. In addition, analysis of the regulatory regions of the differentially expressed genes has for the first time systematically identified numerous potential novel RUNX1 target genes. CONCLUSION: This work is the first large-scale study attempting to identify the genetic networks regulated by RUNX1, a master regulator in the development of the hematopoietic system and leukemia. 
    The biological pathways and target genes controlled by RUNX1 will have considerable importance in disease progression in both familial and sporadic leukemia as well as therapeutic implications.
  • Item
    Illumina WG-6 BeadChip strips should be normalized separately
    Shi, W ; Banerjee, A ; Ritchie, ME ; Gerondakis, S ; Smyth, GK (BMC, 2009-11-11)
    BACKGROUND: Illumina Sentrix-6 Whole-Genome Expression BeadChips are relatively new microarray platforms which have been used in many microarray studies in the past few years. These Chips have a unique design in which each Chip contains six microarrays and each microarray consists of two separate physical strips, posing special challenges for precise between-array normalization of expression values. RESULTS: None of the normalization strategies proposed so far for this microarray platform allow for the possibility of systematic variation between the two strips comprising each array. That this variation can be substantial is illustrated by a data example. We demonstrate that normalizing at the strip-level rather than at the array-level can effectively remove this between-strip variation, improve the precision of gene expression measurements and discover more differentially expressed genes. The gain is substantial, yielding a 20% increase in statistical information and doubling the number of genes detected at a 5% false discovery rate. Functional analysis reveals that the extra genes found tend to have interesting biological meanings, dramatically strengthening the biological conclusions from the experiment. Strip-level normalization still outperforms array-level normalization when non-expressed probes are filtered out. CONCLUSION: Plots are proposed which demonstrate how the need for strip-level normalization relates to inconsistent intensity range variation between the strips. Strip-level normalization is recommended for the preprocessing of Illumina Sentrix-6 BeadChips whenever the intensity range is seen to be inconsistent between the strips. R code is provided to implement the recommended plots and normalization algorithms.
  • Item
    limma powers differential expression analyses for RNA-sequencing and microarray studies
    Ritchie, ME ; Phipson, B ; Wu, D ; Hu, Y ; Law, CW ; Shi, W ; Smyth, GK (OXFORD UNIV PRESS, 2015-04-20)
    limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.
  • Item
    Microarray background correction: maximum likelihood estimation for the normal-exponential convolution
    Silver, JD ; Ritchie, ME ; Smyth, GK (OXFORD UNIV PRESS, 2009-04-01)
    Background correction is an important preprocessing step for microarray data that attempts to adjust the data for the ambient intensity surrounding each feature. The "normexp" method models the observed pixel intensities as the sum of 2 random variables, one normally distributed and the other exponentially distributed, representing background noise and signal, respectively. Using a saddle-point approximation, Ritchie and others (2007) found normexp to be the best background correction method for 2-color microarray data. This article develops the normexp method further by improving the estimation of the parameters. A complete mathematical development is given of the normexp model and the associated saddle-point approximation. Some subtle numerical programming issues are solved which caused the original normexp method to fail occasionally when applied to unusual data sets. A practical and reliable algorithm is developed for exact maximum likelihood estimation (MLE) using high-quality optimization software and using the saddle-point estimates as starting values. "MLE" is shown to outperform heuristic estimators proposed by other authors, both in terms of estimation accuracy and in terms of performance on real data. The saddle-point approximation is an adequate replacement in most practical situations. The performance of normexp for assessing differential expression is improved by adding a small offset to the corrected intensities.
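For readers unfamiliar with the normal-exponential convolution, the sketch below computes the expected signal given an observed intensity under that model. It assumes the parameters are already estimated (parameter estimation is the subject of the paper and is not attempted here). The names `mu`, `sigma` and `alpha` (background mean, background standard deviation, and mean of the exponential signal) and the function `normexp_signal` are labels chosen for this illustration. Under the model, the signal conditional on the observation is a normal distribution truncated at zero, which gives the closed form used below.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function, via erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def normexp_signal(x, mu, sigma, alpha):
    """E[S | X = x] when X = B + S, B ~ Normal(mu, sigma^2), S ~ Exponential(mean alpha).

    Given X = x, S is normal with mean (x - mu - sigma^2 / alpha) and
    standard deviation sigma, truncated to S > 0.
    """
    mu_sx = x - mu - sigma ** 2 / alpha
    z = mu_sx / sigma
    # Mean of a normal distribution truncated below at zero.
    # Note: normal_cdf(z) can underflow to 0 for extremely negative z.
    return mu_sx + sigma * normal_pdf(z) / normal_cdf(z)


# Hypothetical parameter values, for illustration only.
low = normexp_signal(50.0, mu=100.0, sigma=10.0, alpha=200.0)
high = normexp_signal(1000.0, mu=100.0, sigma=10.0, alpha=200.0)
```

The corrected value stays positive even when the observed intensity is below the estimated background, and for bright spots it approaches the observed intensity minus the background mean and a small shrinkage term.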
  • Item
    A pooled shRNA screen for regulators of primary mammary stem and progenitor cells identifies roles for Asap1 and Prox1
    Sheridan, JM ; Ritchie, ME ; Best, SA ; Jiang, K ; Beck, TJ ; Vaillant, F ; Liu, K ; Dickins, RA ; Smyth, GK ; Lindeman, GJ ; Visvader, JE (BMC, 2015-04-03)
    BACKGROUND: The molecular regulators that orchestrate stem cell renewal, proliferation and differentiation along the mammary epithelial hierarchy remain poorly understood. Here we have performed a large-scale pooled RNAi screen in primary mouse mammary stem cell (MaSC)-enriched basal cells using 1295 shRNAs against genes principally involved in transcriptional regulation. METHODS: MaSC-enriched basal cells transduced with lentivirus pools carrying shRNAs were maintained as non-adherent mammospheres, a system known to support stem and progenitor cells. Integrated shRNAs that altered culture kinetics were identified by next generation sequencing as relative frequency changes over time. RNA-seq-based expression profiling coupled with in vitro progenitor and in vivo transplantation assays was used to confirm a role for candidate genes in mammary stem and/or progenitor cells. RESULTS: Utilizing a mammosphere-based assay, the screen identified several candidate regulators. Although some genes had been previously implicated in mammary gland development, the vast majority of genes uncovered have no known function within the mammary gland. RNA-seq analysis of freshly purified primary mammary epithelial populations and short-term cultured mammospheres was used to confirm the expression of candidate regulators. Two genes, Asap1 and Prox1, respectively implicated in breast cancer metastasis and progenitor cell function in other systems, were selected for further analysis as their roles in the normal mammary gland were unknown. Both Prox1 and Asap1 were shown to act as negative regulators of progenitor activity in vitro, and Asap1 knock-down led to a marked increase in repopulating activity in vivo, implying a role in stem cell activity. CONCLUSIONS: This study has revealed a number of novel genes that influence the activity or survival of mammary stem and/or progenitor cells. 
    Amongst these, we demonstrate that Prox1 and Asap1 behave as negative regulators of mammary stem/progenitor function. Both of these genes have also been implicated in oncogenesis. Our findings provide proof of principle for the use of short-term cultured primary MaSC/basal cells in functional RNAi screens.
  • Item
    RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR.
    Law, CW ; Alhamdoosh, M ; Su, S ; Dong, X ; Tian, L ; Smyth, GK ; Ritchie, ME (F1000 Research Ltd, 2016)
    The ability to easily and efficiently analyse RNA-sequencing data is a key strength of the Bioconductor project. Starting with counts summarised at the gene-level, a typical analysis involves pre-processing, exploratory data analysis, differential expression testing and pathway analysis with the results obtained informing future experiments and validation studies. In this workflow article, we analyse RNA-sequencing data from the mouse mammary gland, demonstrating use of the popular edgeR package to import, organise, filter and normalise the data, followed by the limma package with its voom method, linear modelling and empirical Bayes moderation to assess differential expression and perform gene set testing. This pipeline is further enhanced by the Glimma package which enables interactive exploration of the results so that individual samples and genes can be examined by the user. The complete analysis offered by these three packages highlights the ease with which researchers can turn the raw counts from an RNA-sequencing experiment into biological insights using Bioconductor.
  • Item
    RNA-seq mixology: designing realistic control experiments to compare protocols and analysis methods
    Holik, AZ ; Law, CW ; Liu, R ; Wang, Z ; Wang, W ; Ahn, J ; Asselin-Labat, M-L ; Smyth, GK ; Ritchie, ME (OXFORD UNIV PRESS, 2017-03-17)
    Carefully designed control experiments provide a gold standard for benchmarking different genomics research tools. A shortcoming of many gene expression control studies is that replication involves profiling the same reference RNA sample multiple times. This leads to low, pure technical noise that is atypical of regular studies. To achieve a more realistic noise structure, we generated an RNA-sequencing mixture experiment using two cell lines of the same cancer type. Variability was added by extracting RNA from independent cell cultures and degrading particular samples. The systematic gene expression changes induced by this design allowed benchmarking of different library preparation kits (standard poly-A versus total RNA with Ribozero depletion) and analysis pipelines. Data generated using the total RNA kit had more signal for introns and various RNA classes (ncRNA, snRNA, snoRNA) and less variability after degradation. For differential expression analysis, voom with quality weights marginally outperformed other popular methods, while for differential splicing, DEXSeq was simultaneously the most sensitive and the most inconsistent method. For sample deconvolution analysis, DeMix outperformed IsoPure convincingly. Our RNA-sequencing data set provides a valuable resource for benchmarking different protocols and data pre-processing workflows. The extra noise mimics routine lab experiments more closely, ensuring any conclusions are widely applicable.
  • Item
    Germline heterozygous mutations in Nxf1 perturb RNA metabolism and trigger thrombocytopenia and lymphopenia in mice
    Chappaz, S ; Law, CW ; Dowling, MR ; Carey, KT ; Lane, RM ; Ngo, LH ; Wickramasinghe, VO ; Smyth, GK ; Ritchie, ME ; Kile, BT (ELSEVIER, 2020-04-14)
    In eukaryotic cells, messenger RNA (mRNA) molecules are exported from the nucleus to the cytoplasm, where they are translated. The highly conserved protein nuclear RNA export factor 1 (Nxf1) is an important mediator of this process. Although studies in yeast and in human cell lines have shed light on the biochemical mechanisms of Nxf1 function, its contribution to mammalian physiology is less clear. Several groups have identified recurrent NXF1 mutations in chronic lymphocytic leukemia (CLL), placing it alongside several RNA-metabolism factors (including SF3B1, XPO, RPS15) whose dysregulation is thought to contribute to CLL pathogenesis. We report here an allelic series of germline point mutations in murine Nxf1. Mice heterozygous for these loss-of-function Nxf1 mutations exhibit thrombocytopenia and lymphopenia, together with milder hematological defects. This is primarily caused by cell-intrinsic defects in the survival of platelets and peripheral lymphocytes, which are sensitized to intrinsic apoptosis. In contrast, Nxf1 mutations have almost no effect on red blood cell homeostasis. Comparative transcriptome analysis of platelets, lymphocytes, and erythrocytes from Nxf1-mutant mice shows that, in response to impaired Nxf1 function, the cytoplasmic representation of transcripts encoding regulators of RNA metabolism is altered in a unique, lineage-specific way. Thus, blood cell lineages exhibit differential requirements for Nxf1-mediated global mRNA export.