School of Mathematics and Statistics - Research Publications

  • Item
    Predicting qualitative phenotypes from microarray data - the Eadgene pig data set.
    Robert-Granié, C ; Lê Cao, K-A ; Sancristobal, M (Springer Science and Business Media LLC, 2009-07-16)
    BACKGROUND: The aim of this work was to study the performances of 2 predictive statistical tools on a data set that was given to all participants of the Eadgene-SABRE Post Analyses Working Group, namely the Pig data set of Hazard et al. (2008). The data consisted of 3686 gene expressions measured on 24 animals partitioned in 2 genotypes and 2 treatments. The objective was to find biomarkers that characterized the genotypes and the treatments in the whole set of genes. METHODS: We first considered the Random Forest approach that enables the selection of predictive variables. We then compared the classical Partial Least Squares regression (PLS) with a novel approach called sparse PLS, a variant of PLS that adapts lasso penalization and allows for the selection of a subset of variables. RESULTS: All methods performed well on this data set. The sparse PLS outperformed the PLS in terms of prediction performance and improved the interpretability of the results. CONCLUSION: We recommend the use of machine learning methods such as Random Forest and multivariate methods such as sparse PLS for prediction purposes. Both approaches are well adapted to transcriptomic data where the number of features is much greater than the number of individuals.
  • Item
    The EADGENE microarray data analysis workshop (open access publication)
    De Koning, D-J ; Jaffrezic, F ; Lund, MS ; Watson, M ; Channing, C ; Hulsegge, I ; Pool, MH ; Buitenhuis, B ; Hedegaard, J ; Hornshoj, H ; Jiang, L ; Sorensen, P ; Marot, G ; Delmas, C ; Le Cao, K-A ; Cristobal, MS ; Baron, MD ; Malinverni, R ; Stella, A ; Brunner, RM ; Seyfert, H-M ; Jensen, K ; Mouzaki, D ; Waddington, D ; Jimenez-Marin, A ; Perez-Alegre, M ; Perez-Reinado, E ; Closset, R ; Detilleux, JC ; Dovc, P ; Lavric, M ; Nie, H ; Janss, L (EDP SCIENCES S A, 2007)
    Microarray analyses have become an important tool in animal genomics. While their use is becoming widespread, there is still a lot of ongoing research regarding the analysis of microarray data. In the context of a European Network of Excellence, 31 researchers representing 14 research groups from 10 countries performed and discussed the statistical analyses of real and simulated 2-colour microarray data that were distributed among participants. The real data consisted of 48 microarrays from a disease challenge experiment in dairy cattle, while the simulated data consisted of 10 microarrays from a direct comparison of two treatments (dye-balanced). While there was broader agreement with regards to methods of microarray normalisation and significance testing, there were major differences with regards to quality control. The quality control approaches varied from none, through using statistical weights, to omitting a large number of spots or omitting entire slides. Surprisingly, these very different approaches gave quite similar results when applied to the simulated data, although not all participating groups analysed both real and simulated data. The workshop was very successful in facilitating interaction between scientists with a diverse background but a common interest in microarray analyses.
  • Item
    Analysis of a simulated microarray dataset: Comparison of methods for data normalisation and detection of differential expression (Open Access publication)
    Watson, M ; Alegre, MP ; Baron, MD ; Delmas, C ; Dovc, P ; Duval, M ; Foulley, JL ; Pavon, JJG ; Hulsegge, I ; Jaffrezic, F ; Marin, AJ ; Lavric, M ; Le Cao, KA ; Marot, G ; Mouzaki, D ; Pool, MH ; Granie, CR ; Cristobal, MS ; Klopp, GT ; Waddington, D ; De Koning, DJ (EDP SCIENCES S A, 2007)
    Microarrays allow researchers to measure the expression of thousands of genes in a single experiment. Before statistical comparisons can be made, the data must be assessed for quality and normalisation procedures must be applied, of which many have been proposed. Methods of comparing the normalised data are also abundant, and no clear consensus has yet been reached. The purpose of this paper was to compare those methods used by the EADGENE network on a very noisy simulated data set. With the a priori knowledge of which genes are differentially expressed, it is possible to compare the success of each approach quantitatively. Use of an intensity-dependent normalisation procedure was common, as was correction for multiple testing. Most variety in performance resulted from differing approaches to data quality and the use of different statistical tests. Very few of the methods used any kind of background correction. A number of approaches achieved a success rate of 95% or above, with relatively small numbers of false positives and negatives. Applying stringent spot selection criteria and elimination of data did not improve the false positive rate and greatly increased the false negative rate. However, most approaches performed well, and it is encouraging that widely available techniques can achieve such good results on a very noisy data set.
  • Item
    Analysis of the real EADGENE data set: Multivariate approaches and post analysis (Open Access publication)
    Sorensen, P ; Bonnet, A ; Buitenhuis, B ; Closset, R ; Dejean, S ; Delmas, C ; Duval, M ; Glass, L ; Hedegaard, J ; Hornshoj, H ; Hulsegge, I ; Jaffrezic, F ; Jensen, K ; Jiang, L ; De Koning, D-J ; Le Cao, K-A ; Nie, H ; Petzl, W ; Pool, MH ; Robert-Granie, C ; Cristobal, MS ; Lund, MS ; Van Schothorst, EM ; Schuberth, H-J ; Seyfert, H-M ; Tosser-Klopp, G ; Waddington, D ; Watson, M ; Yang, W ; Zerbe, H (BMC, 2007)
    The aim of this paper was to describe, and when possible compare, the multivariate methods used by the participants in the EADGENE WP1.4 workshop. The first approach was for class discovery and class prediction using evidence from the data at hand. Several teams used hierarchical clustering (HC) or principal component analysis (PCA) to identify groups of differentially expressed genes with a similar expression pattern over time points and infective agent (E. coli or S. aureus). The main result from these analyses was that HC and PCA were able to separate tissue samples taken at 24 h following E. coli infection from the other samples. The second approach identified groups of differentially co-expressed genes, by identifying clusters of genes highly correlated when animals were infected with E. coli but not correlated more than expected by chance when the infective pathogen was S. aureus. The third approach looked at differential expression of predefined gene sets. Gene sets were defined based on information retrieved from biological databases such as Gene Ontology. Based on these annotation sources the teams used either the GlobalTest or the Fisher exact test to identify differentially expressed gene sets. The main result from these analyses was that gene sets involved in immune defence responses were differentially expressed.
  • Item
    Analysis of the real EADGENE data set: Comparison of methods and guidelines for data normalisation and selection of differentially expressed genes (Open Access publication)
    Jaffrezic, F ; De Koning, D-J ; Boettcher, PJ ; Bonnet, A ; Buitenhuis, B ; Closset, R ; Dejean, S ; Delmas, C ; Detilleux, JC ; Dovc, P ; Duval, M ; Foulley, J-L ; Hedegaard, J ; Hornshoj, H ; Hulsegge, I ; Janss, L ; Jensen, K ; Jiang, L ; Lavric, M ; Le Cao, K-A ; Lund, MS ; Malinverni, R ; Marot, G ; Nie, H ; Petzl, W ; Pool, MH ; Granie, CR ; Cristobal, MS ; Van Schothorst, EM ; Schuberth, H-J ; Sorensen, P ; Stella, A ; Tosser-Klopp, G ; Waddington, D ; Watson, M ; Yang, W ; Zerbe, H ; Seyfert, H-M (BMC, 2007)
    A large variety of methods has been proposed in the literature for microarray data analysis. The aim of this paper was to present techniques used by the EADGENE (European Animal Disease Genomics Network of Excellence) WP1.4 participants for data quality control, normalisation and statistical methods for the detection of differentially expressed genes in order to provide some more general data analysis guidelines. All the workshop participants were given a real data set obtained in an EADGENE-funded microarray study looking at the gene expression changes following artificial infection with two different mastitis-causing bacteria: Escherichia coli and Staphylococcus aureus. It was reassuring to see that most of the teams found the same main biological results. In fact, most of the differentially expressed genes were found for infection by E. coli between uninfected and 24 h challenged udder quarters. Very little transcriptional variation was observed for the bacterium S. aureus. Lists of differentially expressed genes found by the different research teams were, however, quite dependent on the method used, especially concerning the data quality control step. These analyses also emphasised a biological problem of cross-talk between infected and uninfected quarters which will have to be dealt with in further microarray studies.
  • Item
    Visualising associations between paired 'omics' data sets
    Gonzalez, I ; Le Cao, K-A ; Davis, MJ ; Dejean, S (BMC, 2012-11-13)
    BACKGROUND: Each omics platform is now able to generate a large amount of data. Genomics, proteomics, metabolomics, interactomics are compiled at an ever increasing pace and now form a core part of the fundamental systems biology framework. Recently, several integrative approaches have been proposed to extract meaningful information. However, these approaches lack visualisation outputs to fully unravel the complex associations between different biological entities. RESULTS: The multivariate statistical approaches 'regularized Canonical Correlation Analysis' and 'sparse Partial Least Squares regression' were recently developed to integrate two types of highly dimensional 'omics' data and to select relevant information. Using the results of these methods, we propose to revisit a few graphical outputs to better understand the relationships between two 'omics' data sets and to better visualise the correlation structure between the different biological entities. These graphical outputs include Correlation Circle plots, Relevance Networks and Clustered Image Maps. We demonstrate the usefulness of such graphical outputs on several biological data sets and further assess their biological relevance using gene ontology analysis. CONCLUSIONS: Such graphical outputs are undoubtedly useful to aid the interpretation of these promising integrative analysis tools and will certainly help in addressing fundamental biological questions and understanding systems as a whole. AVAILABILITY: The graphical tools described in this paper are implemented in the freely available R package mixOmics and in its associated web application.
  • Item
    Integrative mixture of experts to combine clinical factors and gene markers
    Le Cao, K-A ; Meugnier, E ; McLachlan, GJ (OXFORD UNIV PRESS, 2010-05-01)
    MOTIVATION: Microarrays are being increasingly used in cancer research to better characterize and classify tumors by selecting marker genes. However, as very few of these genes have been validated as predictive biomarkers so far, it is mostly conventional clinical and pathological factors that are being used as prognostic indicators of clinical course. Combining clinical data with gene expression data may add valuable information, but it is a challenging task due to their categorical versus continuous characteristics. We have further developed the mixture of experts (ME) methodology, a promising approach to tackle complex non-linear problems. Several variants are proposed in integrative ME as well as the inclusion of various gene selection methods to select a hybrid signature. RESULTS: We show on three cancer studies that prediction accuracy can be improved when combining both types of variables. Furthermore, the selected genes were found to be of high relevance and can be considered as potential biomarkers for the prognostic selection of cancer therapy. AVAILABILITY: Integrative ME is implemented in the R package integrativeME (http://cran.r-project.org/).
  • Item
    A novel approach for biomarker selection and the integration of repeated measures experiments from two assays
    Liquet, B ; Le Cao, K-A ; Hocini, H ; Thiebaut, R (BMC, 2012-12-06)
    BACKGROUND: High throughput 'omics' experiments are usually designed to compare changes observed between different conditions (or interventions) and to identify biomarkers capable of characterizing each condition. We consider the complex structure of repeated measurements from different assays where different conditions are applied on the same subjects. RESULTS: We propose a two-step analysis combining a multilevel approach and a multivariate approach to reveal separately the effects of conditions within subjects from the biological variation between subjects. The approach is extended to two-factor designs and to the integration of two matched data sets. It allows internal variable selection to highlight genes able to discriminate the net condition effect within subjects. A simulation study was performed to demonstrate the good performance of the multilevel multivariate approach compared to a classical multivariate method. The multilevel multivariate approach outperformed the classical multivariate approach with respect to the classification error rate and the selection of relevant genes. The approach was applied to an HIV-vaccine trial evaluating the response with gene expression and cytokine secretion. The discriminant multilevel analysis selected a relevant subset of genes while the integrative multilevel analysis highlighted clusters of genes and cytokines that were highly correlated across the samples. CONCLUSIONS: Our combined multilevel multivariate approach may help in finding signatures of vaccine effect and allows for a better understanding of immunological mechanisms activated by the intervention. The integrative analysis revealed clusters of genes that were associated with cytokine secretion. These clusters can be seen as gene signatures to predict future cytokine response. The approach is implemented in the R package mixOmics (http://cran.r-project.org/) with associated tutorials to perform the analysis.
  • Item
    Sparse canonical methods for biological data integration: application to a cross-platform study
    Le Cao, K-A ; Martin, PGP ; Robert-Granie, C ; Besse, P (BIOMED CENTRAL LTD, 2009-01-26)
    BACKGROUND: In the context of systems biology, few sparse approaches have been proposed so far to integrate several data sets. It is however an important and fundamental issue that will be widely encountered in post genomic studies, when simultaneously analyzing transcriptomics, proteomics and metabolomics data using different platforms, so as to understand the mutual interactions between the different data sets. In this high dimensional setting, variable selection is crucial to give interpretable results. We focus on a sparse Partial Least Squares approach (sPLS) to handle two-block data sets, where the relationship between the two types of variables is known to be symmetric. Sparse PLS has been developed either for a regression or a canonical correlation framework and includes a built-in procedure to select variables while integrating data. To illustrate the canonical mode approach, we analyzed the NCI60 data sets, where two different platforms (cDNA and Affymetrix chips) were used to study the transcriptome of sixty cancer cell lines. RESULTS: We compare the results obtained with two other sparse or related canonical correlation approaches: CCA with Elastic Net penalization (CCA-EN) and Co-Inertia Analysis (CIA). The latter does not include a built-in procedure for variable selection and requires a two-step analysis. We stress the lack of statistical criteria to evaluate canonical correlation methods, which makes biological interpretation absolutely necessary to compare the different gene selections. We also propose comprehensive graphical representations of both samples and variables to facilitate the interpretation of the results. CONCLUSION: sPLS and CCA-EN selected highly relevant genes and complementary findings from the two data sets, which enabled a detailed understanding of the molecular characteristics of several groups of cell lines. These two approaches were found to bring similar results, although they highlighted the same phenomena with a different priority. They outperformed CIA, which tended to select redundant information.
  • Item
    Uncoupled Embryonic and Extra-Embryonic Tissues Compromise Blastocyst Development after Somatic Cell Nuclear Transfer
    Degrelle, SA ; Jaffrezic, F ; Campion, E ; Le Cao, K-A ; Le Bourhis, D ; Richard, C ; Rodde, N ; Fleurot, R ; Everts, RE ; Lecardonnel, J ; Heyman, Y ; Vignon, X ; Yang, X ; Tian, XC ; Lewin, HA ; Renard, J-P ; Hue, I ; Akagi, T (PUBLIC LIBRARY SCIENCE, 2012-06-06)
    Somatic cell nuclear transfer (SCNT) is the most efficient cell reprogramming technique available, especially when working with bovine species. Although SCNT blastocysts performed equally well or better than controls in the weeks following embryo transfer at Day 7, elongation and gastrulation defects were observed prior to implantation. To understand the developmental implications of embryonic/extra-embryonic interactions, the morphological and molecular features of elongating and gastrulating tissues were analysed. At Day 18, 30 SCNT conceptuses were compared to 20 controls (AI and IVP: 10 conceptuses each); one-half of the SCNT conceptuses appeared normal while the other half showed signs of atypical elongation and gastrulation. SCNT was also associated with a high incidence of discordance in embryonic and extra-embryonic patterns, as evidenced by morphological and molecular "uncoupling". Elongation appeared to be secondarily affected; only 3 of 30 conceptuses had abnormally elongated shapes and there were very few differences in gene expression when they were compared to the controls. However, some of these differences could be linked to defects in microvilli formation or extracellular matrix composition and could thus impact extra-embryonic functions. In contrast to elongation, gastrulation stages included embryonic defects that likely affected the hypoblast, the epiblast, or the early stages of their differentiation. When taking into account SCNT conceptus somatic origin, i.e. the reprogramming efficiency of each bovine ear fibroblast (Low: 0029, Med: 7711, High: 5538), we found that embryonic abnormalities or severe embryonic/extra-embryonic uncoupling were more tightly correlated to embryo loss at implantation than were elongation defects. Alternatively, extra-embryonic differences between SCNT and control conceptuses at Day 18 were related to molecular plasticity (high efficiency/high plasticity) and subsequent pregnancy loss. Finally, because it alters re-differentiation processes in vivo, SCNT reprogramming highlights temporally and spatially restricted interactions among cells and tissues in a unique way.