School of BioSciences - Research Publications

  • Item
    Evolution of pathogenicity and sexual reproduction in eight Candida genomes
    Butler, G ; Rasmussen, MD ; Lin, MF ; Santos, MAS ; Sakthikumar, S ; Munro, CA ; Rheinbay, E ; Grabherr, M ; Forche, A ; Reedy, JL ; Agrafioti, I ; Arnaud, MB ; Bates, S ; Brown, AJP ; Brunke, S ; Costanzo, MC ; Fitzpatrick, DA ; de Groot, PWJ ; Harris, D ; Hoyer, LL ; Hube, B ; Klis, FM ; Kodira, C ; Lennard, N ; Logue, ME ; Martin, R ; Neiman, AM ; Nikolaou, E ; Quail, MA ; Quinn, J ; Santos, MC ; Schmitzberger, FF ; Sherlock, G ; Shah, P ; Silverstein, KAT ; Skrzypek, MS ; Soll, D ; Staggs, R ; Stansfield, I ; Stumpf, MPH ; Sudbery, PE ; Srikantha, T ; Zeng, Q ; Berman, J ; Berriman, M ; Heitman, J ; Gow, NAR ; Lorenz, MC ; Birren, BW ; Kellis, M ; Cuomo, CA (NATURE PUBLISHING GROUP, 2009-06-04)
    Candida species are the most common cause of opportunistic fungal infection worldwide. Here we report the genome sequences of six Candida species and compare these and related pathogens and non-pathogens. There are significant expansions of cell wall, secreted and transporter gene families in pathogenic species, suggesting adaptations associated with virulence. Large genomic tracts are homozygous in three diploid species, possibly resulting from recent recombination events. Surprisingly, key components of the mating and meiosis pathways are missing from several species. These include major differences at the mating-type loci (MTL); Lodderomyces elongisporus lacks MTL, and components of the a1/α2 cell identity determinant were lost in other species, raising questions about how mating and cell types are controlled. Analysis of the CUG leucine-to-serine genetic-code change reveals that 99% of ancestral CUG codons were erased and new ones arose elsewhere. Lastly, we revise the Candida albicans gene catalogue, identifying many new genes.
  • Item
    Evolving proteins at Darwin's bicentenary.
    Pinney, JW ; Stumpf, MPH (Springer Science and Business Media LLC, 2009)
    A report of the Biochemical Society/Wellcome Trust meeting 'Protein Evolution - Sequences, Structures and Systems', Hinxton, UK, 26-27 January 2009.
  • Item
    Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data
    Kirk, PDW ; Stumpf, MPH (OXFORD UNIV PRESS, 2009-05-15)
    MOTIVATION: Although it is widely accepted that high-throughput biological data are typically highly noisy, the effects that this uncertainty has upon the conclusions we draw from these data are often overlooked. However, in order to assign any degree of confidence to our conclusions, we must quantify these effects. Bootstrap resampling is one method by which this may be achieved. Here, we present a parametric bootstrapping approach for time-course data, in which Gaussian process regression (GPR) is used to fit a probabilistic model from which replicates may then be drawn. This approach implicitly allows the time dependence of the data to be taken into account, and is applicable to a wide range of problems. RESULTS: We apply GPR bootstrapping to two datasets from the literature. In the first example, we show how the approach may be used to investigate the effects of data uncertainty upon the estimation of parameters in an ordinary differential equations (ODE) model of a cell signalling pathway. Although we find that the parameter estimates inferred from the original dataset are relatively robust to data uncertainty, we also identify a distinct second set of estimates. In the second example, we use our method to show that the topology of networks constructed from time-course gene expression data appears to be sensitive to data uncertainty, although there may be individual edges in the network that are robust in light of present data. AVAILABILITY: Matlab code for performing GPR bootstrapping is available from our web site: http://www3.imperial.ac.uk/theoreticalsystemsbiology/data-software/.
  • Item
    The effects of incomplete protein interaction data on structural and evolutionary inferences
    de Silva, E ; Thorne, T ; Ingram, P ; Agrafioti, I ; Swire, J ; Wiuf, C ; Stumpf, MPH (BMC, 2006-11-03)
    BACKGROUND: Present protein interaction network data sets include only interactions among subsets of the proteins in an organism. Previously this has been ignored, but in principle any global network analysis that only looks at partial data may be biased. Here we demonstrate the need to consider network sampling properties explicitly and from the outset in any analysis. RESULTS: Here we study how properties of the yeast protein interaction network are affected by random and non-random sampling schemes using a range of different network statistics. Effects are shown to be independent of the inherent noise in protein interaction data. The effects of the incomplete nature of network data become very noticeable, especially for so-called network motifs. We also consider the effect of incomplete network data on functional and evolutionary inferences. CONCLUSION: Crucially, when only small, partial network data sets are considered, bias is virtually inevitable. Given the scope of effects considered here, previous analyses may have to be carefully reassessed: ignoring the fact that present network data are incomplete will severely affect our ability to understand biological systems.
  • Item
    Using likelihood-free inference to compare evolutionary dynamics of the protein networks of H. pylori and P. falciparum
    Ratmann, O ; Jorgensen, O ; Hinkley, T ; Stumpf, M ; Richardson, S ; Wiuf, C ; Bonhoeffer, S (PUBLIC LIBRARY SCIENCE, 2007-11)
    Gene duplication with subsequent interaction divergence is one of the primary driving forces in the evolution of genetic systems. Yet little is known about the precise mechanisms and the role of duplication divergence in the evolution of protein networks from the prokaryote and eukaryote domains. We developed a novel, model-based approach for Bayesian inference on biological network data that centres on approximate Bayesian computation, or likelihood-free inference. Instead of computing the intractable likelihood of the protein network topology, our method summarizes key features of the network and, based on these, uses an MCMC algorithm to approximate the posterior distribution of the model parameters. This allowed us to reliably fit a flexible mixture model that captures hallmarks of evolution by gene duplication and subfunctionalization to protein interaction network data of Helicobacter pylori and Plasmodium falciparum. The 80% credible intervals for the duplication-divergence component are [0.64, 0.98] for H. pylori and [0.87, 0.99] for P. falciparum. The remaining parameter estimates are not inconsistent with sequence data. An extensive sensitivity analysis showed that incompleteness of PIN data does not largely affect the analysis of models of protein network evolution, and that the degree sequence alone barely captures the evolutionary footprints of protein networks relative to other statistics. Our likelihood-free inference approach enables a fully Bayesian analysis of a complex and highly stochastic system that is otherwise intractable at present. Modelling the evolutionary history of PIN data, it transpires that only the simultaneous analysis of several global aspects of protein networks enables credible and consistent inference to be made from available datasets.
    Our results indicate that gene duplication has played a larger part in the network evolution of the eukaryote than in the prokaryote, and suggest that single gene duplications with immediate divergence alone may explain more than 60% of biological network data in both domains.
  • Item
    SNPSTR: a database of compound microsatellite-SNP markers
    Agrafioti, I ; Stumpf, MPH (OXFORD UNIV PRESS, 2007-01)
    There has been widespread and growing interest in genetic markers suitable for drawing population genetic inferences about past demographic events and to detect the effects of selection. In addition to single nucleotide polymorphisms (SNPs), microsatellites (or short tandem repeats, STRs) have received great attention in the analysis of human population history. In the SNPSTR database (http://www.imperial.ac.uk/theoreticalgenomics/data-software) we catalogue a relatively new type of compound genetic marker called SNPSTR which combines a microsatellite marker (STR) with one or more tightly linked SNPs. Here, the SNP(s) and the microsatellite are less than 250 bp apart so each SNPSTR can be considered a small haplotype with no recombination occurring between the two individual markers. Thus, SNPSTRs have the potential to become a very useful tool in the field of population genetics. The SNPSTR database contains all inferable human SNPSTRs as well as those in mouse, rat, dog and chicken, i.e. all model organisms for which extensive SNP datasets are available.
  • Item
    Phylogenetic diversity of stress signalling pathways in fungi
    Nikolaou, E ; Agrafioti, I ; Stumpf, M ; Quinn, J ; Stansfield, I ; Brown, AJP (BIOMED CENTRAL LTD, 2009-02-21)
    BACKGROUND: Microbes must sense environmental stresses, transduce these signals and mount protective responses to survive in hostile environments. In this study we have tested the hypothesis that fungal stress signalling pathways have evolved rapidly in a niche-specific fashion that is independent of phylogeny. To test this hypothesis we have compared the conservation of stress signalling molecules in diverse fungal species with their stress resistance. These fungi, which include ascomycetes, basidiomycetes and microsporidia, occupy highly divergent niches from saline environments to plant or mammalian hosts. RESULTS: The fungi displayed significant variation in their resistance to osmotic (NaCl and sorbitol), oxidative (H2O2 and menadione) and cell wall stresses (Calcofluor White and Congo Red). There was no strict correlation between fungal phylogeny and stress resistance. Rather, the human pathogens tended to be more resistant to all three types of stress, an exception being the sensitivity of Candida albicans to the cell wall stress, Calcofluor White. In contrast, the plant pathogens were relatively sensitive to oxidative stress. The degree of conservation of osmotic, oxidative and cell wall stress signalling pathways amongst the eighteen fungal species was examined. Putative orthologues of functionally defined signalling components in Saccharomyces cerevisiae were identified by performing reciprocal BLASTP searches, and the percent amino acid identities of these orthologues recorded. This revealed that in general, central components of the osmotic, oxidative and cell wall stress signalling pathways are relatively well conserved, whereas the sensors lying upstream and transcriptional regulators lying downstream of these modules have diverged significantly. There was no obvious correlation between the degree of conservation of stress signalling pathways and the resistance of a particular fungus to the corresponding stress. 
    CONCLUSION: Our data are consistent with the hypothesis that fungal stress signalling components have undergone rapid recent evolution to tune the stress responses in a niche-specific fashion.
  • Item
    Network motifs: structure does not determine function
    Ingram, PJ ; Stumpf, MPH ; Stark, J (BMC, 2006-05-05)
    BACKGROUND: A number of publications have recently examined the occurrence and properties of the feed-forward motif in a variety of networks, including those that are of interest in genome biology, such as gene networks. The present work looks in some detail at the dynamics of the bi-fan motif, using systems of ordinary differential equations to model the populations of transcription factors, mRNA and protein, with the aim of extending our understanding of what appear to be important building blocks of gene network structure. RESULTS: We develop an ordinary differential equation model of the bi-fan motif and analyse variants of the motif corresponding to its behaviour under various conditions. In particular, we examine the effects of different steady and pulsed inputs to five variants of the bi-fan motif, based on evidence in the literature of bi-fan motifs found in Saccharomyces cerevisiae (commonly known as baker's yeast). Using this model, we characterize the dynamical behaviour of the bi-fan motif for a wide range of biologically plausible parameters and configurations. We find that there is no characteristic behaviour for the motif, and with the correct choice of parameters and of internal structure, very different, indeed even opposite behaviours may be obtained. CONCLUSION: Even with this relatively simple model, the bi-fan motif can exhibit a wide range of dynamical responses. This suggests that it is difficult to gain significant insights into biological function simply by considering the connection architecture of a gene network, or its decomposition into simple structural motifs. It is necessary to supplement such structural information by kinetic parameters, or dynamic time series experimental data, both of which are currently difficult to obtain.
  • Item
    Generating confidence intervals on biological networks
    Thorne, T ; Stumpf, MPH (BMC, 2007-11-30)
    BACKGROUND: In the analysis of networks we frequently require the statistical significance of some network statistic, such as measures of similarity for the properties of interacting nodes. The structure of the network may introduce dependencies among the nodes and it will in general be necessary to account for these dependencies in the statistical analysis. To this end we require some form of Null model of the network: generally rewired replicates of the network are generated which preserve only the degree (number of interactions) of each node. We show that this can fail to capture important features of network structure, and may result in unrealistic significance levels, when potentially confounding additional information is available. METHODS: We present a new network resampling Null model which takes into account the degree sequence as well as available biological annotations. Using gene ontology information as an illustration we show how this information can be accounted for in the resampling approach, and the impact such information has on the assessment of statistical significance of correlations and motif-abundances in the Saccharomyces cerevisiae protein interaction network. An algorithm, GOcardShuffle, is introduced to allow for the efficient construction of an improved Null model for network data. RESULTS: We use the protein interaction network of S. cerevisiae; correlations between the evolutionary rates and expression levels of interacting proteins and their statistical significance were assessed for Null models which condition on different aspects of the available data. The novel GOcardShuffle approach results in a Null model for annotated network data which appears better to describe the properties of real biological networks. 
    CONCLUSION: An improved approach to the statistical analysis of biological network data, which conditions on the available biological information, leads to qualitatively different results compared to approaches which ignore such annotations. In particular we demonstrate that the effects of the biological organization of the network can be sufficient to explain the observed similarity of interacting proteins.
  • Item
    Nonidentifiability of the Source of Intrinsic Noise in Gene Expression from Single-Burst Data
    Ingram, PJ ; Stumpf, MPH ; Stark, J ; Bourne, PE (PUBLIC LIBRARY SCIENCE, 2008-10)
    Over the last few years, experimental data on the fluctuations in gene activity between individual cells and within the same cell over time have confirmed that gene expression is a "noisy" process. This variation is in part due to the small number of molecules taking part in some of the key reactions that are involved in gene expression. One of the consequences of this is that protein production often occurs in bursts, each due to a single promoter or transcription factor binding event. Recently, the distribution of the number of proteins produced in such bursts has been experimentally measured, offering a unique opportunity to study the relative importance of different sources of noise in gene expression. Here, we provide a derivation of the theoretical probability distribution of these bursts for a wide variety of different models of gene expression. We show that there is a good fit between our theoretical distribution and that obtained from two different published experimental datasets. We then prove that, irrespective of the details of the model, the burst size distribution is always geometric and hence determined by a single parameter. Many different combinations of the biochemical rates for the constituent reactions of both transcription and translation will therefore lead to the same experimentally observed burst size distribution. It is thus impossible to identify different sources of fluctuations purely from protein burst size data or to use such data to estimate all of the model parameters. We explore methods of inferring these values when additional types of experimental data are available.
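
The single-parameter claim in the last abstract above can be illustrated with a small sketch. It does not reproduce the paper's derivation; it assumes the simplest possible model, in which one mRNA is either translated (rate `translation_rate`) or degraded (rate `mrna_decay_rate`), so that the number of proteins made before decay is geometric with parameter q = translation_rate / (translation_rate + mrna_decay_rate). All function names and rate values are hypothetical and chosen only for illustration.

```python
import random


def burst_parameter(translation_rate, mrna_decay_rate):
    """Probability that the next event is a translation rather than mRNA decay.

    Assumed toy model: two competing exponential processes acting on one mRNA.
    """
    return translation_rate / (translation_rate + mrna_decay_rate)


def geometric_pmf(n, q):
    """P(burst size = n) for a geometric distribution on n = 0, 1, 2, ..."""
    return (1.0 - q) * q ** n


def simulate_burst(q, rng):
    """Count the proteins produced before the mRNA decays."""
    proteins = 0
    while rng.random() < q:
        proteins += 1
    return proteins


def mean_burst_size(translation_rate, mrna_decay_rate, n_bursts=20000, seed=0):
    """Monte Carlo estimate of the mean burst size for the given rates."""
    rng = random.Random(seed)
    q = burst_parameter(translation_rate, mrna_decay_rate)
    return sum(simulate_burst(q, rng) for _ in range(n_bursts)) / n_bursts


# Two hypothetical rate combinations that differ tenfold in absolute speed
# but share the same ratio, and therefore the same geometric parameter q.
slow_rates = (2.0, 0.5)
fast_rates = (20.0, 5.0)

q_slow = burst_parameter(*slow_rates)
q_fast = burst_parameter(*fast_rates)

# The burst size distributions coincide term by term.
pmf_slow = [geometric_pmf(n, q_slow) for n in range(10)]
pmf_fast = [geometric_pmf(n, q_fast) for n in range(10)]

if __name__ == "__main__":
    print("q (slow rates):", q_slow)
    print("q (fast rates):", q_fast)
    print("simulated mean burst size (slow):", mean_burst_size(*slow_rates))
    print("simulated mean burst size (fast):", mean_burst_size(*fast_rates))
```

Under these assumptions, burst-size data constrain only the ratio of the two rates, never their absolute values, which is a small-scale version of the nonidentifiability the abstract describes.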