Computing and Information Systems - Theses

    Computational methods for understanding the molecular mechanisms governing transcriptional regulation in humans
    Macintyre, Geoffrey John (2011)
    Understanding the molecular mechanisms governing transcriptional regulation is crucial in complex disease etiology. The prohibitive cost of experimentally mapping the complete set of genomic features governing gene regulation in humans means that computational techniques are required to define regions for further experimental exploration. This thesis has therefore developed a suite of computational techniques that assist in understanding aberrant modes of transcriptional regulation in human diseases. Historically, financial and technical limitations have restricted most research on gene regulation to a fraction of the genome made up of regions around transcription start sites (TSSs). This is largely motivated by the observation that most regulatory elements are contained in this fraction in the model organisms frequently used to study transcriptional regulation (such as yeast). In humans, however, there are examples of transcription factors (TFs) bound over one megabase from the genes they regulate. Although such cases are considered rare, the greater complexity of gene regulation in humans compared with model organisms essentially requires that TFs assemble and regulate genes beyond gene promoter regions. As such, we have sought to assess the compromise made by limiting the search to promoter regions in humans, and to characterise the distance around TSSs that would warrant adequate encapsulation of the regulatory elements controlling any given gene. Our results show that in humans, searching for regulatory elements in promoter regions characterises at best 40-60% of the events controlling a gene. Instead, we estimate that searching 100kb upstream and downstream of a TSS is necessary to characterise the majority of regulatory elements controlling any given gene. Using this distance estimate, we assessed a number of disease-associated SNPs with no known function for their potential to exert regulatory effects.
We revealed two strong candidates that impact the regulation of disease-associated genes: a SNP affecting binding of the AhR:Arnt complex located 77kb downstream of the MTHFR gene in schizophrenia, and a SNP creating a REST binding site 25kb upstream of MSMB in prostate cancer. Another important aspect in understanding mechanisms of transcriptional regulation in human disease is to determine the pathways or biological processes which are likely to be perturbed in the disease. Commonly, this is done through manipulation of data resulting from gene expression profiling techniques. Gene expression profiling provides insight into the functions of genes at a molecular level, and clustering of these profiles can facilitate the identification of the underlying biological program driving the genes' co-expression. Standard clustering methods, grouping genes based on similar expression values, can fail to capture weak expression correlations, potentially causing genes in the same biological process to be grouped separately. We therefore developed a novel clustering algorithm that incorporates functional gene information from the Gene Ontology into the clustering process and results in more biologically meaningful clusters. We validated this method using two multi-cancer microarray datasets, showing that our method provides clusters which are more easily interpreted biologically. In addition, we show the potential of such methods for the exploration of cancer etiology. As well as identifying groups of genes which are co-regulated and important for the progression of a disease, we have also developed methods to assist in identification of the regulatory elements controlling these genes. The first such method relies on the output of a genome-wide supervised regulatory element predictor to identify the TFs and sequence features important for successful prediction of these regulatory elements.
The supervised predictor uses genome-wide ChIP-SEQ data to train a classifier based on k-mer features (4 bases long) from a single chromosome to predict ChIP-SEQ binding sites genome-wide. The output of the classifier is a set of weighted k-mer features selected via recursive feature elimination. Our algorithm takes these weighted features as input and combines them based on statistical over-representation in the positive training set to build PWMs representing the sequence features important for prediction of ChIP-SEQ binding sites. In a study of four TFs (STAT1, MYC, JUN and GABP), we show that the most important sequence feature for prediction of TF binding sites is the CpG dinucleotide. We suggest this is likely due to the increased likelihood that CpGs change DNA methylation status, an appealing function for TFBSs. In addition, we show that many of the remaining features represent the canonical binding site for the TF, and PWMs for known partner TFs. These results highlight that binding of TFs in humans is dependent not only on the recognition of the canonical binding site of the TF, but also on its surrounding sequence features such as partner TF binding sites. Following on from the idea of importance of partner TF binding, and using our previous estimate of a 100kb search space around TF target gene TSSs, we developed a novel cis-regulatory module prediction algorithm. The goal of the algorithm is to provide the set of TFs, and their binding locations, that control the regulation of a set of putatively co-regulated genes. Specifically, the algorithm takes as input a set of co-regulated genes and a PWM for a single TF expected to control the regulation of those genes. Using this PWM to seed the cis-regulatory module (CRM) search, the algorithm can accurately predict CRM composition and location whether partner TFs are known or unknown (in the latter case, partner PWMs are predicted de novo).
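As an illustration of the 4-mer features mentioned above, the following is a minimal sketch of how a DNA sequence might be turned into a fixed-length k-mer feature vector. It is not the thesis implementation: the function names, the normalisation by total count, and the choice to ignore reverse complements are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

K = 4  # k-mer length stated in the abstract


def kmer_counts(seq, k=K):
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def feature_vector(seq, k=K):
    """Normalised counts in a fixed order over all 4**k possible k-mers.

    Normalising by the total count is an assumption of this sketch.
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]
```

A vector of this kind, computed for each candidate region, could then be passed to any standard classifier; the weighting and recursive feature elimination described in the abstract are not reproduced here.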
Comparison against existing CRM predictors shows that our algorithm compares favorably when using empirical mapping of two transcription factors, STAT1 and C-JUN, involved in the response of HelaS3 and K562 cells to IFNgamma treatment as a gold standard. We demonstrate that a 'simulated knockout' of C-JUN using the predicted CRMs recapitulates target gene expression in an in vivo mouse C-JUN knockout model. In addition, predicted CRMs were used to determine the effects of partner transcription factors on estrogen receptor-mediated tamoxifen response. We show that tamoxifen preferentially targets estrogen receptors bound downstream of their target genes and that the binding of partner transcription factors with the estrogen receptor provides protection against the effects of tamoxifen. Furthermore, we apply our CRM predictor and 'simulated knockout' analysis to clusters of co-regulated genes identified using our Gene Ontology-based clustering algorithm. The resulting gene regulatory networks not only recapitulate known biology in metastasis and metabolism in cancer, but provide novel insight into how particular TFs control this behaviour. Finally, we present computational approaches for identifying disease-associated SNPs and CNVs that are likely to disrupt the binding of a transcription factor which is critical in the progression of the disease. This is important as determining the functional impact of non-coding disease-associated single nucleotide polymorphisms (SNPs) identified by genome-wide association studies (GWAS) is challenging. Many of these SNPs are likely to be regulatory SNPs (rSNPs): variations which affect the ability of a transcription factor (TF) to bind to DNA. However, experimental procedures for identifying rSNPs are expensive and labour-intensive. Therefore, in silico methods are required for rSNP prediction. By scoring the two alleles of a SNP with a TF position weight matrix (PWM), it can be determined which SNPs are likely rSNPs.
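The idea of scoring two alleles with a PWM can be sketched as follows. This is a generic illustration rather than the method used in the thesis: the log-odds construction, the uniform background of 0.25, the forward-strand-only scan and all function names are assumptions, and the flanking sequence is assumed to be long enough for at least one full-width window.

```python
import math


def log_odds_pwm(freqs, background=0.25):
    """Convert per-position base frequencies into log2-odds scores."""
    return [{b: math.log2(col[b] / background) for b in "ACGT"} for col in freqs]


def window_score(pwm, site):
    """Sum the per-position scores for a site as wide as the PWM."""
    return sum(col[base] for col, base in zip(pwm, site))


def best_score(pwm, seq, snp_pos):
    """Best score over all forward-strand windows that overlap snp_pos."""
    w = len(pwm)
    first = max(0, snp_pos - w + 1)
    last = min(snp_pos, len(seq) - w)
    return max(window_score(pwm, seq[s:s + w]) for s in range(first, last + 1))


def allele_score_difference(pwm, flank_left, ref, alt, flank_right):
    """Reference-allele best score minus alternate-allele best score."""
    pos = len(flank_left)
    ref_seq = flank_left + ref + flank_right
    alt_seq = flank_left + alt + flank_right
    return best_score(pwm, ref_seq, pos) - best_score(pwm, alt_seq, pos)
```

A large positive difference suggests that the alternate allele weakens a predicted site, and a large negative one that it creates or strengthens one; on its own the difference carries no statistical significance.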
However, predictions in this manner are noisy and no method exists that determines the statistical significance of a nucleotide variation on a PWM score. We have designed an algorithm for in silico rSNP detection called is-rSNP. We employ novel convolution methods to determine the complete distributions of PWM scores and ratios between allele scores, facilitating assignment of statistical significance to rSNP effects. We have tested our method on 41 experimentally verified rSNPs, correctly predicting the disrupted TF in 28 cases. We also analysed 146 disease-associated SNPs with no known functional impact in an attempt to identify candidate rSNPs. Of the 11 significantly predicted disrupted TFs, 9 had previous evidence of being associated with the disease in the literature. These results demonstrate that is-rSNP is suitable for high-throughput screening of SNPs for potential regulatory function. It is therefore a useful and important tool in the interpretation of GWAS. As well as SNPs impacting regulatory elements, copy-number variations (CNVs) can also have an effect. By integrating CNV data and expression profiling data across cancer patient cohorts, and TSS-TFBS binding pairs from ChIP-SEQ and expression profiling data, we are able to identify regions where a CNV is present in the TFBS but not in the target gene. Using these regions, we employ trend testing to find individuals with a loss, normal or gain in copy number that corresponds to reduced, normal or increased expression of the target gene, respectively. Of the candidate CNVs identified, we show that the CNVs affecting two target genes, GEMIN4 and DDX9, are directly linked to overall survival in serous cystadenocarcinoma. In summary, this thesis presents a range of computational methods and analyses which improve our understanding of transcriptional regulation and how this affects progression of various human diseases.
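The abstract does not describe how is-rSNP performs its convolutions, so the following is only a textbook-style sketch of the general idea: the complete distribution of PWM scores under an independent background model can be built by convolving the per-position score distributions one column at a time. Integer-valued scores, the background model and the function names are assumptions of this example, and the distribution of ratios between allele scores is not covered.

```python
from collections import defaultdict


def score_distribution(int_pwm, background):
    """Exact distribution of total PWM score under an i.i.d. background.

    int_pwm: list of dicts mapping base -> integer score (assumed discretised).
    background: dict mapping base -> probability, summing to 1.
    """
    dist = {0: 1.0}  # before any column, the score is 0 with certainty
    for col in int_pwm:
        nxt = defaultdict(float)
        # Convolve the running distribution with this column's distribution.
        for score, p in dist.items():
            for base, q in background.items():
                nxt[score + col[base]] += p * q
        dist = dict(nxt)
    return dist


def p_value(dist, observed):
    """Probability of a score at least as high as the observed one."""
    return sum(p for s, p in dist.items() if s >= observed)
```

Because the number of distinct integer scores grows with the score range rather than with the 4^w possible sites, this stays tractable for PWM widths where enumerating every site would not be.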