Computing and Information Systems - Theses

Permanent URI for this collection

Search Results

Now showing 1 - 8 of 8
  • Item
    Thumbnail Image
    Investigating the evolution of structural variation in cancer
    Cmero, Marek ( 2017)
    Cancers arise from single progenitor cells that acquire mutations, eventually dividing into mixed populations with distinct genotypes. These populations can be estimated by identifying common mutational profiles, using computational techniques applied to sequencing data from tumour tissue samples. Existing methods have largely focused on single nucleotide variants (SNVs), despite growing evidence of the importance of structural variation (SV) as drivers in certain subtypes of cancer. While some approaches use copy-number aberrant SVs, no method has incorporated balanced rearrangements. To address this, I developed a Bayesian inference approach for estimating SV cancer cell fraction called SVclone. I validated SVclone using in silico mixtures of real samples in known proportions and found that clonal deconvolution using SV breakpoints can yield comparable results to SNV-based clustering. I then applied the method to 2,778 whole-genomes across 39 distinct tumour types, uncovering a subclonal copy-number neutral rearrangement phenotype with decreased overall survival. This clinically relevant finding could not have been found using existing methods. To further expand the methodology, and demonstrate its application to low data quality contexts, I developed a novel statistical approach to test for clonal differences in high-variance, formalin-fixed, paraffin-embedded (FFPE) samples. Together with variant curation strategies to minimise FFPE artefact, I applied the approach to longitudinal samples from a cohort of neo-adjuvant treated prostate cancer patients to investigate whether clonal differences can be inferred in highly noisy data. This thesis demonstrates that characterising the evolution of structural variation, particularly balanced rearrangements, results in clinically relevant insights. Identifying the patterns and dynamics of structural variation in the context of tumour evolution will ultimately help improve understanding of common pathways of tumour progression. Through this knowledge, cancers driven by SVs will have clearer prognoses and clinical treatment decisions will ultimately be improved, leading to better patient outcomes.
  • Item
    Thumbnail Image
    Computational substructure querying and topology prediction of the beta-sheet
    Ho, Hui Kian ( 2014)
    Studying the three-dimensional structure of proteins is essential to understanding their function, and ultimately, their dysfunction that causes disease. The limitations of experimental protein structure determination presents a need for computational approaches to protein structure prediction and analysis. The beta-sheet is a commonly occurring protein substructure important to many biological processes and are often implicated in neurological disorders. Targeted experimental studies of beta-sheets are especially difficult due to their general insolubility in isolation. This thesis presents a series of contributions to the computational analysis and prediction of beta-sheet structure, which are useful for knowledge discovery and for directing more detailed experimental work. Approaches for predicting the simplest type of beta-sheet, the beta-hairpin, are first described. Improvements over existing methods are obtained by using the most important beta-hairpin features identified through systematic feature selection. An examination of the most important features provides a physiochemical basis of their usefulness in beta-hairpin prediction. New methods for the more general problem of beta-sheet topology prediction are described. Unlike recent methods, ours are independent of multiple sequence alignment (MSAs) and therefore do not rely on the coverage of reference sequence databases or sequence homology. Our evaluations showed that our methods do not exhibit the same reductions in performance as a state-of-the-art method for sequences with low quality MSAs. A new method for the indexing and querying of beta-sheet substructures, called BetaSearch, is described. BetaSearch exploits the inherent planar constraints of beta-sheet structure to achieve significant speedups over existing graph indexing and conventional 3D structure search methods. Case studies are presented that demonstrate the potential of this method for the discovery of biologically interesting beta-sheet substructures. Finally, a purpose-built open source toolkit for generating 2D protein maps is described, which is useful for the coarse-grained analysis and visualisation of 3D protein structures. It can also be used in existing knowledge discovery pipelines for automated structural analysis and prediction tasks, as a standalone application, or imported into existing experimental applications.
  • Item
    Thumbnail Image
    Effective integration of diverse biological datasets for better understanding of cancer
    Gaire, Raj Kumar ( 2012)
    Cancer is a disease of malfunctioning cells. Nowadays, experiments in cancer research have been producing a large number of datasets that contain measurements of various aspects of cancer. Similarly, datasets in cellular biology are becoming better organised and increasingly available. An effective integration of these datasets to understand the mechanisms of cancers is a challenging task. In this research, we develop novel integration methods and apply them to some diverse datasets of cancer. Our analysis finds that subtypes of cancers share common features that may be useful to direct cancer biologists to find better cure of cancers. As our first contribution, we developed MIRAGAA, a statistical approach to assess the coordinated changes of genome copy numbers and microRNA (miRNA) expression. Genetic diseases like cancer evolve through microevolution where random lesions that provide the biggest advantage to the diseases can stand out in their frequent occurrence in multiple samples. At the same time, a gene function can be changed by aberration of the corresponding gene or modification of expression levels of microRNA which attenuates the gene. In a large number of disease samples, these two mechanisms might be distributed in a coordinated and almost mutually exclusive manner. Understanding this coordination may assist in identifying changes which significantly produce the same functional impact on cancer phenotype, and further identify genes that are universally required for cancer. MIRAGAA has been evaluated on the cancer genome atlas (TCGA) Glioblastoma Multiforme datasets. In these datasets, a number of genome regions coordinating with different miRNAs are identified. Although well known for their biological significance, these genes and miRNAs would be left undetected for being not significant enough if the two datasets were analysed individually. Genes can show significant changes in their expression levels when genetically diseased cells are compared with non-diseased cells. Biological networks are often used to analyse the genetic expression profiles to identify active subnetworks (ASNs) in the diseases. Existing methodologies for discovering ASNs mostly use node centric approaches and undirected PPI networks. This can limit their ability to find the most meaningful ASNs. As our second contribution, we developed Bionet which aims to identify better ASNs by using (i) integrated regulatory networks, (ii) directions of regulations of genes, and (iii) combined node and edge scores. We simplify and extend previous methodologies to incorporate edge evaluations and lessen their sensitivity to significance thresholds. We formulate our objective functions using mixed integer linear programming (MIP) and show that optimal solutions may be obtained. As our third contribution, we integrated and analysed the disease datasets of glioma, glioblastoma and breast cancer with pathways and biological networks. Our analysis of two independent breast cancer datasets finds that the basal subtype of this cancer contains positive feedback loops across 7 genes, AR, ESR1, MYC, E2F2, PGR, BCL2 and CCND1 that could potentially explain the aggressive nature of this cancer subtype. A comparison of the basal subtype of breast cancer and the mesenchymal subtype of glioblastoma ASNs shows that an ASN in the vicinity of IL6 is conserved across the two subtypes. CD44 is found to be the most outcome predictor gene in both glioblastoma and breast cancer and may be used as biomarker. Our analysis suggests that cancer subtypes from two different cancers can show molecular similarities that are identifiable by using integrated biological networks.
  • Item
    Thumbnail Image
    Scalable approaches for analysing high density single nucleotide polymorphism array data
    Wong, Gerard Kum Peng ( 2012)
    Prior to making inferences from the raw data produced by these microarrays, several challenges need to be addressed. First, it is important to limit the impact of noise on microarray measurements while maintaining data integrity. An unexplored aspect of noise is the extent of probeset sequence identity in SNP microarrays. Second, microarray-based datasets often have at least two orders of magnitude more probesets than the number of samples they describe. This poses a challenge for traditional statistical tests when used in this context. Third, the number of features in each dataset is large even when sample sizes are small, thus computationally efficient approaches are required to analyse these datasets. Finally, with improving resolution of SNP arrays, there is a need to exploit this improvement in resolution to identify finer-scaled mutations in the human genome. Most existing approaches deal with these challenges at an individual sample level and do not look for consensus change across the population to identify sites of DNA mutation. Other approaches artificially smooth or segment the data to obtain uniform segments of copy number change, and lose possible fine-scaled copy number changes in the process. Others are based on computationally expensive approaches that do not scale well to array resolution and sample size. Our first contribution is a comprehensive survey of the sequence identity of all probesets for all variants of the Affymetrix Genome-Wide Human SNP array. This survey assesses the target uniqueness of every probeset and provides a basis for the development of a set of gold standard labels of copy number change between genders. The derived sets of gold standard labels are a benchmark for assessing the performance of various algorithms in detecting recurrent copy number change. This benchmark is utilised in the evaluation of our second and third contribution. Our second contribution is a statistical approach called Detecting Recurrent Copy Number Changes Using Rank Order Statistics (DRECS), which is designed to identify regions of consensus copy number change across multiple samples in SNP array datasets. Through the use of rank-based statistics, DRECS efficiently draws on the statistical power of using multiple samples to identify fine-scaled copy number changes down to the width of a single probe in a computationally efficient way. Our third contribution is called Sum of Ranks Exact Test (SoRET), a non-parametric extension of DRECS. SoRET addresses SNP datasets with small sample sizes and makes no assumptions about the distribution from which the data was sampled. Its performance in terms of Type I and Type II errors is superior to competitive parametric and non-parametric statistical tests for small sample sizes. Our fourth contribution is a feature set reduction approach called FSR. FSR enables existing filter-based feature selection approaches to handle large dimensional microarray-type datasets by pruning irrelevant and redundant features. A novel scoring measure is developed to assess the strength of each feature in terms of sample class discrimination. FSR uses measures of entropy to efficiently gauge the contribution of higher order feature patterns to avoid combinatorial explosions in assessing the utility of features. In our tested datasets, classifiers trained on features selected from FSR-reduced feature sets have shown notably better predictive accuracy than classifiers trained on features selected from complete feature sets.
  • Item
    Thumbnail Image
    Compression of large DNA databases
    Kuruppu, Shanika Sewwandini ( 2012)
    The thesis explores algorithms to efficiently store and access repetitive DNA sequence collections produced by large-scale genome sequencing projects. First, existing general-purpose and DNA compression algorithms are evaluated for their suitability for compressing large collections of DNA sequences. Then two novel algorithms for compressing large collections of DNA sequences are introduced. The first algorithm is COMRAD, which is a disk-based dictionary compression algorithm that iteratively detects repetitions that occur across multiple sequences, and substitutes them with non-terminals. The method showed that repeats can be feasibly detected across multiple sequences for relatively large collections, while preserving sequence boundaries. The second algorithm is RLZ, which compresses highly similar sequence collections using a simple LZ77 parsing of each sequence with respect to a sequence chosen to be the reference. RLZ was also extended to conduct non-greedy LZ77 parsing, and with the combination of a few other optimisations, the algorithm indirectly detects and encodes approximate matches. RLZ is memory efficient, and is one of the fastest DNA sequence compression algorithms existing to date, both in terms of the compression and decompression speed. Both COMRAD and RLZ can compress sequence collections of fully assembled chromosomes and genomes, as well as sets of contigs and reads. Both algorithms also support individual sequence extraction and random access queries on compressed sets. RLZ was also extended to a full self-index by enabling substring searching on compressed sets with some limitations on detecting substrings. Since the effectiveness of RLZ compression depends on the reference sequence chosen for a collection, techniques for constructing reference sequences that better represent the collection were also explored. The results showed that using a reference sequence constructed with repeats detected by dictionary compressors leads to significant improvements in the compressed sizes produced by RLZ.
  • Item
    Thumbnail Image
    Computational methods for understanding the molecular mechanisms governing transcriptional regulation in humans
    Macintyre, Geoffrey John ( 2011)
    Understanding the molecular mechanisms governing transcriptional regulation is crucial in complex disease etiology. The prohibitive cost of experimentally mapping the complete set of genomic features governing gene regulation in humans means that computational techniques are required to define regions for further experimental exploration. This thesis has therefore developed a suite of computational techniques that assist in understanding aberrant modes of transcriptional regulation in human diseases. Historically, financial and technical limitations have restricted most research on gene regulation to a fraction of the genome made up of regions around transcription start sites (TSSs). This is largely motivated by the observation that most regulatory elements are contained in this fraction in the model organisms frequently used to study transcriptional regulation (such as yeast). In humans however, there are examples of transcription factors (TFs) bound over one megabase from the genes they regulate. Although such cases are considered rare, the increased complexity of gene regulation in humans over model organisms, essentially requires that TFs must assemble and regulate genes beyond gene promoter regions. As such, we have sought to assess the compromise made by limiting search to promoter regions in humans, and characterise the distance around TSSs that would warrant adequate encapsulation of the regulatory elements controlling any given gene. Our results show that in humans, searching for regulatory elements in promoter regions characterises at best 40-60% of the events controlling a gene. Instead, we estimate that search 100kb up and downstream of a TSS is necessary to characterise the majority of regulatory elements controlling any given gene. Using this distance estimate, we assessed a number of disease associated SNPs with no known function, for their potential to exert regulatory effects. We revealed two strong candidates that impact the regulation of disease associated genes: a SNP affecting binding of the AhR:Arnt complex located 77kb downstream of the MTHFR gene in schizophrenia, and a SNP creating a REST binding site 25kb upstream of MSMB in prostate cancer. Another important aspect in understanding mechanisms of transcriptional regulation in human disease is to determine the pathways or biological processes which are likely to be perturbed in the disease. Commonly, this is done through manipulation of data resulting from gene expression profiling techniques. Gene expression profiling provides insight into the functions of genes at a molecular level, and clustering of these profiles can facilitate the identification of the underlying driving biological program causing genes’ co-expression. Standard clustering methods, grouping genes based on similar expression values, can fail to capture weak expression correlations potentially causing genes in the same biological process to be grouped separately. We therefore developed a novel clustering algorithm, which incorporates functional gene information from the Gene Ontology into the clustering process, that results in more biologically meaningful clusters. We validated this method using two multi-cancer microarray datasets, showing that our method provides clusters which are more easily interpreted biologically. In addition, we show the potential of such methods for the exploration of cancer etiology. As well as identifying groups of genes which are co-regulated and important for the progression of a disease, we have also developed methods to assist in identification of the regulatory elements controlling these genes. The first such method, relies on the output of a genome-wide supervised regulatory element predictor, to identify the TFs and sequence features important for successful prediction of these regulatory elements. The supervised predictor uses genome-wide ChIP-SEQ data to train a classifier based on k-mer features (4 bases long) from a single chromosome to predict ChIP-SEQ binding sites genome-wide. The output of the classifier is a set of weighted k-mer features selected via recursive feature elimination. Our algorithm takes these weighted features as input and combines them based on statistical over-representation in the positive training set to build PWMs representing the sequence features important for prediction of ChIP-SEQ binding sites. In a study of four TFs (STAT1, MYC, JUN and GABP), we show that the most important sequence feature for prediction of TF binding sites are CpG dinucleotides. We suggest this is likely due to the increased likelihood that CpGs change DNA methylation status, an appealing function for TFBSs. In addition, we show that many of the remaining features represent the canonical binding site for the TF, and PWMs for known partner TFs. These results highlight that binding of TFs in humans is dependent not only on the recognition of the canonical binding site of the TF, but its surrounding sequence features such as partner TF binding sites. Following on from the idea of importance of partner TF binding, and using our previous estimate of a 100kb search space around TF target gene TSSs, we developed a novel cis-regulatory module prediction algorithm. The goal of the algorithm is to provide the set of TFs, and their binding locations, that control the regulation of set of putatively co-regulated genes. Specifically, the algorithm takes as input, a set of co-regulated genes and a PWM for a single TF expected to control the regulation of those genes. Using this PWM to seed the cis-regulatory module (CRM) search, the algorithm can accurately predict CRM composition and location when partner TFs are known, or unknown (partner PWMs are predicted de novo). Comparison against existing CRM predictors shows that our algorithm compares favorably when using empirical mapping of two transcription factors, STAT1 and C-JUN, involved in the response of HelaS3 and K562 cells to IFNgamma treatment as a gold-standard. We demonstrate that a `simulated knockout' of C-JUN using the predicted CRMs recapitulates target gene expression in an in vivo mouse C-JUN knockout model. In addition, predicted CRMs were used to determine the effects of partner transcription factors on estrogen receptor mediated tamoxifen response. We show that tamoxifen preferentially targets estrogen receptors bound downstream of their target gene and that partner transcription factors binding with estrogen receptor provides protection against the effects of tamoxifen. Furthermore, we apply our CRM predictor and `simulated knockout' analysis to clusters of co-regulated genes identified using our gene ontology based clustering algorithm. The resulting gene regulatory networks not only recapitulate known biology in metastasis and metabolism in cancer, but provide novel insight into how particular TFs control this behaviour. Finally, we present computational approaches for identifying disease associated SNPs and CNVs that are likely to disrupt the binding of a transcription factor which is critical in the progression of the disease. This is important as determining the functional impact of non-coding disease-associated single nucleotide polymorphisms (SNPs) identified by genome-wide association studies (GWAS) is challenging. Many of these SNPs are likely to be regulatory SNPs (rSNPs): variations which affect the ability of a transcription factor (TF) to bind to DNA. However, experimental procedures for identifying rSNPs are expensive and labour intensive. Therefore, in silico methods are required for rSNP prediction. By scoring two alleles with a TF position weight matrix (PWM), it can be determined which SNPs are likely rSNPs. However, predictions in this manner are noisy and no method exists that determines the statistical significance of a nucleotide variation on a PWM score. We have designed an algorithm for in silico rSNP detection called is-rSNP. We employ novel convolution methods to determine the complete distributions of PWM scores and ratios between allele scores, facilitating assignment of statistical significance to rSNP effects. We have tested our method on 41 experimentally verified rSNPs, correctly predicting the disrupted TF in 28 cases. We also analysed 146 disease-associated SNPs with no known functional impact in an attempt to identify candidate rSNPs. Of the 11 significantly predicted disrupted TFs, 9 had previous evidence of being associated with the disease in the literature. These results demonstrate that is-rSNP is suitable for high-throughput screening of SNPs for potential regulatory function. This is a useful and important tool in the interpretation of GWAS. As well as SNPs impacting on regulatory elements, copy-number variations CNVs can also have an effect. By integrating CNV data and expression profiling data across cancer patient cohorts, and TSS-TFBS binding pairs from ChIP-SEQ and expression profiling data, we are able to identify regions where a CNV is present in the TFBS but not in the target gene. Using these regions, we employ trend testing to find individuals with a loss, normal or gain in copy number that corresponds to reduced, normal or increased expression of the target gene, respectively. Of the candidate CNVs identified, we show that the CNVs affecting two target genes, GEMIN4 and DDX9 are directly linked to overall survival in serous cystadenocarcinoma. In summary, this thesis presents a range of computational methods and analyses which improve our understanding of transcriptional regulation and how this affects progression of various human diseases.
  • Item
    Thumbnail Image
    Detecting gene-cancer associations by analysing gene-expression microarrays
    Shi, Fan ( 2011)
    Modern bioinformatics studies have shown that many physiological and behavioral characteristics in organisms are influenced by the information encoded in genes. Cancer genomics is a subject that specifically studies the genetic mechanisms of the formation and progression of cancer in order to improve the diagnosis and prognosis of cancer. An important task in cancer genomics is to detect gene-cancer associations, which is the biological focus of this thesis. DNA microarrays are a high throughput bioinformatics technique to assay the expression levels for thousands of genes simultaneously. By comparing the gene expression levels between different cancer types or subtypes, we may discover potential gene-cancer associations. However, the large scale of gene expression microarrays requires effective and efficient computational approaches for their analysis. Thus, from a computational perspective, we focus on developing computational approaches to detect gene-cancer associations based on gene expression microarrays in this thesis. We summarise three key problems and the corresponding traditional methods in the analysis of gene expression microarrays. First, statistical tests can be used to identify differentially expressed genes, which show significantly different expression patterns between different cancer types or subtypes. Second, unsupervised clustering methods can be used to identify co-expressed genes, which are groups of differentially expressed genes that show similar expression levels in the same cancer types. Third, classification methods can be used to predict the types or subtypes of cancer based on differentially expressed genes. However, several challenges, such as the high dimensionality in microarrays and the small sample sizes in cancer studies, may lead to low accuracy and efficiency problems for analysis based on these traditional methods. In this thesis, we have proposed three computational approaches to address the above problems. First, we have developed a meta-analysis method, called Incomplete Gene Meta-analysis (IGM), to identify differentially expressed genes by integrating multiple studies. The IGM method is able to integrate datasets from different microarray platforms and incorporate the genes that are not measured in all datasets, which we refer to as incomplete genes. In our evaluation, we verify the importance of including incomplete genes, and the experimental results demonstrate that IGM identifies more significant genes by imputing the statistical significance of incomplete genes than traditional methods. Second, we have proposed an unsupervised Bi-ordering Analysis (BOA) method to detect local patterns, where a subset of genes are co-expressed under a subset of samples, called biclusters, in microarrays. The BOA method uses an iterative process to identify consistently over or under-expressed gene groups in specific samples. This approach addresses several challenges for detecting biclusters, including the identification of biologically meaningful patterns, the efficiency of biclustering algorithms and the stability of biclusters. Our statistical assessments demonstrate both the statistical and biological significance of the biclusters produced by our method. Third, we have proposed a method for making multiple predictions with an associated confidence level to classify the cancers of unknown primary origin (CUP). This classification method is able to identify a set of the most likely cancer types and assign a confidence level to the predictions for CUP samples. Our method for making multiple predictions takes into account the biological similarity in different cancer types at a gene expression level, and is thus more suitable for classifying multi-class cancer samples than making a single class prediction. Our evaluation verifies the importance of making multiple predictions and validates our method.
  • Item
    Thumbnail Image
    Algorithms for the study of RNA and protein structure
    Stivala, Alexander David ( 2010)
    The growth in the number of known sequences and structures of RNA and protein molecules has led to the need to solve many computationally demanding problems in the analysis of RNA and protein structure. This thesis describes algorithms for structural comparison of RNA and protein molecules. In the case of proteins, it also describes a technique for automatically generating two-dimensional diagrammatic representations for visual comparison. A general technique for parallelizing dynamic programs in a straightforward way, by means of a shared lock-free hash table implementation and randomization of subproblem ordering is described. This generic approach is applied to several well-known dynamic programs, as well as a dynamic program for structural alignment of RNA molecules by aligning their base pairing probability matrices. Two algorithms for protein structure and substructure searching are described. These algorithms are also capable of finding non-sequential matches, that is, matches between structures where the sequential order of secondary structure elements is not preserved. The first algorithm is based on the relaxation of an earlier quadratic integer problem (QIP) formulation to a quadratic program (QP). The second algorithm uses the same formulation but approximates it using simulated annealing. It is shown that this results in significant increases in speed. This algorithm is also capable of greater accuracy when assessed as a fold recognition method. A parallel implementation of this algorithm on modern graphics processing unit (GPU) hardware is also described. This parallel implementation results in a further significant speedup, and, to the best of our knowledge, is the first use of a GPU for the protein structural search problem. Finally, a system to automatically generate two-dimensional representations of protein structure is described. Such diagrams are particularly useful in analysing complex protein folds. A method for using these diagrams as an interface to the protein substructure search methods is also described.