Computing and Information Systems - Theses

  • Network Architecture for Prediction of Emergence in Complex Biological Systems
    Ghosh Roy, Gourab (2022)
    Emergence of properties at the system level, where these properties are not observed at the level of individual entities, is an important feature of complex systems. Emergent properties of biological systems play critical roles in the functioning of organisms and in disruptions to normal functioning, and are relevant to the treatment of diseases like cancer. Complex biological systems can be modeled by abstractions in the form of molecular networks, such as gene regulatory networks (GRNs) and signaling networks, with nodes representing molecules like genes and edges representing molecular interactions. This thesis explores the use of the architecture of these networks to predict the emergence of system properties. First, to better infer network architecture with aspects useful for predicting emergence, we propose a novel algorithm, Polynomial Lasso Bagging (PoLoBag), for signed GRN inference from gene expression data. The GRN edge signs represent the nature of the regulatory relationships, activating or inhibitory. Our algorithm gives more accurate signed inference than state-of-the-art algorithms, and overcomes their weaknesses by also inferring edge directions and cycles. We also show how combining signed GRN architecture with dynamical information in our proposed dynamical K-core method effectively predicts emergent states of the gene regulatory system. Second, we investigate the existence of the bow-tie architectural organization in the GRNs of species of widely varying complexity. Prior work has shown the existence of this bow-tie feature in the GRNs of only some eukaryotes. Our investigation covers GRNs from prokaryotes to unicellular and multicellular eukaryotes. We find that the observed bow-tie architecture is a characteristic feature of GRNs. Based on differences that we observe in the bow-tie architectures across species, we predict a trend in the emergence of controllability, a dynamical property of the gene regulatory system, with varying species complexity. Third, from input genotype data we predict an emergent phenotype at the organism level: cancer-specific survival risk. We propose a novel Mutated Pathway Visible Neural Network (MPVNN), designed using prior knowledge of signaling network architecture and additional mutation-data-based edge randomization. This randomization models how the known signaling network architecture changes for a particular cancer type, which is not modeled by state-of-the-art visible neural networks. We suggest that MPVNN performs cancer-specific risk prediction better than other similarly sized NN and non-NN survival analysis methods, while also providing reliable interpretations of its predictions. Taken together, these three research contributions make significant advances towards our goal of using molecular network architecture for better prediction of emergence, which can inform treatment decisions, lead to novel therapeutic approaches, and be of value to computational biologists and clinicians.
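As an illustration of the bagged-Lasso idea behind PoLoBag described in the abstract above, the following Python sketch fits Lasso models on polynomial features of candidate regulators over bootstrap samples and reads edge signs off the aggregated linear coefficients. The feature construction, aggregation rule, and all parameter values here are simplifying assumptions for illustration, not the thesis algorithm itself.

```python
# Illustrative sketch of bagged-Lasso signed edge inference, loosely
# following the PoLoBag idea described above (details are assumptions).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

def signed_edges_for_target(X_regulators, y_target, n_bags=50, alpha=0.05, rng=None):
    """Infer signed regulator -> target edge weights by bagging Lasso models
    fit on polynomial features of the regulators' expression profiles."""
    rng = np.random.default_rng(rng)
    n_samples, n_reg = X_regulators.shape
    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X_regulators)   # linear terms come first
    coef_sum = np.zeros(X_poly.shape[1])
    for _ in range(n_bags):
        idx = rng.integers(0, n_samples, n_samples)   # bootstrap sample
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X_poly[idx], y_target[idx])
        coef_sum += model.coef_
    # Average the linear-term coefficients: their signs give activating (+)
    # vs inhibitory (-) edges, their magnitudes give confidence.
    return coef_sum[:n_reg] / n_bags

# Usage: X has shape (samples, regulators); y is the target gene's expression.
X = np.random.default_rng(0).normal(size=(40, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.1 * np.random.default_rng(1).normal(size=40)
print(np.sign(signed_edges_for_target(X, y, rng=0)))
```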
  • Understanding the role of provenance in bioinformatics workflows and enabling interoperable computational analysis sharing
    Khan, Farah Zaib (2018)
    The automation of computational analyses in data-intensive domains such as genomics through scientific workflows is now a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable Automation, Scaling, Adaption and Provenance support (ASAP). Provenance data collection is essential for any computational workflow-centric research to achieve reproducibility and transparency and to support trust in published results. At present, the capture of provenance information across the plethora of workflow management systems and custom software platforms in the bioinformatics domain is not well supported; as a result, numerous challenges surround the effective sharing, publication, understandability, reproducibility and repeatability of scientific workflows. This thesis focuses on providing a unified, interoperable and systematised view of provenance, with specific focus on workflow environments in the bioinformatics domain. We identify and overcome the current disconnect between various workflow systems and their existing provenance representations. Through empirical analysis of complex genomic data analysis workflows using three exemplar workflow systems, we identify the implicit assumptions that arise. These assumptions produce an incomplete view of provenance, resulting in insufficient detail that affects workflow enactment requirements and ultimately the reproducibility of the given analysis. We propose a set of recommendations to mitigate such assumptions and enable workflow systems to document and capture complete provenance information that can subsequently be used to re-enact workflows in other contexts, potentially using other workflow platforms. Based on this empirical case study and a pragmatic analysis of the related literature, we define a hierarchical provenance framework offering "Levels of Provenance and Resource Sharing". Each level of this framework addresses specific provenance recommendations and supports the capture of rich provenance information, with the topmost layer enabling the sharing of comprehensive and executable workflows utilising retrospective provenance. To realise this framework, we leverage community-driven, domain-neutral, platform-independent and open-source standards to implement "CWLProv", a format for the methodical representation of workflow enactment provenance that aggregates resources specific to the given enactment together with the associated workflow configuration settings. We realise CWLProv through the Common Workflow Language (CWL) for workflow definition, utilising Research Objects (ROs) for resource aggregation and the PROV Data Model (PROV-DM) to capture the retrospective provenance information required for subsequent workflow enactments. To demonstrate the applicability of CWLProv, we extend an existing workflow executor (cwltool) to provide a reference implementation that generates metadata- and provenance-rich interoperable workflow-centric ROs. This approach aggregates and preserves the data and methods needed to support the coherent sharing of computational analyses and experiments. An evaluation of CWLProv using real-life bioinformatics pipelines highlights the utility of the approach, demonstrating the interoperability of workflow analyses and the benefits for research reproducibility more generally.
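CWLProv builds on the PROV data model mentioned above. Purely as an illustration of the kind of retrospective provenance statements involved (not CWLProv's actual serialisation format), here is a minimal PROV-DM example using the Python `prov` package; all identifiers are invented.

```python
# Minimal PROV-DM sketch of retrospective provenance for one workflow step,
# using the Python "prov" package. Identifiers are invented for illustration;
# this is not the actual CWLProv serialisation format.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')

# Entities: the input reads and the generated alignment.
reads = doc.entity('ex:sample1.fastq')
bam = doc.entity('ex:sample1.bam')

# Activity: one enactment of an alignment step, with an associated agent.
align = doc.activity('ex:align-step-run1')
engine = doc.agent('ex:workflow-engine')

doc.used(align, reads)                 # the step consumed the reads
doc.wasGeneratedBy(bam, align)         # ...and produced the alignment
doc.wasAssociatedWith(align, engine)   # ...under control of the engine

print(doc.get_provn())                 # human-readable PROV-N view
```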
  • Investigating the evolution of structural variation in cancer
    Cmero, Marek (2017)
    Cancers arise from single progenitor cells that acquire mutations, eventually dividing into mixed populations with distinct genotypes. These populations can be estimated by identifying common mutational profiles, using computational techniques applied to sequencing data from tumour tissue samples. Existing methods have largely focused on single nucleotide variants (SNVs), despite growing evidence of the importance of structural variation (SV) as a driver in certain subtypes of cancer. While some approaches use copy-number aberrant SVs, no method has incorporated balanced rearrangements. To address this, I developed SVclone, a Bayesian inference approach for estimating the cancer cell fraction of SVs. I validated SVclone using in silico mixtures of real samples in known proportions and found that clonal deconvolution using SV breakpoints can yield results comparable to SNV-based clustering. I then applied the method to 2,778 whole genomes across 39 distinct tumour types, uncovering a subclonal copy-number neutral rearrangement phenotype associated with decreased overall survival. This clinically relevant finding could not have been obtained using existing methods. To further expand the methodology, and to demonstrate its application in low-data-quality contexts, I developed a novel statistical approach to test for clonal differences in high-variance, formalin-fixed, paraffin-embedded (FFPE) samples. Together with variant curation strategies to minimise FFPE artefacts, I applied the approach to longitudinal samples from a cohort of neo-adjuvant treated prostate cancer patients to investigate whether clonal differences can be inferred from such noisy data. This thesis demonstrates that characterising the evolution of structural variation, particularly balanced rearrangements, yields clinically relevant insights. Identifying the patterns and dynamics of structural variation in the context of tumour evolution will ultimately help improve our understanding of common pathways of tumour progression. Through this knowledge, cancers driven by SVs will have clearer prognoses, and clinical treatment decisions will ultimately be improved, leading to better patient outcomes.
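To make the clonal-inference idea in the abstract above concrete, the sketch below computes a posterior over a variant's cancer cell fraction (CCF) from its supporting-read counts. It is a deliberately simplified model (binomial likelihood, flat prior, diploid locus), not SVclone's actual likelihood; the function name and parameters are illustrative.

```python
# Sketch of the core clonal-inference idea: a posterior over the cancer cell
# fraction (CCF) of a variant from its supporting-read counts. This is a
# deliberately simplified model, not SVclone's actual likelihood.
import numpy as np
from scipy.stats import binom

def ccf_posterior(var_reads, depth, purity, grid=None):
    """Grid posterior P(ccf | data) assuming a diploid locus, where the
    expected variant allele fraction is purity * ccf / 2."""
    grid = np.linspace(0.01, 1.0, 100) if grid is None else grid
    vaf = purity * grid / 2.0                  # expected VAF per CCF value
    like = binom.pmf(var_reads, depth, vaf)    # binomial read-count likelihood
    post = like / like.sum()                   # flat prior over the grid
    return grid, post

grid, post = ccf_posterior(var_reads=18, depth=100, purity=0.8)
print(f"MAP CCF ~ {grid[post.argmax()]:.2f}")  # point estimate
```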
  • Computational substructure querying and topology prediction of the beta-sheet
    Ho, Hui Kian (2014)
    Studying the three-dimensional structure of proteins is essential to understanding their function and, ultimately, the dysfunction that causes disease. The limitations of experimental protein structure determination present a need for computational approaches to protein structure prediction and analysis. The beta-sheet is a commonly occurring protein substructure that is important to many biological processes and is often implicated in neurological disorders. Targeted experimental studies of beta-sheets are especially difficult due to their general insolubility in isolation. This thesis presents a series of contributions to the computational analysis and prediction of beta-sheet structure, which are useful for knowledge discovery and for directing more detailed experimental work. Approaches for predicting the simplest type of beta-sheet, the beta-hairpin, are first described. Improvements over existing methods are obtained by using the most important beta-hairpin features, identified through systematic feature selection. An examination of the most important features provides a physicochemical basis for their usefulness in beta-hairpin prediction. New methods for the more general problem of beta-sheet topology prediction are then described. Unlike recent methods, ours are independent of multiple sequence alignments (MSAs) and therefore do not rely on the coverage of reference sequence databases or on sequence homology. Our evaluations showed that our methods do not exhibit the same reductions in performance as a state-of-the-art method on sequences with low-quality MSAs. A new method for the indexing and querying of beta-sheet substructures, called BetaSearch, is described. BetaSearch exploits the inherent planar constraints of beta-sheet structure to achieve significant speedups over existing graph indexing and conventional 3D structure search methods. Case studies are presented that demonstrate the potential of this method for the discovery of biologically interesting beta-sheet substructures. Finally, a purpose-built open-source toolkit for generating 2D protein maps is described, which is useful for the coarse-grained analysis and visualisation of 3D protein structures. It can also be used in existing knowledge-discovery pipelines for automated structural analysis and prediction tasks, as a standalone application, or imported into existing experimental applications.
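The planar constraint mentioned in the abstract above is what makes beta-sheet substructure querying tractable: a sheet can be laid out as a 2D grid of residues, so a substructure query reduces to 2D submatrix search. The naive sketch below illustrates only that intuition; BetaSearch's actual index and query algorithm are different, and the grid encoding here is an assumption.

```python
# Naive illustration of the planar intuition behind beta-sheet substructure
# matching: rows are strands, columns are bridge-partner positions, and a
# query is a small grid matched against the sheet grid. Not BetaSearch itself.

def find_submatrix(sheet, query):
    """Return (row, col) positions where the query grid occurs in the sheet
    grid; '.' in the query matches any residue."""
    hits = []
    qr, qc = len(query), len(query[0])
    for r in range(len(sheet) - qr + 1):
        for c in range(len(sheet[0]) - qc + 1):
            if all(query[i][j] in ('.', sheet[r + i][c + j])
                   for i in range(qr) for j in range(qc)):
                hits.append((r, c))
    return hits

# Three-stranded sheet: each row is one strand's residue sequence.
sheet = ["VTIKV",
         "KLVFF",
         "GAIIL"]
print(find_submatrix(sheet, ["VF", "II"]))   # -> [(1, 2)]
```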
  • Rapid de novo methods for genome analysis
    Hall, Ross Stephen (2013)
    Next-generation sequencing methodologies have resulted in an exponential increase in the amount of genomic sequence data available to researchers. Valuable tools for the initial analysis of such data for novel features are de novo techniques: methods that employ a minimum of comparative sequence information from known genomes. In this thesis I describe two heuristic algorithms for the rapid de novo analysis of genomic sequence data. The first algorithm applies multiple fast Fourier transforms mapped to a two-dimensional space. The resulting bitmap clearly illustrates periodic features of a genome, including coding density. The compact representation allows megabase scales of genomic data to be rendered in a single bitmap. The second algorithm, RTASSS (RNA Template Assisted Secondary Structure Search), predicts potential members of RNA gene families that are related by similar secondary structure but not necessarily by conserved sequence. RTASSS has the ability to find candidate structures similar to a given template structure without the use of sequence homology. Both algorithms have linear complexity.
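The periodicity signal behind the FFT bitmap described above is the well-known period-3 spectral peak of protein-coding DNA. The sketch below computes a 1D coding-density track from windowed FFT power at period 3; the window and step sizes are arbitrary choices, and the thesis's mapping of such tracks into a 2D bitmap is not reproduced here.

```python
# Sketch of the period-3 "coding density" signal an FFT-based genome scan
# exploits: protein-coding DNA shows a spectral peak at period 3.
import numpy as np

def period3_power(seq, window=351, step=51):
    """Slide a window along seq; for each window, return the summed FFT
    power at frequency N/3 over the four base-indicator signals."""
    track = []
    for start in range(0, len(seq) - window + 1, step):
        win = seq[start:start + window]
        power = 0.0
        for base in "ACGT":
            x = np.array([1.0 if b == base else 0.0 for b in win])
            spec = np.fft.fft(x)
            power += abs(spec[window // 3]) ** 2   # bin at period 3
        track.append(power)
    return np.array(track)

rng = np.random.default_rng(0)
random_dna = "".join(rng.choice(list("ACGT"), 702))
coding_like = "ATG" * 234                          # strong period-3 signal
print(period3_power(coding_like).mean() > period3_power(random_dna).mean())
```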
  • Effective integration of diverse biological datasets for better understanding of cancer
    Gaire, Raj Kumar (2012)
    Cancer is a disease of malfunctioning cells. Experiments in cancer research now produce large numbers of datasets containing measurements of various aspects of cancer. Similarly, datasets in cellular biology are becoming better organised and increasingly available. Effectively integrating these datasets to understand the mechanisms of cancer is a challenging task. In this research, we develop novel integration methods and apply them to several diverse cancer datasets. Our analysis finds that subtypes of cancers share common features, which may help direct cancer biologists towards better cancer treatments. As our first contribution, we developed MIRAGAA, a statistical approach to assess coordinated changes in genome copy numbers and microRNA (miRNA) expression. Genetic diseases like cancer evolve through microevolution, in which the random lesions that provide the greatest advantage to the disease stand out by occurring frequently across multiple samples. At the same time, a gene's function can be changed either by aberration of the gene itself or by modification of the expression levels of a microRNA that attenuates the gene. Across a large number of disease samples, these two mechanisms might be distributed in a coordinated and almost mutually exclusive manner. Understanding this coordination may assist in identifying changes that produce the same functional impact on the cancer phenotype, and further identify genes that are universally required for cancer. MIRAGAA has been evaluated on The Cancer Genome Atlas (TCGA) Glioblastoma Multiforme datasets. In these datasets, a number of genome regions coordinating with different miRNAs are identified. Although well known for their biological significance, these genes and miRNAs would have been left undetected as insufficiently significant if the two datasets had been analysed individually. Genes can show significant changes in their expression levels when genetically diseased cells are compared with non-diseased cells. Biological networks are often used to analyse genetic expression profiles to identify active subnetworks (ASNs) in diseases. Existing methodologies for discovering ASNs mostly use node-centric approaches and undirected protein-protein interaction (PPI) networks, which can limit their ability to find the most meaningful ASNs. As our second contribution, we developed Bionet, which aims to identify better ASNs by using (i) integrated regulatory networks, (ii) the directions of gene regulation, and (iii) combined node and edge scores. We simplify and extend previous methodologies to incorporate edge evaluations and lessen their sensitivity to significance thresholds. We formulate our objective functions using mixed integer linear programming (MIP) and show that optimal solutions may be obtained. As our third contribution, we integrated and analysed the disease datasets of glioma, glioblastoma and breast cancer together with pathways and biological networks. Our analysis of two independent breast cancer datasets finds that the basal subtype of this cancer contains positive feedback loops across seven genes (AR, ESR1, MYC, E2F2, PGR, BCL2 and CCND1) that could potentially explain the aggressive nature of this subtype. A comparison of the ASNs of the basal subtype of breast cancer and the mesenchymal subtype of glioblastoma shows that an ASN in the vicinity of IL6 is conserved across the two subtypes. CD44 is found to be the strongest outcome-predictor gene in both glioblastoma and breast cancer and may be used as a biomarker. Our analysis suggests that cancer subtypes from two different cancers can show molecular similarities that are identifiable using integrated biological networks.
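The coordination idea behind MIRAGAA described above can be illustrated with a toy test: two alteration mechanisms hitting the same gene should cover many samples while rarely co-occurring. The permutation test below is only an illustration of near-mutual exclusivity under that assumption, not MIRAGAA's actual statistic; the data and function are invented.

```python
# Sketch of the coordination idea behind MIRAGAA: a copy-number change and a
# miRNA expression change hitting the same gene should cover many samples but
# rarely co-occur. Illustrative permutation test, not MIRAGAA's statistic.
import numpy as np

def exclusivity_pvalue(cna_hit, mirna_hit, n_perm=10000, rng=0):
    """P-value that the observed overlap of two binary alteration vectors is
    as small as or smaller than expected by chance (near-mutual exclusivity)."""
    rng = np.random.default_rng(rng)
    observed = int(np.sum(cna_hit & mirna_hit))
    null = 0
    for _ in range(n_perm):
        perm = rng.permutation(mirna_hit)       # shuffle sample labels
        if np.sum(cna_hit & perm) <= observed:
            null += 1
    return null / n_perm

# 20 samples: the two mechanisms cover 14 samples with only one overlap.
cna   = np.array([1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0], dtype=bool)
mirna = np.array([0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0], dtype=bool)
print(exclusivity_pvalue(cna, mirna))
```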
  • Scalable approaches for analysing high density single nucleotide polymorphism array data
    Wong, Gerard Kum Peng (2012)
    Prior to making inferences from the raw data produced by high-density single nucleotide polymorphism (SNP) microarrays, several challenges need to be addressed. First, it is important to limit the impact of noise on microarray measurements while maintaining data integrity. An unexplored aspect of noise is the extent of probeset sequence identity in SNP microarrays. Second, microarray-based datasets often have at least two orders of magnitude more probesets than samples, which poses a challenge for traditional statistical tests. Third, the number of features in each dataset is large even when sample sizes are small, so computationally efficient approaches are required to analyse these datasets. Finally, with the improving resolution of SNP arrays, there is a need to exploit this improvement to identify finer-scaled mutations in the human genome. Most existing approaches deal with these challenges at the individual-sample level and do not look for consensus change across the population to identify sites of DNA mutation. Other approaches artificially smooth or segment the data to obtain uniform segments of copy number change, losing possible fine-scaled copy number changes in the process. Others are based on computationally expensive approaches that do not scale well with array resolution and sample size. Our first contribution is a comprehensive survey of the sequence identity of all probesets for all variants of the Affymetrix Genome-Wide Human SNP array. This survey assesses the target uniqueness of every probeset and provides the basis for a set of gold-standard labels of copy number change between genders. The derived sets of gold-standard labels are a benchmark for assessing the performance of algorithms in detecting recurrent copy number change; this benchmark is used to evaluate our second and third contributions. Our second contribution is a statistical approach called Detecting Recurrent Copy Number Changes Using Rank Order Statistics (DRECS), designed to identify regions of consensus copy number change across multiple samples in SNP array datasets. Through the use of rank-based statistics, DRECS efficiently draws on the statistical power of multiple samples to identify fine-scaled copy number changes, down to the width of a single probe, in a computationally efficient way. Our third contribution is the Sum of Ranks Exact Test (SoRET), a non-parametric extension of DRECS. SoRET addresses SNP datasets with small sample sizes and makes no assumptions about the distribution from which the data was sampled. Its performance in terms of Type I and Type II errors is superior to competing parametric and non-parametric statistical tests at small sample sizes. Our fourth contribution is a feature set reduction approach called FSR. FSR enables existing filter-based feature selection approaches to handle high-dimensional microarray-type datasets by pruning irrelevant and redundant features. A novel scoring measure is developed to assess the strength of each feature in terms of sample class discrimination. FSR uses measures of entropy to efficiently gauge the contribution of higher-order feature patterns, avoiding combinatorial explosion in assessing the utility of features. On our tested datasets, classifiers trained on features selected from FSR-reduced feature sets showed notably better predictive accuracy than classifiers trained on features selected from complete feature sets.
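The rank-based consensus idea described above can be sketched as follows: rank probe intensities within each sample, sum the ranks per probe across samples, and flag probes whose rank sum is extreme under the null that ranks are uniform. The normal approximation below is an illustration of this general approach, not the exact DRECS or SoRET test.

```python
# Sketch of an across-sample rank-sum test for recurrent copy number change.
# Illustrative normal approximation, not the exact DRECS/SoRET statistic.
import numpy as np
from scipy.stats import rankdata, norm

def consensus_change_zscores(intensity):
    """intensity: (n_samples, n_probes) array of copy-number signal.
    Returns a per-probe z-score for the across-sample rank sum."""
    n_samples, n_probes = intensity.shape
    ranks = np.apply_along_axis(rankdata, 1, intensity)  # rank within sample
    rank_sum = ranks.sum(axis=0)
    # Under the null, each rank is uniform on 1..n_probes:
    mean = n_samples * (n_probes + 1) / 2.0
    var = n_samples * (n_probes ** 2 - 1) / 12.0
    return (rank_sum - mean) / np.sqrt(var)

rng = np.random.default_rng(0)
data = rng.normal(2.0, 0.3, size=(10, 200))
data[:, 50] += 1.0                     # a recurrent single-probe gain
z = consensus_change_zscores(data)
print(z.argmax(), norm.sf(z.max()))    # probe 50, very small p-value
```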
  • Compression of large DNA databases
    Kuruppu, Shanika Sewwandini (2012)
    This thesis explores algorithms to efficiently store and access the repetitive DNA sequence collections produced by large-scale genome sequencing projects. First, existing general-purpose and DNA compression algorithms are evaluated for their suitability for compressing large collections of DNA sequences. Two novel algorithms for compressing large collections of DNA sequences are then introduced. The first, COMRAD, is a disk-based dictionary compression algorithm that iteratively detects repeats occurring across multiple sequences and substitutes them with non-terminals. COMRAD shows that repeats can feasibly be detected across multiple sequences for relatively large collections while preserving sequence boundaries. The second, RLZ, compresses highly similar sequence collections using a simple LZ77 parsing of each sequence with respect to a sequence chosen as the reference. RLZ was also extended to conduct non-greedy LZ77 parsing; combined with a few other optimisations, this allows the algorithm to indirectly detect and encode approximate matches. RLZ is memory efficient and is one of the fastest DNA sequence compression algorithms to date, in terms of both compression and decompression speed. Both COMRAD and RLZ can compress sequence collections of fully assembled chromosomes and genomes, as well as sets of contigs and reads. Both algorithms also support individual sequence extraction and random access queries on compressed collections. RLZ was further extended to a full self-index by enabling substring search on compressed collections, with some limitations on the substrings that can be detected. Since the effectiveness of RLZ compression depends on the reference sequence chosen for a collection, techniques for constructing reference sequences that better represent the collection were also explored. The results showed that using a reference sequence constructed from repeats detected by dictionary compressors leads to significant improvements in the compressed sizes produced by RLZ.
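A minimal sketch of the relative-parsing idea behind RLZ described above: each sequence is greedily factored into (position, length) references to substrings of a fixed reference, plus single-character literals for novel bases. Real RLZ implementations use a suffix array or FM-index over the reference; the linear-scan search here is only an illustration.

```python
# Naive sketch of relative Lempel-Ziv (RLZ) parsing against a reference.
# Real RLZ uses a suffix-array/FM-index; this linear scan is illustrative.

def rlz_parse(reference, sequence):
    """Greedy RLZ factorisation of `sequence` against `reference`.
    Returns a list of (ref_pos, length) factors and ('literal', char) pairs."""
    factors, i = [], 0
    while i < len(sequence):
        best_pos, best_len = -1, 0
        length = 1
        # Extend the match one character at a time while it still occurs.
        while i + length <= len(sequence):
            pos = reference.find(sequence[i:i + length])
            if pos < 0:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            factors.append(('literal', sequence[i]))   # base absent from reference
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors

def rlz_decode(reference, factors):
    return "".join(reference[p:p + l] if p != 'literal' else l
                   for p, l in factors)

ref = "ACGTACGTGGTTACGT"
seq = "ACGTGGTTNNACGTACGT"
parsed = rlz_parse(ref, seq)
assert rlz_decode(ref, parsed) == seq
print(parsed)   # [(4, 8), ('literal', 'N'), ('literal', 'N'), (0, 8)]
```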