Computing and Information Systems - Theses

  • Item
    Investigating the evolution of structural variation in cancer
    Cmero, Marek (2017)
    Cancers arise from single progenitor cells that acquire mutations, eventually dividing into mixed populations with distinct genotypes. These populations can be estimated by identifying common mutational profiles, using computational techniques applied to sequencing data from tumour tissue samples. Existing methods have largely focused on single nucleotide variants (SNVs), despite growing evidence of the importance of structural variants (SVs) as drivers in certain subtypes of cancer. While some approaches use copy-number aberrant SVs, no method has incorporated balanced rearrangements. To address this, I developed SVclone, a Bayesian inference approach for estimating the cancer cell fraction of SVs. I validated SVclone using in silico mixtures of real samples in known proportions and found that clonal deconvolution using SV breakpoints can yield comparable results to SNV-based clustering. I then applied the method to 2,778 whole genomes across 39 distinct tumour types, uncovering a subclonal copy-number neutral rearrangement phenotype associated with decreased overall survival. This clinically relevant finding could not have been found using existing methods. To further expand the methodology, and demonstrate its application to low data quality contexts, I developed a novel statistical approach to test for clonal differences in high-variance, formalin-fixed, paraffin-embedded (FFPE) samples. Together with variant curation strategies to minimise FFPE artefacts, I applied the approach to longitudinal samples from a cohort of neo-adjuvant treated prostate cancer patients to investigate whether clonal differences can be inferred in highly noisy data. This thesis demonstrates that characterising the evolution of structural variation, particularly balanced rearrangements, results in clinically relevant insights.
Identifying the patterns and dynamics of structural variation in the context of tumour evolution will ultimately help improve understanding of common pathways of tumour progression. Through this knowledge, cancers driven by SVs will have clearer prognoses and clinical treatment decisions will ultimately be improved, leading to better patient outcomes.
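For readers unfamiliar with the term, the cancer cell fraction (CCF) mentioned in the abstract above can be illustrated with the commonly used point estimate that rescales a variant allele frequency by tumour purity and local copy number. The sketch below is for intuition only: it is not SVclone's Bayesian model, and the function name, parameter names and example values are assumptions introduced here.

```python
def cancer_cell_fraction(vaf, purity, tumour_cn, multiplicity=1, normal_cn=2):
    """Point estimate of the fraction of cancer cells carrying a variant.

    vaf          -- observed variant allele frequency (0..1)
    purity       -- fraction of cells in the sample that are tumour cells
    tumour_cn    -- total copy number of the locus in tumour cells
    multiplicity -- number of copies carrying the variant per mutated cell
    normal_cn    -- copy number of the locus in normal cells (assumed 2)
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Average number of copies of the locus per cell in the mixed sample.
    avg_copies = purity * tumour_cn + (1 - purity) * normal_cn
    ccf = vaf * avg_copies / (purity * multiplicity)
    # Sampling noise can push the estimate above 1; cap it for readability.
    return min(ccf, 1.0)


# Hypothetical values: a diploid locus in a sample that is 50% tumour.
clonal = cancer_cell_fraction(vaf=0.25, purity=0.5, tumour_cn=2)
subclonal = cancer_cell_fraction(vaf=0.125, purity=0.5, tumour_cn=2)
```

Under these assumed inputs, a variant present in every tumour cell yields an estimate near 1, while one at half that allele frequency yields an estimate near 0.5, which is the kind of signal that clustering approaches use to separate clonal from subclonal populations.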
  • Item
    Effective integration of diverse biological datasets for better understanding of cancer
    Gaire, Raj Kumar (2012)
    Cancer is a disease of malfunctioning cells. Experiments in cancer research now produce a large number of datasets that contain measurements of various aspects of cancer. Similarly, datasets in cellular biology are becoming better organised and increasingly available. Integrating these datasets effectively to understand the mechanisms of cancer is a challenging task. In this research, we develop novel integration methods and apply them to several diverse cancer datasets. Our analysis finds that subtypes of cancers share common features that may help direct cancer biologists towards better cures. As our first contribution, we developed MIRAGAA, a statistical approach to assess the coordinated changes of genome copy numbers and microRNA (miRNA) expression. Genetic diseases like cancer evolve through microevolution, where random lesions that provide the biggest advantage to the disease stand out through their frequent occurrence in multiple samples. At the same time, a gene's function can be changed by aberration of the gene itself or by modification of the expression level of a miRNA that attenuates the gene. In a large number of disease samples, these two mechanisms might be distributed in a coordinated and almost mutually exclusive manner. Understanding this coordination may assist in identifying changes that produce the same functional impact on the cancer phenotype, and in further identifying genes that are universally required for cancer. MIRAGAA has been evaluated on The Cancer Genome Atlas (TCGA) glioblastoma multiforme datasets. In these datasets, a number of genome regions coordinating with different miRNAs are identified. Although well known for their biological significance, these genes and miRNAs would remain undetected, as not significant enough, if the two datasets were analysed individually.
Genes can show significant changes in their expression levels when genetically diseased cells are compared with non-diseased cells. Biological networks are often used to analyse gene expression profiles and identify active subnetworks (ASNs) in diseases. Existing methodologies for discovering ASNs mostly use node-centric approaches and undirected protein-protein interaction (PPI) networks. This can limit their ability to find the most meaningful ASNs. As our second contribution, we developed Bionet, which aims to identify better ASNs by using (i) integrated regulatory networks, (ii) directions of regulations of genes, and (iii) combined node and edge scores. We simplify and extend previous methodologies to incorporate edge evaluations and lessen their sensitivity to significance thresholds. We formulate our objective functions using mixed integer linear programming (MIP) and show that optimal solutions may be obtained. As our third contribution, we integrated and analysed the disease datasets of glioma, glioblastoma and breast cancer with pathways and biological networks. Our analysis of two independent breast cancer datasets finds that the basal subtype of this cancer contains positive feedback loops across seven genes (AR, ESR1, MYC, E2F2, PGR, BCL2 and CCND1) that could potentially explain the aggressive nature of this cancer subtype. A comparison of ASNs from the basal subtype of breast cancer and the mesenchymal subtype of glioblastoma shows that an ASN in the vicinity of IL6 is conserved across the two subtypes. CD44 is found to be the gene most predictive of outcome in both glioblastoma and breast cancer and may be used as a biomarker. Our analysis suggests that cancer subtypes from two different cancers can show molecular similarities that are identifiable by using integrated biological networks.
  • Item
    Detecting gene-cancer associations by analysing gene-expression microarrays
    Shi, Fan (2011)
    Modern bioinformatics studies have shown that many physiological and behavioral characteristics in organisms are influenced by the information encoded in genes. Cancer genomics specifically studies the genetic mechanisms of the formation and progression of cancer in order to improve the diagnosis and prognosis of cancer. An important task in cancer genomics is to detect gene-cancer associations, which is the biological focus of this thesis. DNA microarrays are a high-throughput technique for assaying the expression levels of thousands of genes simultaneously. By comparing the gene expression levels between different cancer types or subtypes, we may discover potential gene-cancer associations. However, the large scale of gene expression microarrays requires effective and efficient computational approaches for their analysis. From a computational perspective, this thesis therefore focuses on developing approaches to detect gene-cancer associations based on gene expression microarrays. We summarise three key problems and the corresponding traditional methods in the analysis of gene expression microarrays. First, statistical tests can be used to identify differentially expressed genes, which show significantly different expression patterns between different cancer types or subtypes. Second, unsupervised clustering methods can be used to identify co-expressed genes, which are groups of differentially expressed genes that show similar expression levels in the same cancer types. Third, classification methods can be used to predict the types or subtypes of cancer based on differentially expressed genes. However, several challenges, such as the high dimensionality of microarrays and the small sample sizes in cancer studies, may lead to low accuracy and efficiency for analyses based on these traditional methods. In this thesis, we have proposed three computational approaches to address the above problems.
First, we have developed a meta-analysis method, called Incomplete Gene Meta-analysis (IGM), to identify differentially expressed genes by integrating multiple studies. The IGM method is able to integrate datasets from different microarray platforms and incorporate the genes that are not measured in all datasets, which we refer to as incomplete genes. In our evaluation, we verify the importance of including incomplete genes, and the experimental results demonstrate that, by imputing the statistical significance of incomplete genes, IGM identifies more significant genes than traditional methods. Second, we have proposed an unsupervised Bi-ordering Analysis (BOA) method to detect local patterns in microarrays, called biclusters, in which a subset of genes is co-expressed under a subset of samples. The BOA method uses an iterative process to identify consistently over- or under-expressed gene groups in specific samples. This approach addresses several challenges for detecting biclusters, including the identification of biologically meaningful patterns, the efficiency of biclustering algorithms and the stability of biclusters. Our statistical assessments demonstrate both the statistical and biological significance of the biclusters produced by our method. Third, we have proposed a method for making multiple predictions with an associated confidence level to classify cancers of unknown primary origin (CUP). This classification method is able to identify a set of the most likely cancer types and assign a confidence level to the predictions for CUP samples. Our method for making multiple predictions takes into account the biological similarity of different cancer types at the gene expression level, and is thus more suitable for classifying multi-class cancer samples than making a single class prediction. Our evaluation verifies the importance of making multiple predictions and validates our method.
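The idea of returning several candidate cancer types with an associated confidence level can be sketched as follows. The abstract does not say how the thesis builds its prediction set or computes its confidence, so this example assumes a simple cumulative-probability rule over the class probabilities of some upstream classifier. The function name, the threshold and the example probabilities are all hypothetical.

```python
def prediction_set(class_probs, confidence=0.9):
    """Return the smallest group of top-ranked classes whose summed
    probability reaches `confidence`, together with that sum.

    class_probs -- dict mapping class label to predicted probability
    confidence  -- target cumulative probability (assumed threshold)
    """
    if not class_probs:
        raise ValueError("class_probs must not be empty")
    # Rank classes from most to least probable.
    ranked = sorted(class_probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for label, prob in ranked:
        chosen.append(label)
        total += prob
        if total >= confidence:
            break
    return chosen, total


# Hypothetical classifier output for one sample of unknown primary origin.
probs = {"lung": 0.55, "breast": 0.30, "colon": 0.10, "ovary": 0.05}
labels, reached = prediction_set(probs, confidence=0.8)
```

With these assumed probabilities and a threshold of 0.8, the set contains the two top-ranked types rather than a single forced label, which reflects the motivation stated in the abstract: biologically similar cancer types can be hard to separate at the gene expression level.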