Computing and Information Systems - Theses

  • Item
    Effective integration of diverse biological datasets for better understanding of cancer
    Gaire, Raj Kumar (2012)
    Cancer is a disease of malfunctioning cells. Experiments in cancer research now produce large numbers of datasets measuring various aspects of the disease, and datasets in cellular biology are becoming better organised and increasingly available. Effectively integrating these datasets to understand the mechanisms of cancer is a challenging task. In this research, we develop novel integration methods and apply them to several diverse cancer datasets. Our analysis finds that subtypes of cancers share common features, which may help direct cancer biologists towards better cures. As our first contribution, we developed MIRAGAA, a statistical approach to assess coordinated changes in genome copy numbers and microRNA (miRNA) expression. Genetic diseases like cancer evolve through microevolution, in which the random lesions that give the disease the greatest advantage stand out through their frequent occurrence across multiple samples. At the same time, a gene's function can be changed either by aberration of the gene itself or by modified expression of a microRNA that attenuates it. Across a large number of disease samples, these two mechanisms might be distributed in a coordinated, almost mutually exclusive manner. Understanding this coordination may assist in identifying changes that produce the same functional impact on the cancer phenotype, and further in identifying genes that are universally required for cancer. MIRAGAA has been evaluated on The Cancer Genome Atlas (TCGA) Glioblastoma Multiforme datasets, in which a number of genome regions coordinating with different miRNAs are identified. Although well known for their biological significance, these genes and miRNAs would have been left undetected as insufficiently significant had the two datasets been analysed individually.
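    The coordinated, almost mutually exclusive pattern described above can be sketched with a toy statistic. This is an illustrative simplification, not the actual MIRAGAA method: given per-sample binary indicators of a copy-number aberration and of an altered miRNA, it reports how many samples carry some event (coverage) and how far the observed co-occurrence falls below the overlap expected under independence (depletion).

    ```python
    # Illustrative sketch (NOT the actual MIRAGAA statistic): score a
    # (gene-region, miRNA) pair for mutual exclusivity across samples.

    def mutual_exclusivity(cna_events, mirna_events):
        """cna_events, mirna_events: per-sample 0/1 indicators, same order."""
        n = len(cna_events)
        both = sum(1 for c, m in zip(cna_events, mirna_events) if c and m)
        either = sum(1 for c, m in zip(cna_events, mirna_events) if c or m)
        # Expected co-occurrence if the two mechanisms were independent.
        expected_both = (sum(cna_events) * sum(mirna_events)) / n
        coverage = either / n              # fraction of samples with some event
        depletion = expected_both - both   # > 0: fewer co-occurrences than chance
        return coverage, depletion

    # Toy cohort of 8 samples: the two mechanisms never hit the same sample,
    # yet together they cover the whole cohort.
    cna   = [1, 0, 1, 0, 1, 0, 0, 1]
    mirna = [0, 1, 0, 1, 0, 1, 1, 0]
    cov, dep = mutual_exclusivity(cna, mirna)
    ```

    In a real analysis the depletion would be assessed for statistical significance (e.g. by permutation), but the pattern being searched for is the same: high coverage with co-occurrence well below chance.
    
    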
    Genes can show significant changes in their expression levels when genetically diseased cells are compared with non-diseased cells. Biological networks are often used to analyse these expression profiles to identify active subnetworks (ASNs) in disease. Existing methodologies for discovering ASNs mostly use node-centric approaches and undirected PPI networks, which can limit their ability to find the most meaningful ASNs. As our second contribution, we developed Bionet, which aims to identify better ASNs by using (i) integrated regulatory networks, (ii) the directions of gene regulation, and (iii) combined node and edge scores. We simplify and extend previous methodologies to incorporate edge evaluations and to lessen their sensitivity to significance thresholds. We formulate our objective functions using mixed integer linear programming (MIP) and show that optimal solutions can be obtained. As our third contribution, we integrated and analysed disease datasets of glioma, glioblastoma and breast cancer together with pathways and biological networks. Our analysis of two independent breast cancer datasets finds that the basal subtype of this cancer contains positive feedback loops across seven genes (AR, ESR1, MYC, E2F2, PGR, BCL2 and CCND1) that could potentially explain the aggressive nature of this cancer subtype. A comparison of ASNs from the basal subtype of breast cancer and the mesenchymal subtype of glioblastoma shows that an ASN in the vicinity of IL6 is conserved across the two subtypes. CD44 is found to be the strongest outcome-predicting gene in both glioblastoma and breast cancer and may be used as a biomarker. Our analysis suggests that subtypes from two different cancers can show molecular similarities that are identifiable using integrated biological networks.
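    The combined node-and-edge objective behind such ASN discovery can be illustrated on a toy graph. The sketch below is not the thesis's actual MIP model: it simply enumerates node subsets and scores each by its node scores plus the scores of the edges it induces, which is feasible only for tiny graphs. A real instance would hand the same objective, plus connectivity constraints, to a MIP solver. The gene names and scores are hypothetical.

    ```python
    from itertools import combinations

    # Toy illustration of a combined node+edge subnetwork objective
    # (a stand-in for the MIP formulation, not the Bionet model itself).

    def best_subnetwork(node_scores, edge_scores):
        """Exhaustively pick the node subset maximising node + induced-edge scores."""
        nodes = list(node_scores)
        best, best_score = frozenset(), 0.0
        for k in range(1, len(nodes) + 1):
            for subset in combinations(nodes, k):
                s = set(subset)
                score = sum(node_scores[v] for v in s)
                # An edge contributes only if both endpoints are selected.
                score += sum(w for (u, v), w in edge_scores.items()
                             if u in s and v in s)
                if score > best_score:
                    best, best_score = frozenset(s), score
        return best, best_score

    # Hypothetical scores: the negatively scored node "X" is worth including
    # only because it connects two high-scoring genes through strong edges.
    node_scores = {"IL6": 2.0, "CD44": 1.5, "X": -0.5, "Y": -2.0}
    edge_scores = {("IL6", "X"): 1.0, ("X", "CD44"): 1.0, ("CD44", "Y"): 0.1}
    sub, score = best_subnetwork(node_scores, edge_scores)
    ```

    This shows why edge scoring matters: a purely node-centric method would discard "X" for its negative score and miss the bridge between IL6 and CD44.
    
    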
  • Item
    Toward semantic interoperability for software systems
    Lister, Kendall (2008)
    “In an ill-structured domain you cannot, by definition, have a pre-compiled schema in your mind for every circumstance and context you may find ... you must be able to flexibly select and arrange knowledge sources to most efficaciously pursue the needs of a given situation.” [57] In order to interact and collaborate effectively, agents, whether human or software, must be able to communicate through common understandings and compatible conceptualisations. Ontological differences, whether arising from pre-existing assumptions or as side-effects of the process of specification, are a fundamental obstacle that must be overcome before communication can occur. Similarly, the integration of information from heterogeneous sources remains an unsolved problem. Efforts have been made to assist integration, through both methods and mechanisms, but automated integration remains an unachieved goal. Communication and information integration are problems of meaning and interaction, that is, of semantic interoperability. This thesis contributes to the study of semantic interoperability by identifying, developing and evaluating three approaches to the integration of information. These approaches have in common that they are lightweight in nature, pragmatic in philosophy and general in application. The first work presented is an effort to integrate a massive, formal ontology and knowledge-base with semi-structured, informal, heterogeneous information sources via a heuristic-driven, adaptable information agent. The goal of this work was to demonstrate a process by which task-specific knowledge can be identified and incorporated into the massive knowledge-base in such a way that it can be generally re-used.
    The practical outcome of this effort was a framework illustrating a feasible approach: it provides the massive knowledge-base with an ontologically sound mechanism for automatically generating task-specific information agents that dynamically retrieve information from semi-structured sources without requiring machine-readable meta-data. The second work presented revives a previously published but neglected algorithm for inferring semantic correspondences between fields of tables from heterogeneous information sources. An adapted form of the algorithm is presented and evaluated, first on relatively simple and consistent data collected from web services in order to verify the original results, and then on poorly structured, messy data collected from web sites in order to explore the limits of the algorithm. The results are presented via standard measures and are accompanied by detailed discussion of the nature of the data encountered, together with an analysis of the strengths and weaknesses of the algorithm and of the ways in which it complements other approaches that have been proposed. Acknowledging the cost and difficulty of integrating semantically incompatible software systems and information sources, the third work presented is a proposal, with a working prototype, for a web site that facilitates the resolution of semantic incompatibilities between software systems prior to deployment. It builds on the commonly accepted software engineering principle that the cost of correcting faults grows exponentially as a project progresses from phase to phase, with post-deployment corrections being significantly more costly than those performed earlier in a project’s life. The barriers to collaboration in software development are identified, and steps are taken to overcome them.
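    The field-correspondence task described above can be sketched with a deliberately simple instance-based matcher. This is an illustrative stand-in, not the algorithm studied in the thesis: it scores each pair of fields from two tables by the Jaccard similarity of their observed value sets and then greedily pairs off the best matches. All table and field names are hypothetical.

    ```python
    # Illustrative sketch of instance-based field matching (a simplified
    # stand-in for the correspondence-inference algorithm, not that algorithm):
    # fields whose observed values overlap heavily are likely to correspond,
    # even when their names differ across sources.

    def jaccard(a, b):
        """Jaccard similarity of two collections of values."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def match_fields(table_a, table_b):
        """table_a, table_b: dicts mapping field name -> list of values."""
        pairs = sorted(((jaccard(va, vb), fa, fb)
                        for fa, va in table_a.items()
                        for fb, vb in table_b.items()), reverse=True)
        used_a, used_b, matches = set(), set(), {}
        for score, fa, fb in pairs:        # greedily take best-scoring pairs
            if score > 0 and fa not in used_a and fb not in used_b:
                matches[fa] = fb
                used_a.add(fa)
                used_b.add(fb)
        return matches

    # Two hypothetical web sources describing flights with different schemas.
    flights = {"origin": ["MEL", "SYD", "MEL"], "carrier": ["QF", "VA", "QF"]}
    routes  = {"airline": ["QF", "VA"], "from": ["MEL", "SYD", "PER"]}
    found = match_fields(flights, routes)
    # pairs "origin" with "from" and "carrier" with "airline"
    ```

    Real data of the kind the thesis evaluates is far messier (inconsistent formatting, partial overlap, shared value domains across unrelated fields), which is exactly where such a naive matcher degrades and more careful scoring is needed.
    
    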
    The system presented draws on the recent successes of social, collaborative on-line projects such as SourceForge, Del.icio.us, digg and Wikipedia, together with a variety of techniques for ontology reconciliation, to provide an environment in which data definitions can be shared, browsed and compared, with recommendations automatically presented to encourage developers to adopt data definitions compatible with previously developed systems. In addition to the experimental works presented, this thesis contributes reflections on the origins of semantic incompatibility, with a particular focus on interaction between software systems, and between software systems and their users, as well as a detailed analysis of the existing body of research into methods and techniques for overcoming these problems.