Computing and Information Systems - Theses

  • Item
    Statistical Approaches for Entity Resolution under Uncertainty
    Marchant, Neil Grant (2020)
    When real-world entities are referenced in data, their identities are often obscured. This presents a problem for data cleaning and integration, as references to an entity may be scattered across multiple records or sources, without a means to identify and consolidate them. Entity resolution (ER; also known as record linkage and deduplication) seeks to address this problem by linking references to the same entity, based on imprecise information. It has diverse applications: from resolving references to individuals in administrative data for public health research, to resolving product listings on the web for a shopping aggregation service. While many methods have been developed to automate the ER process, it can be difficult to guarantee accurate results for a number of reasons, such as poor data quality, heterogeneity across data sources, and lack of ground truth. It is therefore essential to recognise and account for sources of uncertainty throughout the ER process. In this thesis, I explore statistical approaches for managing uncertainty—both in quantifying the uncertainty of ER predictions, and in evaluating the accuracy of ER to high precision. In doing so, I focus on methods that require minimal input from humans as a source of ground truth. This is important, as many ER methods require vast quantities of human-labelled data to achieve sufficient accuracy. In the first part of this thesis, I focus on Bayesian models for ER, owing to their ability to capture uncertainty, and their robustness in settings where labelled training data is limited. I identify scalability as a major obstacle to the use of Bayesian ER models in practice, and propose a suite of methods aimed at improving the scalability of an ER model proposed by Steorts (2015). These methods include an auxiliary variable scheme for probabilistic blocking, a distributed partially-collapsed Gibbs sampler, and fast algorithms for performing Gibbs updates. 
I also propose modelling refinements, aimed at improving ER accuracy and reducing sensitivity to hyperparameters. These refinements include the use of Ewens-Pitman random partitions as a prior on the linkage structure, corrections to logic in the record distortion model and an additional level of priors to improve flexibility. I then turn to the problem of ER evaluation, which is particularly challenging due to the fact that coreferent pairs of records (which refer to the same entity) are extremely rare. As a result, estimates of ER performance typically exhibit high levels of statistical uncertainty, as they are most sensitive to the rare coreferent (and predicted coreferent) pairs of records. In order to address this challenge, I propose a framework for online supervised evaluation based on adaptive importance sampling. Given a target performance measure and set of ER systems to evaluate, the framework adaptively selects pairs of records to label in order to approximately minimise statistical uncertainty. Under verifiable conditions on the performance measure and adaptive policy, I establish strong consistency and a central limit theorem for the resulting performance estimates. I conduct empirical studies, which demonstrate that the framework can yield dramatic reductions in labelling requirements when estimating ER performance to a fixed precision.
  • Item
    Effective integration of diverse biological datasets for better understanding of cancer
    Gaire, Raj Kumar (2012)
    Cancer is a disease of malfunctioning cells. Experiments in cancer research now produce a large number of datasets that contain measurements of various aspects of cancer. Similarly, datasets in cellular biology are becoming better organised and increasingly available. Integrating these datasets effectively to understand the mechanisms of cancer is a challenging task. In this research, we develop novel integration methods and apply them to several diverse cancer datasets. Our analysis finds that subtypes of cancers share common features that may help direct cancer biologists to find better cures for cancers. As our first contribution, we developed MIRAGAA, a statistical approach to assess the coordinated changes of genome copy numbers and microRNA (miRNA) expression. Genetic diseases like cancer evolve through microevolution, in which random lesions that provide the biggest advantage to the disease can stand out through their frequent occurrence in multiple samples. At the same time, a gene's function can be changed either by aberration of the corresponding gene or by modification of the expression level of a microRNA that attenuates the gene. In a large number of disease samples, these two mechanisms might be distributed in a coordinated and almost mutually exclusive manner. Understanding this coordination may assist in identifying changes that significantly produce the same functional impact on the cancer phenotype, and further identify genes that are universally required for cancer. MIRAGAA has been evaluated on The Cancer Genome Atlas (TCGA) Glioblastoma Multiforme datasets. In these datasets, a number of genome regions coordinating with different miRNAs are identified. Although well known for their biological significance, these genes and miRNAs would go undetected, because they are not significant enough, if the two datasets were analysed individually.
Genes can show significant changes in their expression levels when genetically diseased cells are compared with non-diseased cells. Biological networks are often used to analyse genetic expression profiles and identify active subnetworks (ASNs) in the diseases. Existing methodologies for discovering ASNs mostly use node-centric approaches and undirected PPI networks. This can limit their ability to find the most meaningful ASNs. As our second contribution, we developed Bionet, which aims to identify better ASNs by using (i) integrated regulatory networks, (ii) directions of regulations of genes, and (iii) combined node and edge scores. We simplify and extend previous methodologies to incorporate edge evaluations and lessen their sensitivity to significance thresholds. We formulate our objective functions using mixed integer linear programming (MIP) and show that optimal solutions may be obtained. As our third contribution, we integrated and analysed the disease datasets of glioma, glioblastoma and breast cancer with pathways and biological networks. Our analysis of two independent breast cancer datasets finds that the basal subtype of this cancer contains positive feedback loops across seven genes (AR, ESR1, MYC, E2F2, PGR, BCL2 and CCND1) that could potentially explain the aggressive nature of this cancer subtype. A comparison of the ASNs of the basal subtype of breast cancer and the mesenchymal subtype of glioblastoma shows that an ASN in the vicinity of IL6 is conserved across the two subtypes. CD44 is found to be the gene most predictive of outcome in both glioblastoma and breast cancer and may be used as a biomarker. Our analysis suggests that cancer subtypes from two different cancers can show molecular similarities that are identifiable by using integrated biological networks.