Computing and Information Systems - Theses

  • Statistical Approaches for Entity Resolution under Uncertainty
    Marchant, Neil Grant (2020)
    When real-world entities are referenced in data, their identities are often obscured. This presents a problem for data cleaning and integration, as references to an entity may be scattered across multiple records or sources, without a means to identify and consolidate them. Entity resolution (ER; also known as record linkage and deduplication) seeks to address this problem by linking references to the same entity, based on imprecise information. It has diverse applications: from resolving references to individuals in administrative data for public health research, to resolving product listings on the web for a shopping aggregation service. While many methods have been developed to automate the ER process, it can be difficult to guarantee accurate results for a number of reasons, such as poor data quality, heterogeneity across data sources, and lack of ground truth. It is therefore essential to recognise and account for sources of uncertainty throughout the ER process. In this thesis, I explore statistical approaches for managing uncertainty—both in quantifying the uncertainty of ER predictions, and in evaluating the accuracy of ER to high precision. In doing so, I focus on methods that require minimal input from humans as a source of ground truth. This is important, as many ER methods require vast quantities of human-labelled data to achieve sufficient accuracy. In the first part of this thesis, I focus on Bayesian models for ER, owing to their ability to capture uncertainty, and their robustness in settings where labelled training data is limited. I identify scalability as a major obstacle to the use of Bayesian ER models in practice, and propose a suite of methods aimed at improving the scalability of an ER model proposed by Steorts (2015). These methods include an auxiliary variable scheme for probabilistic blocking, a distributed partially-collapsed Gibbs sampler, and fast algorithms for performing Gibbs updates. 
I also propose modelling refinements aimed at improving ER accuracy and reducing sensitivity to hyperparameters. These refinements include the use of Ewens-Pitman random partitions as a prior on the linkage structure, corrections to the logic of the record distortion model, and an additional level of priors to improve flexibility. I then turn to the problem of ER evaluation, which is particularly challenging because coreferent pairs of records (those which refer to the same entity) are extremely rare. As a result, estimates of ER performance typically exhibit high levels of statistical uncertainty, as they are most sensitive to the rare coreferent (and predicted coreferent) pairs of records. To address this challenge, I propose a framework for online supervised evaluation based on adaptive importance sampling. Given a target performance measure and a set of ER systems to evaluate, the framework adaptively selects pairs of records to label so as to approximately minimise statistical uncertainty. Under verifiable conditions on the performance measure and the adaptive policy, I establish strong consistency and a central limit theorem for the resulting performance estimates. Empirical studies demonstrate that the framework can yield dramatic reductions in labelling requirements when estimating ER performance to a fixed precision.
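The importance-sampling idea behind this style of evaluation can be sketched as follows: sample record pairs to label from a proposal distribution that oversamples likely matches, then reweight the observed labels by inverse sampling probabilities to obtain an unbiased estimate of a performance measure such as precision. This is a minimal, static-proposal illustration, not the thesis's adaptive framework; the function name, the score-based proposal, and the smoothing constant are all illustrative assumptions.

```python
import random

def estimate_precision(pairs, scores, predicted, true_label,
                       n_samples=100, seed=0):
    """Importance-sampling estimate of precision (static proposal).

    pairs: list of record-pair ids; scores: match scores in [0, 1] used
    to build the proposal; predicted: set of pairs the ER system links;
    true_label: oracle mapping pair -> 1 (coreferent) / 0, standing in
    for a human labeller.
    """
    # Proposal: oversample pairs with high match scores, plus smoothing
    # so every pair has non-zero probability (needed for unbiasedness).
    weights = [s + 0.05 for s in scores]
    total = sum(weights)
    probs = {p: w / total for p, w in zip(pairs, weights)}

    rng = random.Random(seed)
    sampled = rng.choices(pairs, weights=weights, k=n_samples)

    # Precision = (number of predicted pairs that are truly coreferent)
    # / |predicted|. Estimate the numerator by inverse-probability
    # weighting over the sampled, labelled pairs.
    numer = 0.0
    for p in sampled:
        if p in predicted:
            numer += true_label(p) / probs[p]
    numer /= n_samples
    return numer / len(predicted)
```

Because likely-coreferent pairs are sampled far more often than under uniform sampling, far fewer labels are needed to reach a given variance, which is the intuition behind the labelling savings reported above.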
  • Probabilistic models for aggregating crowdsourced annotations
    Li, Yuan (2019)
    This thesis explores aggregation methods for crowdsourced annotations. Crowdsourcing is a popular means of creating training and evaluation datasets for machine learning, e.g. for computer vision, natural language processing, and information retrieval, at low cost and in a timely manner. However, individual workers cannot be wholly trusted to provide reliable annotations, so items are typically labelled redundantly by multiple workers and the labels aggregated subsequently. Although many aggregation methods have been proposed that assign non-uniform weights to workers while jointly inferring the truth, the simplest aggregation method, majority voting (MV), which grants every worker an equal vote towards consensus, still predominates in practice. To explain the predominance of MV, we conduct an extensive empirical evaluation on 19 datasets and identify two shortcomings that prevent existing methods from being preferred over simple MV in practice. A key finding is that most methods do not significantly outperform MV across all datasets: they may achieve higher mean accuracy than MV, but are also outperformed by MV on several datasets. A secondary shortcoming is that several methods require slow and cumbersome inference, which does not scale to the large datasets that are common in practice. To address these shortcomings, we propose two novel aggregation methods, both of which significantly outperform MV. The first is a Bayesian version of a weighted average model. It learns the unknown voting weights of workers in a principled way, by estimating their posterior, unlike existing weighted average models that rely on heuristic update rules or optimise handcrafted objectives. The second, complementary approach is another Bayesian model that captures the correlations between worker labels, which most existing models ignore or assume do not exist.
Learning these correlations also helps the method achieve the highest mean accuracy among all methods compared in our experiments. When applying aggregation methods in practice, it is typically assumed that the only information available is the worker labels, but in many situations more information exists. For the setting where item content is available, e.g. as feature vectors, we propose a novel model for aggregating binary labels that uses a Boltzmann machine prior to bias similar instances towards sharing the same label. We also show further gains from integrating a proposed active learning heuristic. We then consider a second, related setting where instances are sentences, the task is to annotate which words in each sentence denote a named entity, structured outputs from a few classifiers are given, and the goal is to ensemble those classifiers. We discuss strategies for adapting crowdsourcing aggregation methods to this setting. Finally, we discuss the effect of a few gold labels on truth inference, and approaches for effectively leveraging gold labels.
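Majority voting, as described above, grants each worker one equal vote per item; weighted methods generalise this by scaling each worker's vote by an estimated reliability. A minimal sketch of both, assuming given weights (function names are illustrative; the thesis's Bayesian models learn the weights from the labels rather than taking them as input):

```python
from collections import Counter, defaultdict

def majority_vote(annotations):
    """Aggregate crowd labels by majority vote: each worker casts one
    equal vote per item; ties break on the smallest label.

    annotations: iterable of (worker, item, label) triples.
    Returns {item: consensus_label}.
    """
    votes = defaultdict(Counter)
    for worker, item, label in annotations:
        votes[item][label] += 1
    # Sort candidates by descending vote count, then by label for
    # deterministic tie-breaking.
    return {item: min(c.items(), key=lambda kv: (-kv[1], kv[0]))[0]
            for item, c in votes.items()}

def weighted_vote(annotations, weights):
    """Weighted variant: a worker with higher estimated reliability
    contributes a larger vote. Unknown workers default to weight 1."""
    votes = defaultdict(Counter)
    for worker, item, label in annotations:
        votes[item][label] += weights.get(worker, 1.0)
    return {item: min(c.items(), key=lambda kv: (-kv[1], kv[0]))[0]
            for item, c in votes.items()}
```

For example, with labels `[("w1", "i1", "cat"), ("w2", "i1", "dog"), ("w3", "i1", "dog")]`, MV returns `"dog"` for item `i1`, while `weighted_vote` with `{"w1": 5.0}` returns `"cat"`, since the single reliable worker outweighs the two others.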