Computing and Information Systems - Theses

    Duplication in biological databases: definitions, impacts and methods
    Chen, Qingyu (2017)
    Duplication is a pressing issue in biological databases. This thesis concerns duplication in terms of its definitions (what records are duplicates), impacts (why duplicates are significant) and solutions (how to address duplication). The volume of biological databases is growing at an unprecedented rate, populated by complex records drawn from heterogeneous sources; the huge data volume and the diversity of data types raise concerns about the underlying data quality. A specific challenge is duplication, that is, the presence of redundant or inconsistent records. While existing studies address duplicates, the definitions of duplicates are not clear; a foundational understanding of what records database stakeholders consider to be duplicates is lacking. The impacts of duplication are not clear either; existing studies hold different or even inconsistent views on them. The unclear definitions and impacts of duplication in biological databases further limit the development of related duplicate detection methods. In this work, we refine the definitions of duplication in biological databases through a retrospective analysis of merged groups in primary nucleotide databases, that is, the duplicates identified by record submitters and database staff (or biocurators), to understand what types of duplicates matter to database stakeholders. This reveals two primary representations of duplication in the context of biological databases: entity duplicates, multiple records belonging to the same entities, which particularly impact record submission and curation; and near duplicates (or redundant records), records sharing high similarities, which particularly impact database search. The analysis also reveals different types of duplicate records, showing that database stakeholders are concerned with diverse types of duplicates in practice, whereas previous studies mainly consider records with very high similarities as duplicates.
Following this foundational analysis, we investigate both primary representations. For entity duplicates, we establish three large-scale benchmarks of labelled duplicates from different perspectives (submitter-based, expert curation and automatic curation), assess the effectiveness of an existing method, and develop a new supervised learning method that detects duplicates more precisely than previous approaches. For near duplicates, we assess the effectiveness and efficiency of the best-known clustering-based methods in terms of the diversity of database search results (whether retrieved results are independently informative) and their completeness (whether retrieved results miss potentially important records after de-duplication), and propose suggestions and solutions for more effective biological database search.