Computing and Information Systems - Research Publications

Search Results

  • Item
    Benchmarks for measurement of duplicate detection methods in nucleotide databases
    Chen, Q ; Zobel, J ; Verspoor, K (OXFORD UNIV PRESS, 2023-12-18)
    Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark.
For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources. Database URL: https://bitbucket.org/biodbqual/benchmarks.
  • Item
    Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study
    Chen, Q ; Zobel, J ; Verspoor, K (OXFORD UNIV PRESS, 2017-01-10)
    GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions; and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC (a dataset of 67 888 merged groups with 111 823 duplicate pairs across 21 organisms from INSDC databases) in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks.
Our findings lead to a better understanding of the problem of duplication in biological databases. Database URL: the merged records are available at https://cloudstor.aarnet.edu.au/plus/index.php/s/Xef2fvsebBEAv9w.
  • Item
    BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics
    Chen, Q ; Panyam, NC ; Elangovan, A ; Verspoor, K (OXFORD UNIV PRESS, 2018-12-14)
    Precision medicine aims to provide personalized treatments based on individual patient profiles. One critical step towards precision medicine is leveraging knowledge derived from biomedical publications, a tremendous literature resource presenting the latest scientific discoveries on genes, mutations and diseases. Biomedical natural language processing (BioNLP) plays a vital role in supporting automation of this process. BioCreative VI Track 4 brings community effort to the task of automatically identifying and extracting protein-protein interactions (PPI) affected by mutations (PPIm), important in the precision medicine context for capturing individual genotype variation related to disease. We present the READ-BioMed team's approach to identifying PPIm-related publications and to extracting specific PPIm information from those publications in the context of the BioCreative VI PPIm track. We observe that current BioNLP tools are insufficient to recognise entities for these two tasks; the best existing mutation recognition tool achieves only 55% recall in the document triage training set, while relation extraction performance is limited by the low recall performance of gene entity recognition. We develop the models accordingly: for document triage, we develop term lists capturing interactions and mutations to complement BioNLP tools, and select effective features via a feature contribution study, whereas an ensemble of BioNLP tools is employed for relation extraction. Our best document triage model achieves an F-score of 66.77% while our best model for relation extraction achieves an F-score of 35.09% over the final (updated post-task) test set. Impacting the document triage task, the characteristics of mutations are statistically different in the training and testing sets.
While a vital new direction for biomedical text mining research, this early attempt to tackle the problem of identifying genetic variation of substantial biological significance highlights the importance of representative training data and the cascading impact of tool limitations in a modular system.
  • Item
    Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases
    Chen, Q ; Britto, R ; Erill, I ; Jeffery, CJ ; Liberzon, A ; Magrane, M ; Onami, J-I ; Robinson-Rechavi, M ; Sponarova, J ; Zobel, J ; Verspoor, K (Elsevier, 2020-04)
    Biological databases represent an extraordinary collective volume of work. Diligently built up over decades and comprising many millions of contributions from the biomedical research community, biological databases provide worldwide access to a massive number of records (also known as entries) [1]. Starting from individual laboratories, genomes are sequenced, assembled, annotated, and ultimately submitted to primary nucleotide databases such as GenBank [2], European Nucleotide Archive (ENA) [3], and DNA Data Bank of Japan (DDBJ) [4] (collectively known as the International Nucleotide Sequence Database Collaboration, INSDC). Protein records, which are the translations of these nucleotide records, are deposited into central protein databases such as the UniProt KnowledgeBase (UniProtKB) [5] and the Protein Data Bank (PDB) [6]. Sequence records are further accumulated into different databases for more specialized purposes: RFam [7] and PFam [8] for RNA and protein families, respectively; DictyBase [9] and PomBase [10] for model organisms; as well as ArrayExpress [11] and Gene Expression Omnibus (GEO) [12] for gene expression profiles. These databases are selected as examples; the list is not intended to be exhaustive. However, they are representative of biological databases that have been named in the “golden set” of the 24th Nucleic Acids Research database issue (in 2016). The introduction of that issue highlights the databases that “consistently served as authoritative, comprehensive, and convenient data resources widely used by the entire community and offer some lessons on what makes a successful database” [13]. In addition, the associated information about sequences is also propagated into non-sequence databases, such as PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) for scientific literature or Gene Ontology (GO) [14] for function annotations. 
These databases in turn benefit individual studies, many of which use these publicly available records as the basis for their own research.
  • Item
    Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine
    Dogan, RI ; Kim, S ; Chatr-aryamontri, A ; Wei, C-H ; Comeau, DC ; Antunes, R ; Matos, S ; Chen, Q ; Elangovan, A ; Panyam, NC ; Verspoor, K ; Liu, H ; Wang, Y ; Liu, Z ; Altinel, B ; Husunbeyi, ZM ; Ozgur, A ; Fergadis, A ; Wang, C-K ; Dai, H-J ; Tran, T ; Kavuluru, R ; Luo, L ; Steppi, A ; Zhang, J ; Qu, J ; Lu, Z (Oxford University Press, 2019-01-28)
    The Precision Medicine Initiative is a multicenter effort aiming at formulating personalized treatments leveraging on individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of BioCreative VI Precision Medicine Track was to extract this particular type of information and was organized in two tasks: (i) document triage task, focused on identifying scientific literature containing experimentally verified protein–protein interactions (PPIs) affected by genetic mutations and (ii) relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods. 
Given the level of participation and the individual team results we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPI affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
  • Item
    Supervised Learning for Detection of Duplicates in Genomic Sequence Databases
    Chen, Q ; Zobel, J ; Zhang, X ; Verspoor, K ; Robinson-Rechavi, M (PUBLIC LIBRARY SCIENCE, 2016-08-04)
    MOTIVATION: First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. RESULTS: We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All data are available as described in the supplementary material.