    Literature consistency of bioinformatics sequence databases is effective for assessing record quality

    Author
    Bouadjenek, MR; Verspoor, K; Zobel, J
    Date
    2017-03-18
    Source Title
    Database: the journal of biological databases and curation
    Publisher
    Oxford University Press
    University of Melbourne Author/s
    Bouadjenek, Mohamed Reda; Zobel, Justin; Verspoor, Cornelia
    Affiliation
    Chancellery Research
    Computing and Information Systems
    Document Type
    Journal Article
    Citations
    Bouadjenek, M. R., Verspoor, K. & Zobel, J. (2017). Literature consistency of bioinformatics sequence databases is effective for assessing record quality. Database: the journal of biological databases and curation, 2017 (1), https://doi.org/10.1093/database/bax021.
    Access Status
    Open Access
    URI
    http://hdl.handle.net/11343/258423
    DOI
    10.1093/database/bax021
    ARC Grant code
    ARC/DP150101550
    Abstract
    Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues including errors, discrepancies, redundancies, ambiguities, incompleteness and inconsistencies with the published literature. In this work, we seek to investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of records that are inconsistent with the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature and then applying query quality predictors. We then carry out an analysis that shows that the proposed quality indicators and the quality of the records have a mutual relationship, in which one depends on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using principal component analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we show that one record out of four is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records. Database URL: https://github.com/rbouadjenek/DQBioinformatics.
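The idea of treating a record as a query into the literature and scoring it with query quality predictors can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy "literature", the two record texts, the helper names (`tokenize`, `idf_table`, `indicator_vector`) and the three indicators (average IDF, maximum IDF, and term coverage). The abstract does not list the 24 indicators, so this is not the authors' feature set, only an example of the general shape such a vector could take.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def idf_table(documents):
    """Return (smoothed IDF per term, number of documents) for a collection."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(tokenize(doc)):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    idf = {t: math.log((n_docs + 1) / (df + 1)) + 1.0 for t, df in doc_freq.items()}
    return idf, n_docs

def indicator_vector(record_text, idf, n_docs):
    """Hypothetical 3-dimensional indicator vector for one record:
    [average IDF, maximum IDF, fraction of record terms found in the literature]."""
    terms = tokenize(record_text)
    if not terms:
        return [0.0, 0.0, 0.0]
    unseen_idf = math.log(n_docs + 1) + 1.0  # smoothed IDF of a term with df = 0
    scores = [idf.get(t, unseen_idf) for t in terms]
    coverage = sum(1 for t in terms if t in idf) / len(terms)
    return [sum(scores) / len(scores), max(scores), coverage]

# Toy stand-in for the published literature (invented text, not real abstracts).
literature = [
    "complete genome sequence of escherichia coli strain",
    "escherichia coli gene expression under heat stress",
    "genome assembly of a yeast strain",
]
idf, n_docs = idf_table(literature)

# Two invented record descriptions used as queries.
consistent_vec = indicator_vector("escherichia coli genome sequence", idf, n_docs)
suspicious_vec = indicator_vector("zebrafish mitochondrial gene fragment", idf, n_docs)

print("consistent:", consistent_vec)
print("suspicious:", suspicious_vec)
```

In this toy setting, the record whose terms are mostly absent from the literature gets low coverage and a high average IDF, which is the kind of signal a curator might use to flag it for inspection. With a larger vector per record, a dimensionality reduction step such as principal component analysis could then be applied for visualization, as the abstract describes.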
