dc.contributor.author: Chen, Q
dc.contributor.author: Panyam, NC
dc.contributor.author: Elangovan, A
dc.contributor.author: Verspoor, K
dc.date.accessioned: 2020-12-10T01:05:28Z
dc.date.available: 2020-12-10T01:05:28Z
dc.date.issued: 2018-12-14
dc.identifier: pii: 5255181
dc.identifier.citation: Chen, Q., Panyam, N. C., Elangovan, A. & Verspoor, K. (2018). BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION, 2018, https://doi.org/10.1093/database/bay122.
dc.identifier.issn: 1758-0463
dc.identifier.uri: http://hdl.handle.net/11343/253654
dc.description.abstract: Precision medicine aims to provide personalized treatments based on individual patient profiles. One critical step towards precision medicine is leveraging knowledge derived from biomedical publications, a tremendous literature resource presenting the latest scientific discoveries on genes, mutations and diseases. Biomedical natural language processing (BioNLP) plays a vital role in supporting automation of this process. BioCreative VI Track 4 brings community effort to the task of automatically identifying and extracting protein-protein interactions (PPI) affected by mutations (PPIm), important in the precision medicine context for capturing individual genotype variation related to disease. We present the READ-BioMed team's approach to identifying PPIm-related publications and to extracting specific PPIm information from those publications in the context of the BioCreative VI PPIm track. We observe that current BioNLP tools are insufficient to recognise entities for these two tasks; the best existing mutation recognition tool achieves only 55% recall in the document triage training set, while relation extraction performance is limited by the low recall performance of gene entity recognition. We develop the models accordingly: for document triage, we develop term lists capturing interactions and mutations to complement BioNLP tools, and select effective features via a feature contribution study, whereas an ensemble of BioNLP tools is employed for relation extraction. Our best document triage model achieves an F-score of 66.77%, while our best model for relation extraction achieves an F-score of 35.09% over the final (updated post-task) test set. Impacting the document triage task, the characteristics of mutations are statistically different in the training and testing sets. While a vital new direction for biomedical text mining research, this early attempt to tackle the problem of identifying genetic variation of substantial biological significance highlights the importance of representative training data and the cascading impact of tool limitations in a modular system.
dc.language: English
dc.publisher: OXFORD UNIV PRESS
dc.title: BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics
dc.type: Journal Article
dc.identifier.doi: 10.1093/database/bay122
melbourne.affiliation.department: Computing and Information Systems
melbourne.source.title: Database: the journal of biological databases and curation
melbourne.source.volume: 2018
melbourne.identifier.arc: DP150101550
dc.rights.license: CC BY
melbourne.elementsid: 1364002
melbourne.contributor.author: Verspoor, Cornelia
melbourne.contributor.author: Chen, Qingyu
melbourne.contributor.author: Panyam Chandrasekarasastry, Nagesh
melbourne.contributor.author: Elangovan, Aparna
dc.identifier.eissn: 1758-0463
melbourne.identifier.fundernameid: AUST RESEARCH COUNCIL, DP150101550
melbourne.accessrights: Open Access

