Computing and Information Systems - Research Publications

  • Item
    The randomized information coefficient: assessing dependencies in noisy data
    Romano, S ; Vinh, NX ; Verspoor, K ; Bailey, J (SPRINGER, 2018-03)
  • Item
    BioC interoperability track overview
    Comeau, DC ; Batista-Navarro, RT ; Dai, H-J ; Dogan, RI ; Yepes, AJ ; Khare, R ; Lu, Z ; Marques, H ; Mattingly, CJ ; Neves, M ; Peng, Y ; Rak, R ; Rinaldi, F ; Tsai, RT-H ; Verspoor, K ; Wiegers, TC ; Wu, CH ; Wilbur, WJ (OXFORD UNIV PRESS, 2014-06-30)
    BioC is a simple new XML format for sharing biomedical text and annotations, together with libraries to read and write that format. This promotes the development of interoperable tools for natural language processing (NLP) of biomedical text. The interoperability track at the BioCreative IV workshop featured contributions using or highlighting the BioC format. These contributions included additional implementations of BioC, many new corpora in the format, biomedical NLP tools consuming and producing the format and online services using the format. The ease of use, broad support and rapidly growing number of tools demonstrate the need for and value of the BioC format. Database URL: http://bioc.sourceforge.net/.
  • Item
    Approximate Subgraph Matching-Based Literature Mining for Biomedical Events and Relations
    Liu, H ; Hunter, L ; Keselj, V ; Verspoor, K ; Smalheiser, NR (PUBLIC LIBRARY SCIENCE, 2013-04-17)
    The biomedical text mining community has focused on developing techniques to automatically extract important relations between biological components and semantic events involving genes or proteins from literature. In this paper, we propose a novel approach for mining relations and events in the biomedical literature using approximate subgraph matching. Extraction of such knowledge is performed by searching for an approximate subgraph isomorphism between key contextual dependencies and input sentence graphs. Our approach significantly increases the chance of retrieving relations or events encoded within complex dependency contexts by introducing error tolerance into the graph matching process, while maintaining the extraction precision at a high level. When evaluated on practical tasks, it achieves a 51.12% F-score in extracting nine types of biological events on the GE task of the BioNLP-ST 2011 and an 84.22% F-score in detecting protein-residue associations. The performance is comparable to the reported systems across these tasks, and thus demonstrates the generalizability of our proposed approach.
  • Item
    BioC: a minimalist approach to interoperability for biomedical text processing
    Comeau, DC ; Dogan, RI ; Ciccarese, P ; Cohen, KB ; Krallinger, M ; Leitner, F ; Lu, Z ; Peng, Y ; Rinaldi, F ; Torii, M ; Valencia, A ; Verspoor, K ; Wiegers, TC ; Wu, CH ; Wilbur, WJ (OXFORD UNIV PRESS, 2013-09-18)
    A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/
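    As a rough illustration of the kind of XML interchange this abstract describes, the sketch below parses a small hand-written document and recovers each annotated span from its offset and length. The sample document, its element and attribute names, and the helper `read_annotations` are assumptions made for this example; they follow the general shape described in the abstract and are not copied from the published BioC DTD or its libraries.

```python
import xml.etree.ElementTree as ET

# Illustrative document only: element names are assumptions for this sketch,
# not a verified copy of the BioC DTD.
SAMPLE = """<collection>
  <source>example</source>
  <document>
    <id>doc1</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 mutations are linked to breast cancer.</text>
      <annotation id="a1">
        <infon key="type">gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
      <annotation id="a2">
        <infon key="type">disease</infon>
        <location offset="30" length="13"/>
        <text>breast cancer</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return (document id, annotation type, covered text) triples."""
    root = ET.fromstring(xml_text)
    found = []
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for passage in doc.iter("passage"):
            # Annotation offsets are taken relative to the document, so
            # subtract the passage offset before slicing the passage text.
            base = int(passage.findtext("offset", default="0"))
            text = passage.findtext("text", default="")
            for ann in passage.iter("annotation"):
                loc = ann.find("location")
                start = int(loc.get("offset")) - base
                length = int(loc.get("length"))
                ann_type = next(
                    (i.text for i in ann.findall("infon") if i.get("key") == "type"),
                    None,
                )
                found.append((doc_id, ann_type, text[start:start + length]))
    return found


for row in read_annotations(SAMPLE):
    print(row)
```

    Storing annotations as stand-off offsets, as assumed here, leaves the original text untouched, so several tools can add their own layers to the same passage.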
  • Item
    Annotating the biomedical literature for the human variome
    Verspoor, K ; Yepes, AJ ; Cavedon, L ; McIntosh, T ; Herten-Crabb, A ; Thomas, Z ; Plazzer, J-P (OXFORD UNIV PRESS, 2013-04-12)
    This article introduces the Variome Annotation Schema, a schema that aims to capture the core concepts and relations relevant to cataloguing and interpreting human genetic variation and its relationship to disease, as described in the published literature. The schema was inspired by the needs of the database curators of the International Society for Gastrointestinal Hereditary Tumours (InSiGHT) database, but is intended to have application to genetic variation information in a range of diseases. The schema has been applied to a small corpus of full text journal publications on the subject of inherited colorectal cancer. We show that the inter-annotator agreement on annotation of this corpus ranges from 0.78 to 0.95 F-score across different entity types when exact matching is measured, and improves to a minimum F-score of 0.87 when boundary matching is relaxed. Relations show more variability in agreement, but several are reliable, with the highest, cohort-has-size, reaching 0.90 F-score. We also explore the relevance of the schema to the InSiGHT database curation process. The schema and the corpus represent an important new resource for the development of text mining solutions that address relationships among patient cohorts, disease and genetic variation, and therefore, we also discuss the role text mining might play in the curation of information related to the human variome. The corpus is available at http://opennicta.com/home/health/variome.
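    The difference between exact and relaxed boundary matching reported in this abstract can be illustrated with a small sketch. It computes an F-score between two annotators' span lists, treating one as the reference. The spans, the labels, and the choice of "any overlap with the same label" as the relaxed criterion are assumptions for illustration; the abstract does not specify the precise relaxation used.

```python
def f_score(gold, predicted, match):
    """F1 between two span lists under the given match predicate."""
    if not gold or not predicted:
        return 0.0
    precision = sum(any(match(p, g) for g in gold) for p in predicted) / len(predicted)
    recall = sum(any(match(g, p) for p in predicted) for g in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact(a, b):
    # Same start, same end, same label.
    return a == b


def overlap(a, b):
    # Relaxed matching (an assumption): same label and overlapping spans.
    # Spans are (start, end, label) with an exclusive end.
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


# Hypothetical annotations of the same text by two annotators.
annotator_a = [(0, 5, "gene"), (10, 18, "mutation"), (30, 43, "disease")]
annotator_b = [(0, 5, "gene"), (10, 16, "mutation"), (50, 55, "gene")]

print(round(f_score(annotator_a, annotator_b, exact), 3))    # 0.333
print(round(f_score(annotator_a, annotator_b, overlap), 3))  # 0.667
```

    In this toy case the two annotators disagree only on where the mutation mention ends, so relaxing the boundary criterion counts it as agreement and raises the score.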
  • Item
    BioLemmatizer: a lemmatization tool for morphological processing of biomedical text
    Liu, H ; Christiansen, T ; Baumgartner, WA ; Verspoor, K (BMC, 2012-04)
    BACKGROUND: The wide variety of morphological variants of domain-specific technical terms contributes to the complexity of performing natural language processing of the scientific literature related to molecular biology. For morphological analysis of these texts, lemmatization has been actively applied in recent biomedical research. RESULTS: In this work, we developed a domain-specific lemmatization tool, BioLemmatizer, for the morphological analysis of biomedical literature. The tool focuses on the inflectional morphology of English and is based on the general English lemmatization tool MorphAdorner. The BioLemmatizer is further tailored to the biological domain through incorporation of several published lexical resources. It retrieves lemmas based on the use of a word lexicon, and defines a set of rules that transform a word to a lemma if it is not encountered in the lexicon. An innovative aspect of the BioLemmatizer is the use of a hierarchical strategy for searching the lexicon, which enables the discovery of the correct lemma even if the input Part-of-Speech information is inaccurate. The BioLemmatizer achieves an accuracy of 97.5% in lemmatizing an evaluation set prepared from the CRAFT corpus, a collection of full-text biomedical articles, and an accuracy of 97.6% on the LLL05 corpus. The contribution of the BioLemmatizer to accuracy improvement of a practical information extraction task is further demonstrated when it is used as a component in a biomedical text mining system. CONCLUSIONS: The BioLemmatizer outperforms other tools when compared with eight existing lemmatizers. The BioLemmatizer is released as open source software and can be downloaded from http://biolemmatizer.sourceforge.net.
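    The lookup order described in this abstract (lexicon first, progressively looser part-of-speech matching, then rules for unseen words) can be sketched as follows. This is a toy illustration only: the lexicon entries, the coarse tag groups, the suffix rules, and the function `lemmatize` are all assumptions made for the example and do not reproduce the actual BioLemmatizer or MorphAdorner resources.

```python
# Hypothetical lexicon keyed by (word, part-of-speech tag).
LEXICON = {
    ("phosphorylated", "VBD"): "phosphorylate",
    ("mice", "NNS"): "mouse",
    ("binding", "VBG"): "bind",
}

# Hypothetical mapping from fine-grained tags to coarse categories.
COARSE = {"VBD": "V", "VBG": "V", "VBZ": "V", "NNS": "N", "NN": "N"}

# Toy fallback rules, tried in order, for words missing from the lexicon.
SUFFIX_RULES = [("ies", "y"), ("s", "")]


def lemmatize(word, pos):
    word = word.lower()
    # Level 1: exact word and tag.
    if (word, pos) in LEXICON:
        return LEXICON[(word, pos)]
    # Level 2: same word, any tag in the same coarse category.
    coarse = COARSE.get(pos)
    for (w, p), lemma in LEXICON.items():
        if w == word and coarse is not None and COARSE.get(p) == coarse:
            return lemma
    # Level 3: same word under any tag, which tolerates a wrong input tag.
    for (w, _), lemma in LEXICON.items():
        if w == word:
            return lemma
    # Level 4: rule-based transformation for out-of-lexicon words.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


print(lemmatize("mice", "VBD"))     # mouse (tag is wrong, level 3 recovers it)
print(lemmatize("kinases", "NNS"))  # kinase (not in lexicon, suffix rule applies)
```

    Falling back through looser tag matches before resorting to rules means a tagging error upstream does not automatically force the less reliable rule-based path.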
  • Item
    A UIMA wrapper for the NCBO annotator
    Roeder, C ; Jonquet, C ; Shah, NH ; Baumgartner, WA ; Verspoor, K ; Hunter, L (OXFORD UNIV PRESS, 2010-07-15)
    SUMMARY: The Unstructured Information Management Architecture (UIMA) framework and web services are emerging as useful tools for integrating biomedical text mining tools. This note describes our work, which wraps the National Center for Biomedical Ontology (NCBO) Annotator, an ontology-based annotation service, to make it available as a component in UIMA workflows. AVAILABILITY: This wrapper is freely available on the web at http://bionlp-uima.sourceforge.net/ as part of the UIMA tools distribution from the Center for Computational Pharmacology (CCP) at the University of Colorado School of Medicine. It has been implemented in Java for support on Mac OS X, Linux and MS Windows.
  • Item
    U-Compare bio-event meta-service: compatible BioNLP event extraction services
    Kano, Y ; Bjoerne, J ; Ginter, F ; Salakoski, T ; Buyko, E ; Hahn, U ; Cohen, KB ; Verspoor, K ; Roeder, C ; Hunter, LE ; Kilicoglu, H ; Bergler, S ; Van Landeghem, S ; Van Parys, T ; Van de Peer, Y ; Miwa, M ; Ananiadou, S ; Neves, M ; Pascual-Montano, A ; Ozgur, A ; Radev, DR ; Riedel, S ; Saetre, R ; Chun, H-W ; Kim, J-D ; Pyysalo, S ; Ohta, T ; Tsujii, J (BMC, 2011-12-18)
    BACKGROUND: Bio-molecular event extraction from literature is recognized as an important task of bio text mining and, as such, many relevant systems have been developed and made available during the last decade. While such systems provide useful services individually, there is a need for a meta-service to enable comparison and ensemble of such services, offering optimal solutions for various purposes. RESULTS: We have integrated nine event extraction systems in the U-Compare framework, making them intercompatible and interoperable with other U-Compare components. The U-Compare event meta-service provides various meta-level features for comparison and ensemble of multiple event extraction systems. Experimental results show that the performance improvements achieved by the ensemble are significant. CONCLUSIONS: While individual event extraction systems themselves provide useful features for bio text mining, the U-Compare meta-service is expected to improve the accessibility to the individual systems, and to enable meta-level uses over multiple event extraction systems such as comparison and ensemble.
  • Item
    A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools
    Verspoor, K ; Cohen, KB ; Lanfranchi, A ; Warner, C ; Johnson, HL ; Roeder, C ; Choi, JD ; Funk, C ; Malenkiy, Y ; Eckert, M ; Xue, N ; Baumgartner, WA ; Bada, M ; Palmer, M ; Hunter, LE (BMC, 2012-08-17)
    BACKGROUND: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. RESULTS: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data. CONCLUSIONS: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.
  • Item
    Representing annotation compositionality and provenance for the Semantic Web
    Livingston, KM ; Bada, M ; Hunter, LE ; Verspoor, K (BMC, 2013-11)
    BACKGROUND: Though the annotation of digital artifacts with metadata has a long history, the bulk of that work focuses on the association of single terms or concepts to single targets. As annotation efforts expand to capture more complex information, annotations will need to be able to refer to knowledge structures formally defined in terms of more atomic knowledge structures. Existing provenance efforts in the Semantic Web domain primarily focus on tracking provenance at the level of whole triples and do not provide enough detail to track how individual triple elements of annotations were derived from triple elements of other annotations. RESULTS: We present a task- and domain-independent ontological model for capturing annotations and their linkage to their denoted knowledge representations, which can be singular concepts or more complex sets of assertions. We have implemented this model as an extension of the Information Artifact Ontology in OWL and made it freely available, and we show how it can be integrated with several prominent annotation and provenance models. We present several application areas for the model, ranging from linguistic annotation of text to the annotation of disease-associations in genome sequences. CONCLUSIONS: With this model, progressively more complex annotations can be composed from other annotations, and the provenance of compositional annotations can be represented at the annotation level or at the level of individual elements of the RDF triples composing the annotations. This in turn allows for progressively richer annotations to be constructed from previous annotation efforts, the precise provenance recording of which facilitates evidence-based inference and error tracking.