Computing and Information Systems - Research Publications

  • Item
    Towards a semantic lexicon for biological language processing
    Verspoor, K (HINDAWI LTD, 2005)
    This paper explores the use of the resources in the National Library of Medicine's Unified Medical Language System (UMLS) for the construction of a lexicon useful for processing texts in the field of molecular biology. A lexicon is constructed from overlapping terms in the UMLS SPECIALIST lexicon and the UMLS Metathesaurus to obtain both morphosyntactic and semantic information for terms, and the coverage of a domain corpus is assessed. Over 77% of tokens in the domain corpus are found in the constructed lexicon, validating the lexicon's coverage of the most frequent terms in the domain and indicating that the constructed lexicon is potentially an important resource for biological text processing.
  • Item
    Protein annotation as term categorization in the gene ontology using word proximity networks
    Verspoor, K ; Cohn, J ; Joslyn, C ; Mniszewski, S ; Rechtsteiner, A ; Rocha, LM ; Simas, T (BMC, 2005-05-24)
    BACKGROUND: We participated in the BioCreAtIvE Task 2, which addressed the annotation of proteins into the Gene Ontology (GO) based on the text of a given document and the selection of evidence text from the document justifying that annotation. We approached the task utilizing several combinations of two distinct methods: an unsupervised algorithm for expanding words associated with GO nodes, and an annotation methodology which treats annotation as categorization of terms from a protein's document neighborhood into the GO. RESULTS: The evaluation results indicate that the method for expanding words associated with GO nodes is quite powerful; we were able to successfully select appropriate evidence text for a given annotation in 38% of Task 2.1 queries by building on this method. The term categorization methodology achieved a precision of 16% for annotation within the correct extended family in Task 2.2, though we show through subsequent analysis that this can be improved with a different parameter setting. Our architecture proved not to be very successful on the evidence text component of the task, in the configuration used to generate the submitted results. CONCLUSION: The initial results show promise for both of the methods we explored, and we are planning to integrate the methods more closely to achieve better results overall.
  • Item
    Data-poor categorization and passage retrieval for gene ontology annotation in Swiss-Prot
    Ehrler, F ; Geissbühler, A ; Jimeno, A ; Ruch, P (BMC, 2005-05-24)
    BACKGROUND: In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet, containing a protein, a GO (Gene Ontology) term and a relevant article, extraction of a short passage that justifies the GO category assignment; 2) given a Swiss-Prot pair, containing a protein and a relevant article, automatic assignment of a set of categories. METHODS: The sentence is the basic retrieval unit. Our classifier computes a distance between each sentence and the GO category provided with the Swiss-Prot entry. The Text Categorizer computes a distance between each GO term and the text of the article. Evaluations are reported both based on annotator judgements as established by the competition and based on mean average precision measures computed using a curated sample of Swiss-Prot. RESULTS: Our system achieved the best recall and precision combination both for passage retrieval and text categorization as evaluated by official evaluators. However, text categorization results were far below those in other data-poor text categorization experiments. The top proposed term is relevant in less than 20% of cases, whereas for categorization with other biomedical controlled vocabularies, such as the Medical Subject Headings, we achieved more than 90% precision. We also observe that the scoring methods used in our experiments, based on the retrieval status value of our engines, exhibit effective confidence estimation capabilities. CONCLUSION: From a comparative perspective, the combination of retrieval and natural language processing methods we designed achieved very competitive performance. Largely data-independent, our systems were no less effective than data-intensive approaches. These results suggest that the overall strategy could benefit a large class of information extraction tasks, especially when training data are missing. However, from a user perspective, results were disappointing. Further investigations are needed to design applicable end-user text mining tools for biologists.
  • Item
    Structuring Documents Efficiently
    MARSHALL, RGJ ; BIRD, SG ; STUCKEY, PJ (University of Sydney, 2005)
  • Item
    A classification-based framework for learning object assembly
    Farmer, R. A. ; Hughes, B. (IEEE Computer Society Press, 2005)
    Relations between learning outcomes and the learning objects which are assembled to facilitate their achievement are the subject of increasingly prevalent investigation, particularly with approaches which advocate the aggregation of learning objects as complex constituencies for achieving learning outcomes. From the perspective of situated learning, we show how the CASE framework imbues learning objects with a closed set of properties which can be classified and aggregated into learning object assemblies in a principled fashion. We argue that the computational and pedagogical tractability of this model provides a new insight into learning object evaluation, and hence learning outcomes.
  • Item
    NICTA i2d2 at GeoCLEF 2005
    HUGHES, BADEN (2005)
    This paper describes the participation of the Interactive Information Discovery and Delivery (i2d2) project of National ICT Australia (NICTA) in the GeoCLEF track of the Cross Language Evaluation Forum 2005. We present some background information about the NICTA i2d2 project to motivate our involvement, and describe our systems and experimental interests. We review the design of our runs and the results of our submitted and subsequent experiments, and contribute a range of suggestions for future instantiations of a geospatial information retrieval track within a shared evaluation task framework.
  • Item
    LPath+: A first-order complete language for linguistic tree query
    Lai, C. ; Bird, S. G. (Academia Sinica, 2005)
    Annotated linguistic databases are widely used in linguistic research and in language technology development. These annotations are typically hierarchical, and represent the nested structure of syntactic and prosodic constituents. Recently, the LPath language has been proposed as a convenient path-based language for querying linguistic trees. We establish the formal expressiveness of LPath relative to the XPath family of languages. We also extend LPath to permit simple closures, resulting in a first-order complete language which we believe is sufficiently expressive for the majority of linguistic tree query needs.
  • Item
    Automatic utterance segmentation in Instant Messaging dialogue
    Ivanovic, Edward (Australasian Language Technology Association, 2005)
    Instant Messaging (IM) chat sessions are real-time, text-based conversations which can be analyzed using dialogue-act models. Dialogue acts represent the semantic information of an utterance; however, messages must be segmented into utterances before classification can take place. We describe and compare two statistical methods for automatic utterance segmentation and dialogue-act classification in task-based IM dialogue. It is shown that IM messages can be automatically segmented and classified to a very high accuracy using statistical machine learning.
  • Item
    A distributed architecture for interactive parse annotation
    HUGHES, BADEN ; Haggerty, James ; Manickam, Saritha ; Nothman, Joel ; Curran, James R. (Australasian Language Technology Association, 2005)
    In this paper we describe a modular system architecture for distributed parse annotation using interactive correction. This involves interactively adding constraints to an existing parse until the returned parse is correct. Using a mixed-initiative approach, human annotators interact live with distributed CCG parser servers through an annotation GUI. The examples presented to each annotator are selected by an active learning framework to maximise the value of the annotated corpus for machine learners. We report on an initial implementation based on a distributed workflow architecture.