Computing and Information Systems - Research Publications

Permanent URI for this collection

Search Results

Now showing 1 - 10 of 328
  • Item
    Stratification bias in low signal microarray studies
    Parker, BJ ; Guenter, S ; Bedo, J (BMC, 2007-09-02)
    BACKGROUND: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated. RESULTS: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice. CONCLUSION: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. 
As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified, balanced versions of stratified k-fold cross-validation and leave-one-out cross-validation avoid the bias. For model selection and evaluation on microarray and other small biological datasets, these methods should therefore be used and the unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
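The pooled-versus-per-fold distinction described in the abstract above can be illustrated with a minimal simulation (hypothetical code, not from the paper): a trivial "prior" classifier that scores every test sample with the training set's positive-class proportion has no discriminative power, so its AUC should be 0.5. Yet pooling unstratified leave-one-out scores drives the estimate to 0, because holding out a positive slightly depletes the positives in its training set, so every held-out positive receives a lower score than every held-out negative.

```python
def auc(scored):
    """Mann-Whitney AUC over a list of (score, label) pairs; ties count 0.5."""
    pos = [s for s, y in scored if y == 1]
    neg = [s for s, y in scored if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0] * 5 + [1] * 5           # balanced 10-sample toy dataset

# Unstratified leave-one-out, with scores pooled across folds.
pooled = []
for i, y in enumerate(labels):
    train = labels[:i] + labels[i + 1:]
    score = sum(train) / len(train)  # "prior" classifier: P(positive) in training set
    pooled.append((score, y))
print(auc(pooled))                   # 0.0: every positive scores 4/9, every negative 5/9

# Stratified 5-fold (one positive and one negative per fold), AUC averaged per fold.
folds = []
for i in range(5):
    test_idx = {i, 5 + i}            # one negative (i) and one positive (5+i)
    train = [labels[j] for j in range(10) if j not in test_idx]
    score = sum(train) / len(train)  # 0.5: stratification preserves class balance
    folds.append([(score, labels[j]) for j in sorted(test_idx)])
print(sum(auc(f) for f in folds) / len(folds))  # 0.5: the unbiased value
```

The pooled estimate collapses to 0 exactly because the stratification errors in training and test sets are negatively correlated, which is the mechanism the paper identifies.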
  • Item
    Towards a semantic lexicon for biological language processing
    Verspoor, K (HINDAWI LTD, 2005)
    This paper explores the use of the resources in the National Library of Medicine's Unified Medical Language System (UMLS) for the construction of a lexicon useful for processing texts in the field of molecular biology. A lexicon is constructed from overlapping terms in the UMLS SPECIALIST lexicon and the UMLS Metathesaurus to obtain both morphosyntactic and semantic information for terms, and the coverage of a domain corpus is assessed. Over 77% of tokens in the domain corpus are found in the constructed lexicon, validating the lexicon's coverage of the most frequent terms in the domain and indicating that the constructed lexicon is potentially an important resource for biological text processing.
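The token-coverage measure used in the study above reduces to a simple membership count. A minimal sketch, with a toy lexicon and corpus standing in for the constructed UMLS lexicon and the domain corpus:

```python
# Toy stand-ins (hypothetical data, not the UMLS resources).
lexicon = {"protein", "kinase", "binds", "the", "a"}
corpus_tokens = "the kinase binds the target protein".split()

covered = sum(tok in lexicon for tok in corpus_tokens)
coverage = covered / len(corpus_tokens)
print(f"{coverage:.0%}")  # 83%: 5 of 6 tokens found in the lexicon
```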
  • Item
    Protein annotation as term categorization in the gene ontology using word proximity networks
    Verspoor, K ; Cohn, J ; Joslyn, C ; Mniszewski, S ; Rechtsteiner, A ; Rocha, LM ; Simas, T (BMC, 2005-05-24)
    BACKGROUND: We participated in the BioCreAtIvE Task 2, which addressed the annotation of proteins into the Gene Ontology (GO) based on the text of a given document and the selection of evidence text from the document justifying that annotation. We approached the task utilizing several combinations of two distinct methods: an unsupervised algorithm for expanding words associated with GO nodes, and an annotation methodology which treats annotation as categorization of terms from a protein's document neighborhood into the GO. RESULTS: The evaluation results indicate that the method for expanding words associated with GO nodes is quite powerful; we were able to successfully select appropriate evidence text for a given annotation in 38% of Task 2.1 queries by building on this method. The term categorization methodology achieved a precision of 16% for annotation within the correct extended family in Task 2.2, though we show through subsequent analysis that this can be improved with a different parameter setting. Our architecture proved not to be very successful on the evidence text component of the task, in the configuration used to generate the submitted results. CONCLUSION: The initial results show promise for both of the methods we explored, and we are planning to integrate the methods more closely to achieve better results overall.
  • Item
    Data-poor categorization and passage retrieval for gene ontology annotation in Swiss-Prot
    Ehrler, F ; Geissbühler, A ; Jimeno, A ; Ruch, P (BMC, 2005-05-24)
    BACKGROUND: In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet containing a protein, a GO (Gene Ontology) term and a relevant article, extraction of a short passage that justifies the GO category assignment; 2) given a Swiss-Prot pair containing a protein and a relevant article, automatic assignment of a set of categories. METHODS: The sentence is the basic retrieval unit. Our classifier computes a distance between each sentence and the GO category provided with the Swiss-Prot entry. The Text Categorizer computes a distance between each GO term and the text of the article. Evaluations are reported both on the basis of annotator judgements, as established by the competition, and on mean average precision measures computed over a curated sample of Swiss-Prot. RESULTS: Our system achieved the best combination of recall and precision for both passage retrieval and text categorization, as judged by the official evaluators. However, text categorization results were far below those of other data-poor text categorization experiments: the top proposed term is relevant in less than 20% of cases, whereas with other biomedical controlled vocabularies, such as the Medical Subject Headings, we achieved more than 90% precision. We also observe that the scoring methods used in our experiments, based on the retrieval status values of our engines, exhibit effective confidence estimation capabilities. CONCLUSION: From a comparative perspective, the combination of retrieval and natural language processing methods we designed achieved very competitive performance. Being largely data-independent, our systems were no less effective than data-intensive approaches. These results suggest that the overall strategy could benefit a large class of information extraction tasks, especially when training data are missing. However, from a user perspective, results were disappointing.
Further investigations are needed to design applicable end-user text mining tools for biologists.
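One plausible instantiation of the sentence-to-GO-category distance described in the abstract above (the abstract does not specify the exact scoring function; this sketch assumes bag-of-words cosine similarity and uses made-up example sentences):

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

go_term = "dna binding".split()
sentences = [  # hypothetical sentences from an article
    "the mutant protein retains dna binding activity in vitro",
    "mice were housed under standard laboratory conditions",
]
# Rank sentences by similarity to the GO term; the top-ranked sentence
# is the candidate evidence passage.
best = max(sentences, key=lambda s: cosine(s.split(), go_term))
print(best)
```

Any sentence sharing no vocabulary with the GO term scores 0, so the retrieved passage is the one mentioning "dna binding".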
  • Item
    Measuring Success
    Cullen, S ; Willocks, L (Caspian Publishing, 2007)
  • Item
    Structuring Documents Efficiently
    Marshall, RGJ ; Bird, SG ; Stuckey, PJ (University of Sydney, 2005)
  • Item
    A classification-based framework for learning object assembly
    Farmer, R. A. ; Hughes, B. (IEEE Computer Society Press, 2005)
    The relations between learning outcomes and the learning objects assembled to facilitate their achievement are increasingly the subject of investigation, particularly in approaches which advocate aggregating learning objects into complex constituencies for achieving learning outcomes. From the perspective of situated learning, we show how the CASE framework imbues learning objects with a closed set of properties which can be classified and aggregated into learning object assemblies in a principled fashion. We argue that the computational and pedagogical tractability of this model provides a new insight into learning object evaluation, and hence learning outcomes.
  • Item
    NICTA i2d2 at GeoCLEF 2005
    Hughes, Baden (2005)
    This paper describes the participation of the Interactive Information Discovery and Delivery (i2d2) project of National ICT Australia (NICTA) in the GeoCLEF track of the Cross Language Evaluation Forum 2005. We present some background on the NICTA i2d2 project to motivate our involvement, describing our systems and experimental interests. We review the design of our runs and the results of our submitted and subsequent experiments, and contribute a range of suggestions for future instantiations of a geospatial information retrieval track within a shared evaluation task framework.
  • Item
    Towards a Web search service for minority language communities
    Hughes, Baden (State Library of Victoria, 2006)
    Locating resources of interest on the web is, in the general case, a low-precision activity owing to the sheer number of pages on the web (Google, for example, covers more than 8 billion web pages). As language communities at all points on the spectrum increasingly self-publish materials on the web, interested users are beginning to search for them in the same way that they search for general internet resources: using broad-coverage search engines with typically simple queries. Given that language resources form a small minority of web content, finding relevant materials for low-density or lesser-used languages is an increasingly inefficient exercise even for experienced searchers. Furthermore, the inconsistent coverage of web content across search engines complicates matters even more. A number of previous research efforts have focused on using web data to create language corpora, mine linguistic data, build language ontologies, create thesauri, and so on. The work reported in this paper contrasts with previous research in that it is not oriented towards creating language resources from web data directly, but rather towards increasing the likelihood that end users searching for resources in minority languages will actually find useful results from web searches. Similarly, it differs from earlier work in its focus on search optimization directly, rather than as a component of a larger process (other researchers use the seed URIs discovered via the mechanism described in this paper in their own varied work). The work here contributes to a user-centric agenda for locating language resources for lesser-used languages on the web. (From Introduction)