Computing and Information Systems - Research Publications

  • Item
    Towards a semantic lexicon for biological language processing
    Verspoor, K (HINDAWI LTD, 2005)
    This paper explores the use of the resources in the National Library of Medicine's Unified Medical Language System (UMLS) for the construction of a lexicon useful for processing texts in the field of molecular biology. A lexicon is constructed from overlapping terms in the UMLS SPECIALIST lexicon and the UMLS Metathesaurus to obtain both morphosyntactic and semantic information for terms, and the coverage of a domain corpus is assessed. Over 77% of tokens in the domain corpus are found in the constructed lexicon, validating the lexicon's coverage of the most frequent terms in the domain and indicating that the constructed lexicon is potentially an important resource for biological text processing.
  • Item
    Structuring Documents Efficiently
    MARSHALL, RGJ ; BIRD, SG ; STUCKEY, PJ (University of Sydney, 2005)
  • Item
    A classification-based framework for learning object assembly
    Farmer, R. A. ; Hughes, B. (IEEE Computer Society Press, 2005)
    Relations between learning outcomes and the learning objects which are assembled to facilitate their achievement are the subject of increasingly prevalent investigation, particularly with approaches which advocate the aggregation of learning objects as complex constituencies for achieving learning outcomes. From the perspective of situated learning, we show how the CASE framework imbues learning objects with a closed set of properties which can be classified and aggregated into learning object assemblies in a principled fashion. We argue that the computational and pedagogical tractability of this model provides a new insight into learning object evaluation, and hence learning outcomes.
  • Item
    NICTA i2d2 at GeoCLEF 2005
    HUGHES, BADEN (2005)
    This paper describes the participation of the Interactive Information Discovery and Delivery (i2d2) project of National ICT Australia (NICTA) in the GeoCLEF track of the Cross Language Evaluation Forum 2005. We present some background information about the NICTA i2d2 project to motivate our involvement, and describe our systems and experimental interests. We review the design of our runs and the results of our submitted and subsequent experiments, and contribute a range of suggestions for future instantiations of a geospatial information retrieval track within a shared evaluation task framework.
  • Item
    Towards a Web search service for minority language communities
    HUGHES, BADEN (State Library of Victoria, 2006)
    Locating resources of interest on the web in the general case is at best a low precision activity owing to the large number of pages on the web (for example, Google covers more than 8 billion web pages). As language communities (at all points on the spectrum) increasingly self-publish materials on the web, so interested users are beginning to search for them in the same way that they search for general internet resources, using broad coverage search engines with typically simple queries. Given that language resources are a minority case on the web in general, finding relevant materials for low density or lesser used languages on the web is in general an increasingly inefficient exercise even for experienced searchers. Furthermore, the inconsistent coverage of web content between search engines serves to complicate matters even more. A number of previous research efforts have focused on using web data to create language corpora, mine linguistic data, build language ontologies, create thesauri, etc. The work reported in this paper contrasts with previous research in that it is not specifically oriented towards creation of language resources from web data directly, but rather towards increasing the likelihood that end users searching for resources in minority languages will actually find useful results from web searches. Similarly, it differs from earlier work by virtue of its focus on search optimization directly, rather than as a component of a larger process (other researchers use the seed URIs discovered via the mechanism described in this paper in their own varied work). The work here can be seen to contribute to a user-centric agenda for locating language resources for lesser-used languages on the web. (From Introduction)
  • Item
    Searching for language resources on the Web: user behaviour in the Open Language Archives Community
    HUGHES, BADEN (European Language Resources Association, 2006)
    While much effort is expended in the curation of language resources, such investment is largely irrelevant if users cannot locate resources of interest. The Open Language Archives Community (OLAC) was established to define standards for the description of language resources and provide core infrastructure for a virtual digital library, thus addressing the resource discovery issue. In this paper we consider naturalistic user search behaviour in the Open Language Archives Community. Specifically, we have collected the query logs from the OLAC Search Engine over a 2-year period, collecting in excess of 1.2 million queries in over 450K user search sessions. Subsequently we have mined these to discover user search patterns of various types, all pertaining to the discovery of language resources. A number of interesting observations can be made based on this analysis; in this paper we report on a range of properties and behaviours based on empirical evidence.
  • Item
    Gold as a standard for linguistic data interoperation: a road map for development
    Simons, G. F. ; Hughes, B. (2006)
    GOLD, the General Ontology for Linguistic Description [1], has somewhat unexpectedly emerged from the EMELD project. Originally conceived of as a morphosyntactic annotation inventory and label mapping scheme, GOLD has now been formalized as an ontology by which disparate data sets can be integrated through a common representation of the basic linguistic features. The overall vision of the GOLD Community is that: "By agreeing on a shared ONTOLOGY of linguistic concepts and on a shared infrastructure for INTEROPERATION, the linguistics community will be able to produce RESOURCES that describe individual languages in a comparable way, to develop TOOLS that produce these comparable resources, and to query SERVICES that aggregate as many comparable resources as are available." [2] In the EMELD context, a significant amount of effort has been invested in the development of GOLD in the first dimension of this vision, namely a shared collection of linguistic concepts. Initial surveying work has been completed to glean linguistic concepts and their definitions from published materials. This survey work has been complemented by web data mining activities [3] to further increase the coverage of GOLD. GOLD has been instantiated in several formal versions, and a range of proof of concept implementations have featured at previous EMELD events [4, 5, 6] and other venues [7]. However, the latter four items from the GOLD Community vision (to achieve interoperation through resources, tools, and services) remain largely unaddressed, and thus there remains considerable effort to be expended in achieving the vision in its entirety. Upon reflection, we believe that there are presently three significant barriers to the widespread adoption of GOLD and subsequent realization of the interoperation goals, viz.: * the complexity of the dissemination format, which in effect places the threshold for engagement with GOLD at too high a level; * the absence of a well defined change process
  • Item
    Metadata Challenges for Situational Properties of Learning Objects
    HUGHES, B ; FARMER, RA (IEEE Computer Society, 2006)
  • Item
    Collecting low-density language materials on the Web
    Baldwin, Timothy ; BIRD, STEPHEN ; HUGHES, BADEN (Southern Cross University, 2006)
    Most web content exists in a few dozen languages. Hundreds of other languages - the 'low-density languages' - are only represented in scarce quantities on the web. How can we locate, store and describe these low-density resources? In particular, how can we identify linguistically interesting resources, such as translation sets and multilingual documents? In this paper we describe ongoing research in which we integrate a number of discrete systems (language data crawler, automated metadata generation tools, language data repositories and federated search services) to address the identification, retrieval, description, storage and access issues for low-density language materials from the web.
  • Item
    Frontiers in Linguistic Annotation for Lower-Density Languages
    Maxwell, M. ; Hughes, B. (Association for Computational Linguistics, 2006)
    The languages that are most commonly subject to linguistic annotation on a large scale tend to be those with the largest populations or with recent histories of linguistic scholarship. In this paper we discuss the problems associated with lower-density languages in the context of the development of linguistically annotated resources. We frame our work with three key questions regarding the definition of lower-density languages, increasing available resources, and reducing data requirements. A number of steps forward are identified for increasing the number of lower-density language corpora with linguistic annotations.