Computing and Information Systems - Research Publications

Search Results

  • Item
    Collecting low-density language materials on the Web
    Baldwin, Timothy ; Bird, Steven ; Hughes, Baden (Southern Cross University, 2006)
    Most web content exists in a few dozen languages. Hundreds of other languages - the 'low-density languages' - are only represented in scarce quantities on the web. How can we locate, store and describe these low-density resources? In particular, how can we identify linguistically interesting resources, such as translation sets and multilingual documents? In this paper we describe ongoing research in which we integrate a number of discrete systems (language data crawler, automated metadata generation tools, language data repositories and federated search services) to address the identification, retrieval, description, storage and access issues for low-density language materials from the web.
  • Item
    Analysis and prediction of user behaviour in a museum environment
    Grieser, Karl ; Baldwin, Timothy ; Bird, Steven (Australasian Language Technology Association, 2006)
  • Item
    Reconsidering language identification for written language resources
    Hughes, Baden ; Baldwin, Timothy ; Bird, Steven ; Nicholson, Jeremy ; MacKinlay, Andrew (European Language Resources Association, 2006)
    The task of identifying the language in which a given document (ranging from a sentence to thousands of pages) is written has been relatively well studied over several decades. Automated approaches to written language identification are used widely throughout research and industrial contexts, over both oral and written source materials. Despite this widespread acceptance, a review of previous research in written language identification reveals a number of questions which remain open and ripe for further investigation.
  • Item
    Open source corpus analysis tools for Malay
    Baldwin, Timothy ; Awab, Su'ad (2006)
    Tokenisers, lemmatisers and POS taggers are vital to the linguistic and digital furtherment of any language. In this paper, we present an open source toolkit for Malay incorporating a word and sentence tokeniser, a lemmatiser and a partial POS tagger, based on heavy reuse of pre-existing language resources. We outline the software architecture of each component, and present an evaluation of each over a 26K word sample of Malay text.
  • Item
    Looking for Prepositional Verbs in Corpus Data
    Baldwin, TJ (Association for Computational Linguistics, 2005)
  • Item
    An unsupervised approach to interpreting noun compounds
    Su, NK ; Baldwin, T (IEEE, 2008-12-01)
  • Item
    MRD-based Word Sense Disambiguation: Further Extending Lesk
    Baldwin, T ; Kim, S ; Bond, ; Fujita, ; Martinez, D ; Tanaka, (Asian Federation of Natural Language Processing, 2008)
  • Item
    Disambiguating noun compounds
    Kim, SN ; Baldwin, T (Association for the Advancement of Artificial Intelligence, 2007-11-28)
  • Item
    Distributional Similarity and Preposition Semantics
    Baldwin, T ; SaintDizier, P (Springer, 2006)
  • Item
    Beauty and the Beast: What Running a Broad-Coverage Precision Grammar over the BNC Taught Us about the Grammar — and the Corpus
    Baldwin, T ; Beavers, J ; Bender, EM ; Flickinger, D ; Kim, A ; Oepen, S ; Reis, M ; Kepser, S (Mouton de Gruyter, 2005-12-15)