Computing and Information Systems - Theses

Search Results

Now showing 1 - 10 of 17
  • Item
    Developing systems for gene normalisation
    Goudey, Benjamin ( 2007-10)
    The rapid growth of biomedical literature has attracted interest from the text mining community to develop methods to help manage the ever-increasing amounts of data. Initiatives such as the BioCreative challenge (Hirschman et al. 2005b) have created standard corpora and tasks in which to evaluate a variety of systems in a common framework. One such task is gene normalisation, in which the problems of synonymy and polysemy in gene name identification are overcome by mapping each mention back to a unique identifier, unambiguously identifying that gene. This task is one of the foundations required for any kind of text mining system working with biomedical literature, where we must be very certain of which genes are being discussed in the text. (For complete abstract open document)
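The normalisation step the abstract describes can be sketched as a synonym-lexicon lookup. The entries and identifiers below are illustrative examples only, not taken from the thesis or any BioCreative resource:

```python
# Minimal sketch of gene normalisation: map each surface mention of a gene
# to a unique identifier via a synonym lexicon. Entries are invented for
# illustration.

GENE_LEXICON = {
    "p53": "GeneID:7157",
    "tp53": "GeneID:7157",
    "tumor protein p53": "GeneID:7157",
    "brca1": "GeneID:672",
}

def normalise(mention):
    """Return the unique identifier for a gene mention, if known."""
    return GENE_LEXICON.get(mention.strip().lower())
```

Real systems must additionally disambiguate mentions that map to several identifiers; the point of the task is that this mapping removes both synonymy (many names, one gene) and polysemy (one name, many genes).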
  • Item
    Context-sensitive glossing of Japanese webpages
    Yap, Willy ( 2007-10)
    This thesis proposes a method for automatically sense-to-sense aligning dictionaries in different languages (focusing on Japanese and English), based on structural data in the respective dictionaries. The basis of the proposed method is sentence similarity of the sense definition sentences, using a bilingual Japanese-to-English dictionary as a pivot during the alignment process. We experiment with various extensions to the basic method, including term weighting, stemming/lemmatisation, and ontology expansion.
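The pivot-based alignment the abstract describes can be sketched as: translate the content words of a Japanese sense definition through a bilingual dictionary, then score token overlap with each candidate English definition. The dictionary entries, tokenisation, and Dice scoring below are assumptions made for the example:

```python
# Sketch of sense alignment via a bilingual pivot dictionary. The pivot
# entries and candidate sense definitions are invented for illustration.

PIVOT = {"犬": {"dog"}, "動物": {"animal"}, "飼う": {"keep", "raise"}}

def translate(tokens):
    """Map Japanese tokens to the union of their English translations."""
    out = set()
    for t in tokens:
        out |= PIVOT.get(t, set())
    return out

def dice(a, b):
    """Dice coefficient between two token sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

ja_def = ["犬", "動物"]  # tokenised Japanese sense definition
candidates = {
    "dog.n.1": {"domesticated", "animal", "dog"},
    "hound.n.2": {"hunting", "dog"},
}
scores = {sense: dice(translate(ja_def), toks) for sense, toks in candidates.items()}
best = max(scores, key=scores.get)
```

The extensions mentioned (term weighting, lemmatisation, ontology expansion) would respectively reweight, normalise, and enlarge the token sets before this comparison.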
  • Item
    Dynamically detecting and modelling the visitor’s interests in a museum environment
    Yang, Michael ( 2007-10)
    A museum contains a diversity of information sources, from which a visitor can select according to their interests. Although every visitor may have a different purpose for visiting the museum, in our research we assume the motivation for all visits is to obtain information. However, more often than not, visitors do not achieve thorough coverage of the topics of interest to them, given the limited duration of the tour. To maximise this coverage, the solution is to increase our knowledge of each individual’s information needs and develop a system that provides personalised and contextually relevant navigation to the visitor. The personalisation component ensures the provided information is adapted to the interests of each individual visitor, whereas the contextual relevance component considers the state of the visitor in relation to their surroundings. In this project, we develop a prototype system that aims to address both components; in particular, we experiment with various computational linguistic models on the prototype system and contrast their effectiveness in detecting and modelling the visitor’s interests during the museum tour.
  • Item
    The production of meaningful route directions using landmark extraction
    Furlan, Aidan Thomas ( 2006-11)
    In order for automated navigation systems to operate effectively, the route instructions they produce must be clear, concise and easily understood by users. The quality and coherence of route instructions may be improved via landmark chunking, whereby a turning instruction is given with reference to a nearby landmark. In order to incorporate a landmark within a coherent sentence, it is necessary to first understand how that landmark is conceptualised by travellers — whether it is perceived as point-like, line-like or area-like. This conceptualisation determines which prepositions and verbs are appropriate when referring to the landmark. This thesis investigates the viability of automatically classifying the conceptualisation of landmarks relative to a given city context. First, we construct a web-based annotation interface to solicit gold-standard judgements from expert annotators over a set of landmarks for three major cities (Melbourne, Hamburg and Tokyo). We then experiment with the use of web data to learn the default conceptualisation of those landmarks, analysing their occurrence in a fixed set of lexico-syntactic patterns. Based on this, we develop two automated landmark classifiers and evaluate them against the gold standard annotations, investigating patterns of convergence or divergence in landmark categorisation.
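The pattern-counting step can be sketched as follows: count a landmark's occurrences in prepositional patterns associated with each conceptualisation and pick the majority class. The pattern-to-class mapping here is an assumption made for the example, not the thesis's actual pattern set:

```python
# Illustrative sketch of pattern-based landmark classification. The
# patterns and their class assignments are assumptions for illustration.

from collections import Counter
import re

PATTERNS = {
    "point": [r"\bat (the )?{lm}\b", r"\bnext to (the )?{lm}\b"],
    "line":  [r"\balong (the )?{lm}\b", r"\bdown (the )?{lm}\b"],
    "area":  [r"\bin (the )?{lm}\b", r"\bthrough (the )?{lm}\b"],
}

def classify(landmark, corpus):
    """Return the conceptualisation with the most pattern hits, or None."""
    counts = Counter()
    for cls, pats in PATTERNS.items():
        for pat in pats:
            hits = re.findall(pat.format(lm=re.escape(landmark)),
                              corpus, re.IGNORECASE)
            counts[cls] += len(hits)
    best, n = counts.most_common(1)[0]
    return best if n > 0 else None

corpus = "Turn left at Flinders Street Station, then walk along the river."
```

In the thesis the counts come from web data rather than a toy corpus, but the classification decision has this shape.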
  • Item
    Analysis and prediction of user behaviour in a museum environment
    GRIESER, KARL ( 2006)
    Visitors to a museum enter an environment with a wealth of information. However, not all of this information may be available in physical form: it may be accessible through an online website and available for download, or it may be presented by a tour guide who leads visitors through the museum. Neither of these sources of information allows a visitor to deviate from a set path. If a visitor leaves a guided tour, they lose access to the resource supplying the extra information; if they deviate from a downloaded tour, they will not have the correct information sheets for an exhibit that is not directly on their tour. The solution is to create a recommender system based on the conceptual similarity of the exhibits. This system creates a dynamic tour through the museum for a given visitor by recommending exhibits that the visitor is interested in. Conceptual similarity of exhibits can combine elements including physical proximity, the semantic content of the exhibit, and the opinions of previous visitors. By using a combination of these similarities, we have produced a system that recommends relevant exhibits in 51% of test cases.
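The combination of similarity components the abstract mentions can be sketched as a weighted linear combination. The component scores and weights below are illustrative assumptions, not values from the thesis:

```python
# Sketch of combining exhibit-similarity components (physical proximity,
# semantic content, prior visitors' opinions) into one recommendation
# score. Weights and scores are invented for illustration.

def combine(components, weights):
    """Weighted linear combination of per-pair similarity scores in [0, 1]."""
    return sum(weights[name] * score for name, score in components.items())

weights = {"proximity": 0.3, "semantic": 0.4, "opinion": 0.3}
pair = {"proximity": 0.9, "semantic": 0.5, "opinion": 0.7}
score = combine(pair, weights)   # roughly 0.68
```

A recommender would rank candidate exhibits by this score and suggest the top-ranked ones not yet visited.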
  • Item
    Network intrusion detection techniques for single source and coordinated scans
    ZHANG, DANA ( 2005-10)
    A prelude to most malicious network attacks is a systematic scan of a target network. Scans attempt to gather intelligence on the internal structure of a network with the aim of finding weaknesses to exploit. Various algorithms have been developed to identify these types of intrusions; however, there is no heuristic to confirm the accuracy of their results. Current algorithms only deal with attackers scanning from single sources, with no consideration for attackers that may be working from multiple locations. This thesis addresses the need for a conclusive evaluation technique and the need to effectively detect coordinated scans. Two innovative algorithms have been developed. The first is an improved comparison technique for current single-scan detection algorithms that can accurately measure the false positive rate and precision of identified scanners. The second is a coordinated scan detection algorithm that is capable of correctly identifying sets of sources working in collusion to explore the topology of a network.
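One simple heuristic for the coordination problem, shown only to make the task concrete (it is an assumption, not the thesis's algorithm), is to flag groups of sources whose scanned address sets are near-disjoint yet jointly cover a large slice of the target network:

```python
# Sketch of a coverage/overlap heuristic for coordinated-scan detection.
# The observed scan data and thresholds are invented for illustration.

from itertools import combinations

def coverage_and_overlap(scans):
    """Return (size of joint coverage, total pairwise overlap) for a group."""
    union = set().union(*scans)
    overlap = sum(len(a & b) for a, b in combinations(scans, 2))
    return len(union), overlap

observed = {
    "10.0.0.1": {f"192.168.1.{i}" for i in range(0, 80)},
    "10.0.0.2": {f"192.168.1.{i}" for i in range(80, 160)},
    "10.0.0.3": {f"192.168.1.{i}" for i in range(160, 254)},
}
cov, ov = coverage_and_overlap(list(observed.values()))
suspicious = cov >= 200 and ov <= 5   # thresholds are illustrative
```

Three unrelated scanners would typically overlap heavily or cover little; complementary, near-disjoint coverage is the signature of a divided workload.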
  • Item
    Adaptive psychophysical procedures and Ulam's game
    KELAREVA, ELENA ( 2006-10)
    The problem of finding the threshold of a psychometric function is of major interest to researchers in the field of psychophysics, and has applications in many areas of medical diagnosis, particularly those dealing with perception, such as optometry and hearing tests. This problem is closely related to other problems in computer science, such as search with errors; however, most existing literature does not make this link. This thesis provides a review of existing algorithms for finding the threshold, with an emphasis on identifying the types of problems for which each algorithm is useful. We also address a number of issues which are not adequately covered in the literature. These include choosing an appropriate loss function to evaluate the performance of an algorithm for a given problem, as well as relating the problem of finding the threshold to binary search with errors problems in computer science. Finally, this research presents a new algorithm for finding the threshold of a psychometric function, ENT-FIRST, which results in improved performance compared to many existing algorithms.
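To ground the link between threshold-finding and search with errors, here is a textbook one-up/one-down staircase, the simplest adaptive psychophysical procedure. This is background material, not the ENT-FIRST algorithm proposed in the thesis:

```python
# Minimal one-up/one-down staircase sketch: lower the stimulus level after
# a correct response, raise it after an incorrect one, so the level hovers
# around the observer's threshold. Parameters are illustrative.

def staircase(respond, start=10.0, step=1.0, trials=20):
    """Run `trials` trials; `respond(level)` returns True if the observer
    responds correctly at that stimulus level."""
    level = start
    history = []
    for _ in range(trials):
        correct = respond(level)
        history.append(level)
        level += -step if correct else step
    return history

# Simulated noiseless observer with threshold 5: correct iff level >= 5.
hist = staircase(lambda lvl: lvl >= 5)
```

With a noisy observer each response is a Bernoulli trial, which is exactly the search-with-errors setting the abstract refers to.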
  • Item
    Computational gene finding in the human malaria parasite Plasmodium vivax
    Stivala, Alexander David ( 2006-10)
    Different approaches to genome annotation are reviewed and compared with reference-based annotation using GeneMapper in the human malaria parasite Plasmodium vivax. It is found that the latter approach does not achieve sensitivity and specificity as high as those of some ab initio techniques. Potential reasons for this are identified and discussed. As part of the process of using GeneMapper, codon substitution matrices are constructed and examined. This leads to the discovery of evidence from which we derive a conjecture regarding Plasmodium evolution.
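The sensitivity and specificity figures used in such comparisons follow the standard gene-prediction definitions (Sn = TP/(TP+FN), Sp = TP/(TP+FP)); the counts in the example below are invented:

```python
# Standard nucleotide-level evaluation metrics for gene prediction.

def sensitivity(tp, fn):
    """Fraction of true coding bases that were predicted as coding."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Fraction of predicted coding bases that are truly coding (the
    gene-prediction convention; elsewhere this is called precision)."""
    return tp / (tp + fp)
```
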
  • Item
    Statistical interpretation of compound nouns
    NICHOLSON, JEREMY ( 2005-10)
    We present a method for detecting compound nominalisations in open data, and deriving an interpretation for them. Discovering the semantic relationship between the modifier and head noun in a compound nominalisation is first construed as a two-way disambiguation task between an underlying subject or object semantic relation between a head noun and its modifier, and second as a three-way task between subject, direct object, and prepositional object relations. The detection method achieves about 89% recall on a data set annotated by way of Celex and Nomlex, and about 70% recall on a randomly sampled data set based on the British National Corpus, with 77% recall on detecting a more general set of compound nouns from this data. The interpretation method achieves about 72% accuracy in the two-way task, and 57% in the three-way task, using a statistical measure based on z-scores (the confidence interval) in selecting one of the relations. Our proposed method has the advantage over previous research in that it can act over open data to detect and interpret compound nominalisations, as opposed to only operating in a limited domain or requiring hand-selection or hand-tuning.
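The shape of a z-score decision rule for the two-way task can be sketched as follows. The decision rule and counts below are assumptions made to illustrate the kind of test the abstract mentions, not the thesis's exact formulation:

```python
# Sketch of a z-score relation selector: given corpus evidence counts for
# the subject and object readings, select a relation only when the
# proportion differs from chance (0.5) at 95% confidence.

import math

def choose_relation(n_subj, n_obj, z_crit=1.96):
    """Return 'subject', 'object', or None if evidence is inconclusive."""
    n = n_subj + n_obj
    p = n_subj / n
    z = (p - 0.5) / math.sqrt(0.25 / n)   # one-proportion z against p0 = 0.5
    if z > z_crit:
        return "subject"
    if z < -z_crit:
        return "object"
    return None
```
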
  • Item
    The effects of part-of-speech tagsets on tagger performance
    MACKINLAY, ANDREW ( 2005-11)
    In natural language processing (NLP), a crucial subsystem in a wide range of applications is a part-of-speech (POS) tagger, which labels (or classifies) unannotated words of natural language with part-of-speech labels corresponding to categories such as noun, verb or adjective. Mainstream approaches are generally corpus-based: a POS tagger learns from a corpus of pre-annotated data how to correctly tag unlabelled data. Previous work has tended to focus on applying new algorithms to the problem, or on adding hand-tuned features to assist in classifying difficult instances. Using these methods, a number of distinct approaches have plateaued at similar accuracy figures of 96.9 ± 0.3%. Here we approach the problem of improving accuracy in POS tagging from a unique angle. We use a representative set of tagging algorithms and attempt to optimise performance by modifying the inventory of tags (or tagset) used in the pre-labelled training data. We modify tagsets by systematically mapping the tags of the training data to a new tagset. Our aim is to produce a tagset which is more conducive to automatic POS tagging by more accurately reflecting the underlying linguistic distinctions which should be encoded in a tagset. The mappings are reversible, enabling the original tags to be trivially recovered, which facilitates comparison with previous work and between competing mappings. We explore two different broad sources of these mappings. Our primary focus is on using linguistic insight to determine potentially useful distinctions which we can then evaluate empirically. We also evaluate an alternative data-driven approach for extracting patterns of regularity in a tagged corpus. Our experiments indicate the approach is not as successful as we had predicted. Our most successful mappings were data-driven, giving improvements of approximately 0.01% in token-level accuracy over the development set using specific taggers, with increments of 0.03% over the test set. We show that a wide range of linguistically motivated modifications cause a performance decrement, while the best linguistic approaches approximately maintain performance over the development data and produce up to 0.05% improvement over the development data. Our results lead us to believe that this line of research is unlikely to provide significant gains over conventional approaches to POS tagging.
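One reversible tagset modification of the kind described can be sketched as a tag split whose added suffix is trivially strippable. The VB auxiliary/main split and the word list are illustrative assumptions, not mappings from the thesis:

```python
# Sketch of a reversible tagset mapping: split one tag into finer tags that
# encode a linguistic distinction, such that stripping the added suffix
# recovers the original tag exactly. The split shown is invented.

AUX_VERBS = {"be", "have", "do", "will", "can", "may"}

def split_tag(word, tag):
    """Map an original tag to a finer tag in the new tagset."""
    if tag == "VB":
        return "VB-AUX" if word.lower() in AUX_VERBS else "VB-MAIN"
    return tag

def recover_tag(tag):
    """Reverse the mapping by stripping the suffix added by split_tag."""
    return "VB" if tag.startswith("VB-") else tag
```

Because every new tag maps back to exactly one original tag, accuracy on the recovered tags remains directly comparable with prior work.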