Computing and Information Systems - Theses

Search Results

Now showing 1 - 10 of 11
  • Item
    The production of meaningful route directions using landmark extraction
    Furlan, Aidan Thomas ( 2006-11)
    In order for automated navigation systems to operate effectively, the route instructions they produce must be clear, concise and easily understood by users. The quality and coherence of route instructions may be improved via landmark chunking, whereby a turning instruction is given with reference to a nearby landmark. In order to incorporate a landmark within a coherent sentence, it is necessary to first understand how that landmark is conceptualised by travellers: whether it is perceived as point-like, line-like or area-like. This conceptualisation determines which prepositions and verbs are appropriate when referring to the landmark. This thesis investigates the viability of automatically classifying the conceptualisation of landmarks relative to a given city context. First, we construct a web-based annotation interface to solicit gold-standard judgements from expert annotators over a set of landmarks for three major cities (Melbourne, Hamburg and Tokyo). We then experiment with the use of web data to learn the default conceptualisation of those landmarks, analysing their occurrence in a fixed set of lexico-syntactic patterns. Based on this, we develop two automated landmark classifiers and evaluate them against the gold-standard annotations, investigating patterns of convergence or divergence in landmark categorisation.
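    As a rough illustration of the pattern-based approach the abstract describes, one can count how often a landmark surfaces after prepositions typical of each conceptualisation class. The preposition groupings and the example below are assumptions for illustration, not the thesis's actual pattern inventory:

```python
import re
from collections import Counter

# Prepositions loosely associated with each conceptualisation class.
# These groupings are illustrative assumptions, not the thesis's pattern set.
PATTERNS = {
    "point": ["at", "near"],
    "line": ["along", "across"],
    "area": ["in", "through"],
}

def classify_landmark(landmark, corpus):
    """Count 'PREP (the) <landmark>' hits and return the best-supported class."""
    counts = Counter()
    for label, preps in PATTERNS.items():
        for prep in preps:
            pattern = rf"\b{prep} (the )?{re.escape(landmark)}\b"
            counts[label] += len(re.findall(pattern, corpus, re.IGNORECASE))
    if sum(counts.values()) == 0:
        return None  # no evidence either way
    return counts.most_common(1)[0][0]
```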
  • Item
    Analysis and prediction of user behaviour in a museum environment
    GRIESER, KARL ( 2006)
    Visitors to a museum enter an environment with a wealth of information. However, not all of this information may be located in physical form: it may be accessible through an online website and available for download, or it may be presented by a tour guide who leads visitors through the museum. Neither of these sources of information allows a visitor to deviate from a set path. If visitors leave a guided tour, they lose access to the resource supplying them with the extra information; if they deviate from a downloaded tour, they will likewise lack the correct information sheets for exhibits that are not directly on their tour. The solution is to create a recommender system based on the conceptual similarity of the exhibits. This system creates a dynamic tour through the museum for a given visitor by recommending exhibits that the visitor is interested in. The conceptual similarity of exhibits can comprise elements including physical proximity, the semantic content of the exhibit, and the opinions of previous visitors. By using a combination of these similarities, we have produced a system that recommends relevant exhibits in 51% of test cases.
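    A minimal sketch of how a weighted combination of similarity components might rank exhibits; the component names, weights and scores below are invented for illustration and are not the thesis's formulation:

```python
def recommend(visited, exhibits, weights=(0.4, 0.4, 0.2), top_k=3):
    """Rank unvisited exhibits by a weighted blend of similarity components.

    `exhibits` maps an exhibit name to a dict of per-component similarities
    (physical proximity, semantic overlap, collaborative score from previous
    visitors), each in [0, 1]. Weights and component names are illustrative.
    """
    w_prox, w_sem, w_collab = weights
    scored = []
    for name, sim in exhibits.items():
        if name in visited:
            continue
        score = (w_prox * sim["proximity"]
                 + w_sem * sim["semantic"]
                 + w_collab * sim["collaborative"])
        scored.append((score, name))
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]
```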
  • Item
    Network intrusion detection techniques for single source and coordinated scans
    ZHANG, DANA ( 2005-10)
    A prelude to most malicious network attacks involves a systematic scan of a target network. Scans attempt to gather intelligence on the internal structure of a network with the aim of finding weaknesses to exploit. Various algorithms have been developed to identify these types of intrusions; however, there is no heuristic to confirm the accuracy of their results. Current algorithms only deal with attackers scanning from single sources, with no consideration for attackers who may be working from multiple locations. This thesis addresses the need for a conclusive evaluation technique and the need to effectively detect coordinated scans. Two innovative algorithms have been developed. The first is an improved comparison technique for current single-scan detection algorithms that can accurately measure the false positive rate and precision of identified scanners. The second is a coordinated scan detection algorithm that is capable of correctly identifying sets of sources working in collusion to explore the topology of a network.
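    For context, a naive single-source scan heuristic can be sketched as flagging sources that touch unusually many distinct targets. This toy threshold rule is an assumption for illustration only; it is neither the thesis's comparison technique nor its coordinated-scan detector:

```python
from collections import defaultdict

def detect_scanners(events, threshold=10):
    """Flag sources contacting many distinct (host, port) targets.

    `events` is an iterable of (source_ip, dest_ip, dest_port) tuples.
    The fixed distinct-target threshold is a simplifying assumption.
    """
    targets = defaultdict(set)
    for src, dst, port in events:
        targets[src].add((dst, port))
    return {src for src, tgts in targets.items() if len(tgts) >= threshold}
```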
  • Item
    Adaptive psychophysical procedures and Ulam's game
    KELAREVA, ELENA ( 2006-10)
    The problem of finding the threshold of a psychometric function is of major interest to researchers in the field of psychophysics, and has applications in many areas of medical diagnosis, particularly those dealing with perception, such as optometry and hearing tests. This problem is closely related to other problems in computer science, such as search with errors; however, most existing literature does not make this link. This thesis provides a review of existing algorithms for finding the threshold, with an emphasis on identifying the types of problems for which each algorithm is useful. We also address a number of issues which are not adequately covered in the literature. These include choosing an appropriate loss function to evaluate the performance of an algorithm for a given problem, as well as relating the problem of finding the threshold to binary search with errors in computer science. Finally, this research presents a new algorithm for finding the threshold of a psychometric function, ENT-FIRST, which improves performance compared to many existing algorithms.
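    To make the adaptive-procedure setting concrete, here is a classic 1-up/1-down staircase with step halving, a standard textbook procedure; it is illustrative only and is not the ENT-FIRST algorithm the thesis proposes:

```python
def staircase_threshold(respond, start=50.0, step=8.0, reversals=6):
    """Simple 1-up/1-down adaptive staircase: lower the stimulus after a
    detection, raise it after a miss, halving the step at each direction
    reversal. `respond(level)` returns True when the stimulus is detected.
    A classic illustrative procedure, not the thesis's ENT-FIRST algorithm.
    """
    level, direction, n_rev = start, 0, 0
    while n_rev < reversals:
        new_direction = -1 if respond(level) else 1
        if direction and new_direction != direction:
            n_rev += 1
            step /= 2  # refine the step after each reversal
        direction = new_direction
        level += direction * step
    return level
```

    With a deterministic observer whose true threshold is 30, the procedure converges to within a fraction of the initial step of that value.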
  • Item
    Computational gene finding in the human malaria parasite Plasmodium vivax
    Stivala, Alexander David ( 2006-10)
    Different approaches to genome annotation are reviewed and compared with reference-based annotation using GeneMapper in the human malaria parasite Plasmodium vivax. It is found that the latter approach does not achieve sensitivity and specificity as high as those of some ab initio techniques. Potential reasons for this are identified and discussed. As part of the process of using GeneMapper, codon substitution matrices are constructed and examined. This leads to the discovery of evidence from which we derive a conjecture regarding Plasmodium evolution.
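    A codon substitution matrix starts from counts of aligned codon pairs. The toy tally below illustrates that idea under the simplifying assumption of a gapless pairwise alignment; it is not GeneMapper's actual estimation procedure:

```python
from collections import Counter

def codon_substitutions(seq_a, seq_b):
    """Tally codon pairs at aligned positions of two equal-length coding
    sequences: the raw counts behind a codon substitution matrix.
    Assumes a gapless alignment and in-frame sequences (illustrative only).
    """
    assert len(seq_a) == len(seq_b) and len(seq_a) % 3 == 0
    counts = Counter()
    for i in range(0, len(seq_a), 3):
        counts[(seq_a[i:i + 3], seq_b[i:i + 3])] += 1
    return counts
```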
  • Item
    Statistical interpretation of compound nouns
    NICHOLSON, JEREMY ( 2005-10)
    We present a method for detecting compound nominalisations in open data, and deriving an interpretation for them. Discovering the semantic relationship between the modifier and head noun in a compound nominalisation is first construed as a two-way disambiguation task between an underlying subject or object relation between the head noun and its modifier, and second as a three-way task between subject, direct object, and prepositional object relations. The detection method achieves about 89% recall on a data set annotated by way of Celex and Nomlex, and about 70% recall on a randomly sampled data set based on the British National Corpus, with 77% recall on detecting a more general set of compound nouns from this data. The interpretation method achieves about 72% accuracy in the two-way task, and 57% in the three-way task, using a statistical measure based on z-scores (the confidence interval) to select one of the relations. Our proposed method has the advantage over previous research that it can act over open data to detect and interpret compound nominalisations, as opposed to only operating in a limited domain or requiring hand-selection or hand-tuning.
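    The z-score-based selection can be pictured as a binomial confidence test on relation counts. The function below is a generic sketch of that idea, not necessarily the exact statistic used in the thesis:

```python
import math

def choose_relation(subj_count, obj_count, z=1.96):
    """Pick SUBJ or OBJ when the observed proportion differs from 0.5 by
    more than a z-score confidence-interval half-width; otherwise abstain.
    A generic binomial sketch in the spirit of the abstract's description.
    """
    n = subj_count + obj_count
    if n == 0:
        return None
    p = subj_count / n
    half_width = z * math.sqrt(0.25 / n)  # max-variance CI half-width
    if p - 0.5 > half_width:
        return "SUBJ"
    if 0.5 - p > half_width:
        return "OBJ"
    return None  # evidence too weak to decide
```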
  • Item
    The effects of part-of-speech tagsets on tagger performance
    MACKINLAY, ANDREW ( 2005-11)
    In natural language processing (NLP), a crucial subsystem in a wide range of applications is a part-of-speech (POS) tagger, which labels (or classifies) unannotated words of natural language with part-of-speech labels corresponding to categories such as noun, verb or adjective. Mainstream approaches are generally corpus-based: a POS tagger learns from a corpus of pre-annotated data how to correctly tag unlabelled data. Previous work has tended to focus on applying new algorithms to the problem, or on adding hand-tuned features to assist in classifying difficult instances. Using these methods, a number of distinct approaches have plateaued at similar accuracy figures of 96.9 ± 0.3%. Here we approach the problem of improving accuracy in POS tagging from a different angle. We use a representative set of tagging algorithms and attempt to optimise performance by modifying the inventory of tags (or tagset) used in the pre-labelled training data. We modify tagsets by systematically mapping the tags of the training data to a new tagset. Our aim is to produce a tagset which is more conducive to automatic POS tagging by more accurately reflecting the underlying linguistic distinctions which should be encoded in a tagset. The mappings are reversible, enabling the original tags to be trivially recovered, which facilitates comparison with previous work and between competing mappings. We explore two broad sources of these mappings. Our primary focus is on using linguistic insight to determine potentially useful distinctions which we can then evaluate empirically; we also evaluate an alternative data-driven approach for extracting patterns of regularity in a tagged corpus. Our experiments indicate the approach is not as successful as we had predicted. Our most successful mappings were data-driven, giving improvements of approximately 0.01% in token-level accuracy over the development set using specific taggers, with increments of 0.03% over the test set.
We show that a wide range of linguistically motivated modifications cause a performance decrement, while the best linguistically motivated mappings approximately maintain performance, producing up to a 0.05% improvement over the development data. Our results lead us to believe that this line of research is unlikely to provide significant gains over conventional approaches to POS tagging.
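    A trivially reversible mapping of the kind described can be sketched as a lexically conditioned tag split, with a suffix that the reverse mapping strips off. The particular VB split below is a hypothetical example, not one of the thesis's mappings:

```python
def split_vb_tag(word, tag):
    """Refine the VB tag by lexical identity: a hypothetical linguistically
    motivated split in the spirit of the thesis's reversible mappings."""
    if tag == "VB" and word.lower() in {"be", "have", "do"}:
        return f"VB-{word.lower()}"
    return tag

def restore_tag(tag):
    """Reversibility: strip the refinement suffix to recover the original tag."""
    return tag.split("-")[0] if tag.startswith("VB-") else tag
```

    Because the split is deterministic and the suffix is recoverable, taggers trained on the refined tagset can still be scored against the original tags.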
  • Item
    Natural language as an agent communication language
    MARCH, OLIVIA ( 2004)
    Intelligent agents should be able to communicate with each other using an extensible, expressive language. Agents should have the ability to work together in a heterogeneous environment to solve complex goals while acting on their own initiative and maintaining autonomy. Current agent communication languages are not expressive enough to facilitate coordination of agents in a heterogeneous system. Natural languages, such as English, have evolved to be expressive enough to support the full range of human collaboration; they have been refined over millennia and are proven extensible. This research demonstrates the feasibility of using natural language as an agent communication language for intelligent agents solving a collaborative task.
  • Item
    Combining part of speech induction and morphological induction
    Wilson, Charlotte ( 2004-11)
    Linguistic information is useful in natural language processing, information retrieval and a multitude of sub-tasks involving language analysis. Two types of linguistic information in all languages are part of speech and morphology. Part of speech information reflects syntactic structure and can assist in tasks such as speech recognition, machine translation and word sense disambiguation. Morphological information describes the structure of words and has application in automated spelling correction, natural language generation and information retrieval for morphologically complex languages. Machine learning methods in natural language processing acquire linguistic information from corpora of natural language text. While supervised learning algorithms are trained on texts that have been annotated with linguistic features, induction algorithms learn linguistic information from unannotated corpora. Such algorithms avoid any requirement for linguistically annotated training data - a resource that is highly time-intensive to produce. However, in learning from unannotated corpora, only limited sources of information are available. In practice, part of speech induction methods usually learn from distributional evidence about the contexts in which words occur. In contrast, morphological induction methods tend to be based on the orthographic structure of the words in the corpus. However, a word’s morphological form and syntactic function often correlate: a word’s morphology may indicate its syntactic function and vice versa. Thus, both distributional and orthographic evidence may be useful for both tasks. This thesis investigates the extent to which the information induced by one learner can be used to bootstrap the other: specifically, whether the incorporation of explicit annotations from one learner can improve the performance of the other.
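    To illustrate the orthographic evidence available to a morphological induction learner, one crude signal is the set of frequent word-final substrings shared across words. The sketch below is purely illustrative and far simpler than real induction algorithms:

```python
from collections import Counter

def induce_suffixes(words, max_len=3, min_count=2):
    """Collect final substrings (up to `max_len` characters, leaving at
    least a two-character stem) that recur across the vocabulary: a crude
    orthographic signal of the kind morphological induction exploits.
    Purely illustrative; real systems are considerably more involved.
    """
    suffixes = Counter()
    for word in words:
        for k in range(1, min(max_len, len(word) - 2) + 1):
            suffixes[word[-k:]] += 1
    return {suffix for suffix, count in suffixes.items() if count >= min_count}
```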
  • Item
    XSLT as a linguistic query language
    Taylor, Claire Louise ( 2003-11)
    With the growing use of linguistic data, suitable storage techniques and query languages need to be developed. A traditional relational database management system is inappropriate for linguistic data, which typically has some sort of structure associated with it that can represent hierarchical or sequential relationships. Although there are many different forms of linguistic annotation, there are few query languages that succinctly service the data by providing the necessary features such as data accessibility, transformation and integration. The current challenge facing the creators of linguistic corpora and the corresponding query languages is to find a query language that is expressive enough to enable the features mentioned above while still providing an interface to the data that allows the corpus to be queried in terms of the user’s conceptual model. Previous work in this area has suggested that the hierarchical nature of XML would be well suited to linguistic data and that an existing XML query language could be applied to linguistic queries. This thesis represented two linguistic corpora, TIMIT and the Penn Treebank, in XML. Two possible XML representations for TIMIT were explored to illustrate that a permutation in the structure of the data has a significant effect on the ease of writing queries for it. Data structures that were closely related to the user’s conceptual model of the data for a given query were easier to write queries for. It was concluded that the final XML representation for a given corpus would depend on the possible uses of the data.
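    To suggest what a hierarchical query over such an XML corpus looks like, here is a sketch using Python's standard ElementTree (rather than XSLT) over a toy treebank-style encoding. The element names and structure are assumptions for illustration, not the thesis's actual schema:

```python
import xml.etree.ElementTree as ET

# A toy Penn Treebank-style XML encoding (structure assumed for illustration).
tree = ET.fromstring(
    "<S><NP><w pos='NNP'>John</w></NP>"
    "<VP><w pos='VBZ'>sees</w><NP><w pos='NN'>Mary</w></NP></VP></S>"
)

# Query: words inside noun phrases dominated by a verb phrase -- the kind of
# hierarchical query that motivates an XML representation of the corpus.
objects = [w.text
           for vp in tree.iter("VP")
           for np in vp.findall("NP")
           for w in np.findall("w")]
```

    When the XML structure mirrors the user's conceptual model (here, phrase dominance), such queries stay short; a flatter encoding would make the same query considerably more awkward, which is the point the abstract makes about representation choice.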