Computing and Information Systems - Theses

  • Item
    Crowdsourcing lexical semantic judgements from bilingual dictionary users
    Fothergill, Richard James ( 2017)
    Words can take on many meanings, and collecting and identifying example usages representative of the full variety of meanings words can take is a bottleneck to the study of lexical semantics using statistical approaches. To perform supervised word sense disambiguation (WSD), or to evaluate knowledge-based methods, a corpus of texts annotated with senses from a dictionary may be constructed by paid experts. However, the cost usually prohibits more than a small sample of words and senses being represented in the corpus. Crowdsourcing methods promise to acquire data more cheaply, albeit with a greater challenge for quality control. Most crowdsourcing to date has incentivised participation in the form of a payment or by gamification of the resource construction task. However, with paid crowdsourcing the cost of human labour scales linearly with the output size, and while game playing volunteers may be free, gamification studies must compete with a multi-billion dollar games industry for players. In this thesis we develop and evaluate resources for computational semantics, working towards a crowdsourcing method that extracts information from naturally occurring human activities. A number of software products exist for glossing Japanese text with entries from a dictionary for English speaking students. However, the most popular ones have a tendency to either present an overwhelming amount of information containing every sense of every word or else hide too much information and risk removing senses with particular relevance to a specific text. By offering a glossing application with interactive features for exploring word senses, we create an opportunity to crowdsource human judgements about word senses and record human interaction with semantic NLP.
  • Item
    Coreference resolution for biomedical pathway data
    Choi, Miji Jooyoung ( 2017)
    The study of biological pathways is a major activity in the life sciences. Biological pathways provide understanding and interpretation of many different kinds of biological mechanisms such as metabolism, sending of signals between cells, regulation of gene expression, and production of cells. If there are defects in a pathway, the result may be a disease. Thus, biological pathways are used to support diagnosis of disease, more effective drug prescription, or personalised treatments. Even though there are many pathway resources providing useful information discovered with manual efforts, a great deal of relevant information concerning such pathways is scattered through the vast biomedical literature. With the growth in the volume of the biomedical literature, many natural language processing methods for automatic information extraction have been studied, but there still exist a variety of challenges such as complex or hidden representations due to the use of coreference expressions in texts. Linguistic expressions such as it, they, or the gene are frequently used by authors to avoid repeating the names of entities or repeating complex descriptions that have previously been introduced in the same text. This thesis addresses three research goals: (1) examining whether an existing coreference resolution approach in the general domain can be adapted to the biomedical domain; (2) investigating a heuristic strategy for coreference resolution in the biomedical literature; and (3) examining how coreference resolution can improve biological pathway data from the perspectives of information extraction, and of evaluation of existing pathway resources. In this thesis, we propose a new categorical framework that provides detailed analysis of performance of coreference resolution systems, based on analysis of syntactic and semantic characteristics of coreference relations in the biomedical domain.
The framework not only can identify weaknesses of existing approaches, but also can provide insights into strategies for further improvement. We propose an approach to biomedical domain-specific coreference resolution that combines a set of syntactically and semantically motivated rules in terms of coreference type. Finally, we demonstrate that coreference resolution is a valuable process for pathway information discovery, through case studies. Our results show that an approach incorporating a coreference resolution process significantly improves information extraction performance.
  • Item
    Improving the utility of social media with Natural Language Processing
    HAN, BO ( 2014)
    Social media has been an attractive target for many natural language processing (NLP) tasks and applications in recent years. However, the unprecedented volume of data and the non-standard language register cause problems for off-the-shelf NLP tools. This thesis investigates the broad question of how NLP-based text processing can improve the utility (i.e., the effectiveness and efficiency) of social media data. In particular, text normalisation and geolocation prediction are closely examined in the context of Twitter text processing. Text normalisation is the task of restoring non-standard words to their standard forms. For instance, earthquick and 2morrw should be transformed into “earthquake” and “tomorrow”, respectively. Non-standard words often cause problems for existing tools trained on edited text sources such as newswire text. By applying text normalisation to reduce unknown non-standard words, the accuracy of NLP tools and downstream applications is expected to increase. In this thesis, I explore and develop lexical normalisation methods for Twitter text. I shift the focus of text normalisation from a cascaded token-based approach to a type-based approach using a combined lexicon, based on the analysis of existing and developed text normalisation methods. The type-based method achieved state-of-the-art end-to-end normalisation accuracy at the time of publication, i.e., 0.847 precision and 0.630 recall on a benchmark dataset. Furthermore, it is simple, lightweight and easily integrable, which makes it particularly well suited to large-scale data processing. Additionally, the effectiveness of the proposed normalisation method is shown in non-English text normalisation and other NLP tasks and applications. Geolocation prediction estimates a user’s primary location based on the text of their posts. It enables location-based data partitioning, which is crucial to a range of tasks and applications such as local event detection.
The partitioned location data can improve both the efficiency and the effectiveness of NLP tools and applications. In this thesis, I identify and explore several factors that affect the accuracy of text-based geolocation prediction in a unified framework. In particular, an extensive range of feature selection methods is compared to determine the optimised feature set for the geolocation prediction model. The results suggest feature selection is an effective method for improving the prediction accuracy regardless of geolocation model and location partitioning. Additionally, I examine the influence of other factors including non-geotagged data, user metadata, tweeting language, temporal influence, user geolocatability, and geolocation prediction confidence. The proposed stacking-based prediction model achieved 40.6% city-level accuracy and 40 km median error distance for English Twitter users on a recent benchmark dataset. These investigations provide practical insights into the design of a text-based geolocation prediction system, as well as the basis for further research on this task. Overall, the exploration of these two text processing tasks enhances the utility of social media data for relevant NLP tasks and downstream applications. The developed method and experimental results have immediate impact on future social media research.
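As a rough illustration of the type-based normalisation idea described in this abstract, the sketch below looks each token up in a lexicon that maps a non-standard word type to its standard form. It is not the thesis's implementation: the lexicon contents (only the two examples quoted in the abstract), the name `NORMALISATION_LEXICON`, and the function `normalise` are all assumptions made for illustration.

```python
# Hypothetical toy lexicon: non-standard word type -> standard form.
# The two entries are the examples quoted in the abstract; a real
# combined lexicon would be far larger and built from data.
NORMALISATION_LEXICON = {
    "earthquick": "earthquake",
    "2morrw": "tomorrow",
}


def normalise(tokens, lexicon=NORMALISATION_LEXICON):
    """Replace each token found in the lexicon (case-insensitively).

    Tokens without a lexicon entry are returned unchanged, so the
    lookup never alters words it knows nothing about.
    """
    return [lexicon.get(token.lower(), token) for token in tokens]


if __name__ == "__main__":
    print(normalise("another earthquick 2morrw ?".split()))
```

Because normalisation here is a single dictionary lookup per token, it needs no context model at run time, which is one plausible reading of why a type-based approach is described as lightweight.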
  • Item
    Scaling conditional random fields for natural language processing
    Cohn, Trevor A ( 2007-01)
    This thesis deals with the use of Conditional Random Fields (CRFs; Lafferty et al. (2001)) for Natural Language Processing (NLP). CRFs are probabilistic models for sequence labelling which are particularly well suited to NLP. They have many compelling advantages over other popular models such as Hidden Markov Models and Maximum Entropy Markov Models (Rabiner, 1990; McCallum et al., 2001), and have been applied to a number of NLP tasks with considerable success (e.g., Sha and Pereira (2003) and Smith et al. (2005)). Despite their apparent success, CRFs suffer from two main failings. Firstly, they often over-fit the training sample. This is a consequence of their considerable expressive power, and can be limited by a prior over the model parameters (Sha and Pereira, 2003; Peng and McCallum, 2004). Their second failing is that the standard methods for CRF training are often very slow, sometimes requiring weeks of processing time. This efficiency problem is largely ignored in current literature, although in practice the cost of training prevents the application of CRFs to many new, more complex tasks, and also prevents the use of densely connected graphs, which would allow for much richer feature sets. (For complete abstract open document)
  • Item
    Improving the utility of topic models: an uncut gem does not sparkle
    LAU, JEY HAN ( 2013)
    This thesis concerns a type of statistical model known as a topic model. Topic modelling learns abstract “topics” in a collection of documents, and by “topic” we mean an idea, theme or subject. For example, we may have an article that discusses space exploration, or a book about crime. Space exploration and crime, these two subjects, are the “topics” that we are talking about. As one might imagine, topic modelling has a direct application in digital libraries, as it automates the learning and categorisation of topics in books and articles. The merit of topic modelling, however, is that its machinery is not limited to processing just words but symbols in general. As such, topic modelling has seen applications in other areas outside text processing such as biomedical research for inferring protein families. Most applications, however, are small scale and experimental and much of the impact is still contained in academic research. The overarching theme of the thesis is thus to improve the utility of topic modelling. We achieve this in two ways: (1) by improving a few aspects of topic modelling to make it more accessible and usable by users; and (2) by proposing novel applications of topic modelling to real-world problems. In the first step, we look into improving the preprocessing methodology that turns documents into the input for topic models. We also experiment extensively to improve the visualisation of topics (one of the main outputs of topic models) to increase their usability for human users. In the second step, we apply topic modelling in a lexicography-oriented work to learn and detect new meanings that have emerged in words and in the social media space to identify popular social trends. Both were novel applications and delivered promising results, demonstrating the strength and wide applicability of topic models.
  • Item
    Collective document classification using explicit and implicit inter-document relationships
    Burford, Clinton ( 2013)
    Information systems are transforming the ways in which people generate, store and share information. One consequence of this change is a massive increase in the quantity of digital content the average person needs to deal with. A large part of the information systems challenge is about finding intelligent ways to help users locate and analyse this information. One tool that is available to build systems to address this challenge is automatic document classification. A document classifier is a statistical model for predicting a label for an input document that is represented as a set of features. The potential usefulness of such a generalised system for categorising documents based on their contents is very great. There are direct applications for systems that can answer complex document categorisation questions like: Is this product review generally positive or negative? Document classification systems can also become critical parts of most complex systems that need input documents to be selected based on complex criteria. This thesis addresses the question of how document classifiers can exploit information about the relationships between documents being classified. Normally, document classifiers work on a single document at a time: once the classifier has been trained from a set of labelled examples, it can then be used to label single input documents as required. Collective document classifiers learn a classifier that can be applied to a group of related documents. The inter-document relationships in the group are used to improve labelling performance beyond what is possible when considering documents in isolation. Work on collective document classifiers is based on the observation that some types of documents have features which are either ambiguous or not present in training data, but which have the special characteristic of indicating relationships between the labels of documents. 
Most often, an inter-document relationship indicates that two documents have the same label, but it may also indicate that they have different labels. In either case, classifiers gain an advantage if they can consider inter-document features. Inter-document features can be explicit, as when a document cites or quotes another, or implicit, as when documents exist in semantically related groups in which stylistic, structural or semantic similarities are informative, or when they are related by a spatial or temporal structure. In the first part of this thesis I survey the state-of-the-art in collective document classification and explore approaches for adding collective behaviour to standard document classifiers. I present an experimental evaluation of these techniques for use with explicit inter-document relationships. In the second part I develop techniques for extracting implicit inter-document relationships. In total, the work in this thesis assesses and extends the capabilities of collective document classifiers. Its contribution is in four main parts: (1) I introduce an approach that gives better than state of the art performance for collective classification of political debate transcripts; (2) I provide a comparative overview of collective document classification techniques to assist practitioners in choosing an algorithm for collective document classification tasks; (3) I demonstrate effective and novel approaches for generating collective classifiers from standard classifiers; and (4) I introduce a technique for inferring inter-document relationships based on matching phrases and show that these relationships can be used to improve overall document classification performance.
  • Item
    The effects of sampling and semantic categories on large-scale supervised relation extraction
    Willy ( 2012)
    The purpose of relation extraction is to identify novel pairs of entities which are related by a pre-specified relation such as hypernym or synonym. The traditional approach to relation extraction is to build a dedicated system for a particular relation, meaning that significant effort is required to repurpose the approach to new relations. We propose a generic approach based on supervised learning, which provides a standardised process for performing relation extraction on different relations and domains. We explore the feasibility of the approach over a range of relations and corpora, focusing particularly on the development of a realistic evaluation methodology for relation extraction. In addition to this, we investigate the impact of semantic categories on extraction effectiveness.
  • Item
    Computing relationships and relatedness between contextually diverse entities
    GRIESER, KARL ( 2011)
    When presented with a pair of entities such as a ball and a bat, a person may make the connection that both of these entities are involved in sport (e.g., the sports baseball or cricket, based on the individual's background), that the composition of the two entities is similar (e.g., a wooden ball and a wooden stick), or, if the person is especially creative, that the pair evokes a fancy dress ball where someone has come dressed as a bat. All of these connections are equally valid, but depending on the context the person is familiar with (e.g., sport, wooden objects, fancy dress), a particular connection may be more apparent to that person. From a computational perspective, identifying these relationships and calculating the level of relatedness of entity pairs requires consideration of all ways in which the entities are able to interact with one another. Existing approaches to identifying the relatedness of entities and the semantic relationships that exist between them fail to take into account the multiple diverse ways in which these entities may interact, and hence do not explore all potential ways in which entities may be related. In this thesis, I use the collaborative encyclopedia Wikipedia as the basis for the formulation of a measure of semantic relatedness that takes into account the contextual diversity of entities (called the Related Article Conceptual Overlap, or RACO, method), and describe several methods of relationship extraction that utilise the taxonomic structure of Wikipedia to identify pieces of text that describe relations between contextually diverse entities. I also describe the construction of a dataset of museum exhibit relatedness judgements used to evaluate the performance of RACO.
I demonstrate that RACO outperforms state-of-the-art measures of semantic relatedness over a collection of contextually diverse entities (museum exhibits), and that the taxonomic structure of Wikipedia provides a basis for identifying valid relationships between contextually diverse entities. As this work is presented in regard to the domain of Cultural Heritage and using Wikipedia as a basis for representation, I additionally describe the process for adapting the principle of conceptual overlap for calculating semantic relatedness and the relationship extraction methods based on taxonomic links to alternate contextually diverse domains, and for use with other representational resources.
  • Item
    Orthographic support for passing the reading hurdle in Japanese
    YENCKEN, LARS ( 2010)
    Learning a second language is, for the most part, a day-in day-out struggle against the mountain of new vocabulary a learner must acquire. Furthermore, since the number of new words to learn is so great, learners must acquire them autonomously. Evidence suggests that for languages with writing systems, native-like vocabulary sizes are only developed through reading widely, and that reading is only fruitful once learners have acquired the core vocabulary required for it to become smooth. Learners of Japanese have an especially high barrier in the form of the Japanese writing system, in particular its use of kanji characters. Recent work on dictionary accessibility has focused on compensating for learner errors in pronouncing unknown words; however, much difficulty remains. This thesis uses the rich visual nature of the Japanese orthography to support the study of vocabulary in several ways. Firstly, it proposes a range of kanji similarity measures and evaluates them over several new data sets, finding that the stroke edit distance and tree edit distance metrics best approximate human judgements. Secondly, it uses stroke edit distance to construct a model of kanji misrecognition, which we use as the basis for a new form of kanji search by similarity. Analysing query logs, we find that this new form of search was rapidly adopted by users, indicating its utility. We finally combine kanji confusion and pronunciation models into a new adaptive testing platform, Kanji Tester, modelled after aspects of the Japanese Language Proficiency Test. As the user tests themselves, the system adapts to their error patterns and uses this information to make future tests more difficult. Investigating logs of use, we find a weak positive correlation between ability estimates and time the system has been used. Furthermore, our adaptive models generated questions which were significantly more difficult than their control counterparts.
Overall, these contributions make a concerted effort to improve tools for learner self-study, so that learners can successfully overcome the reading hurdle and propel themselves towards greater proficiency. The data collected from these tools also forms a useful basis for further study of learner error and vocabulary development.
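To give a feel for what a stroke edit distance might look like, the sketch below applies the standard Levenshtein edit distance to sequences of stroke labels and normalises by the longer sequence's length. This is an assumption-laden illustration, not the thesis's definition: the stroke labels (`"h"`, `"v"`), the unit edit costs, the length normalisation, and the function names `edit_distance` and `stroke_distance` are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit-cost edits)."""
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b; start with the empty prefix of a.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if x == y else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def stroke_distance(strokes_a, strokes_b):
    """Edit distance scaled to [0, 1] by the longer stroke sequence."""
    longest = max(len(strokes_a), len(strokes_b))
    if longest == 0:
        return 0.0
    return edit_distance(strokes_a, strokes_b) / longest


if __name__ == "__main__":
    # Hypothetical stroke sequences: "h" = horizontal, "v" = vertical.
    first = ["h", "v", "h"]
    second = ["h", "h", "v", "h"]
    print(stroke_distance(first, second))
```

Under these assumptions a smaller value means two characters are written with more similar stroke sequences, which is the kind of signal a search-by-similarity feature could rank on.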