Computing and Information Systems - Theses

  • Item
    Crowdsourcing lexical semantic judgements from bilingual dictionary users
    Fothergill, Richard James (2017)
    Words can take on many meanings, and collecting and identifying example usages representative of the full variety of meanings words can take is a bottleneck to the study of lexical semantics using statistical approaches. To perform supervised word sense disambiguation (WSD), or to evaluate knowledge-based methods, a corpus of texts annotated with senses from a dictionary may be constructed by paid experts. However, the cost usually prohibits more than a small sample of words and senses being represented in the corpus. Crowdsourcing methods promise to acquire data more cheaply, albeit with a greater challenge for quality control. Most crowdsourcing to date has incentivised participation in the form of a payment or by gamification of the resource construction task. However, with paid crowdsourcing the cost of human labour scales linearly with the output size, and while game-playing volunteers may be free, gamification studies must compete with a multi-billion-dollar games industry for players. In this thesis we develop and evaluate resources for computational semantics, working towards a crowdsourcing method that extracts information from naturally occurring human activities. A number of software products exist for glossing Japanese text with entries from a dictionary for English-speaking students. However, the most popular ones tend either to present an overwhelming amount of information containing every sense of every word, or else to hide too much information and risk removing senses with particular relevance to a specific text. By offering a glossing application with interactive features for exploring word senses, we create an opportunity to crowdsource human judgements about word senses and record human interaction with semantic NLP.
  • Item
    Supervised algorithms for complex relation extraction
    Khirbat, Gitansh (2017)
    Binary relation extraction is an essential component of information extraction systems, wherein the aim is to extract meaningful relations that might exist between a pair of entities within a sentence. Binary relation extraction systems have witnessed a significant improvement over the past three decades, ranging from rule-based systems to statistical natural language techniques including supervised, semi-supervised and unsupervised machine learning approaches. Modern question answering and summarization systems have motivated the need for extracting complex relations wherein the number of related entities is more than two. Complex relation extraction (CRE) systems are highly domain specific and often rely on traditional binary relation extraction techniques employed in a pipeline fashion, and are thus susceptible to processing-induced error propagation. In this thesis, we investigate and develop approaches to extract complex relations directly from natural language text. In particular, we deviate from the traditional disintegration of complex relations into constituent binary relations and propose using the shortest dependency parse spanning the n related entities as an alternative to facilitate direct CRE. We investigate this proposed approach by a comprehensive study of supervised learning algorithms with a special focus on training support vector machines, convolutional neural networks and deep learning ensemble algorithms. Research in the domain of CRE is stymied by a paucity of annotated data. To facilitate future exploration, we create two new datasets to evaluate our proposed CRE approaches on a pilot biographical fact extraction task. An evaluation of results on new and standard datasets concludes that using the shortest-path dependency parse in a supervised setting enables direct CRE with improved accuracy, beating current state-of-the-art CRE systems.
We further show the application of CRE to achieve state-of-the-art performance for directly extracting events without the need of disintegrating them into event trigger and event argument extraction processes.
  • Item
    Coreference resolution for biomedical pathway data
    Choi, Miji Jooyoung (2017)
    The study of biological pathways is a major activity in the life sciences. Biological pathways provide understanding and interpretation of many different kinds of biological mechanisms such as metabolism, sending of signals between cells, regulation of gene expression, and production of cells. If there are defects in a pathway, the result may be a disease. Thus, biological pathways are used to support diagnosis of disease, more effective drug prescription, or personalised treatments. Even though there are many pathway resources providing useful information discovered through manual effort, a great deal of relevant information concerning such pathways is scattered through the vast biomedical literature. With the growth in the volume of the biomedical literature, many natural language processing methods for automatic information extraction have been studied, but there still exist a variety of challenges such as complex or hidden representations due to the use of coreference expressions in texts. Linguistic expressions such as it, they, or the gene are frequently used by authors to avoid repeating the names of entities or repeating complex descriptions that have previously been introduced in the same text. This thesis addresses three research goals: (1) examining whether an existing coreference resolution approach in the general domain can be adapted to the biomedical domain; (2) investigating a heuristic strategy for coreference resolution in the biomedical literature; and (3) examining how coreference resolution can improve biological pathway data from the perspectives of information extraction, and of evaluation of existing pathway resources. In this thesis, we propose a new categorical framework that provides detailed analysis of performance of coreference resolution systems, based on analysis of syntactic and semantic characteristics of coreference relations in the biomedical domain.
The framework not only can identify weaknesses of existing approaches, but also can provide insights into strategies for further improvement. We propose an approach to biomedical domain-specific coreference resolution that combines a set of syntactically and semantically motivated rules in terms of coreference type. Finally, we demonstrate that coreference resolution is a valuable process for pathway information discovery, through case studies. Our results show that an approach incorporating a coreference resolution process significantly improves information extraction performance.
  • Item
    Unsupervised all-words sense distribution learning
    Bennett, Andrew (2016)
    There has recently been significant interest in unsupervised methods for learning word sense distributions, or most frequent sense information, in particular for applications where sense distinctions are needed. In addition to their direct application to word sense disambiguation (WSD), particularly where domain adaptation is required, these methods have successfully been applied to diverse problems such as novel sense detection or lexical simplification. Furthermore, they could be used to supplement or replace existing sources of sense frequencies, such as SemCor, which have many significant flaws. However, a major gap in the past work on sense distribution learning is that it has never been optimised for large-scale application to the entire vocabulary of a language, as would be required to replace sense frequency resources such as SemCor. In this thesis, we develop an unsupervised method for all-words sense distribution learning, which is suitable for language-wide application. We first optimise and extend HDP-WSI, an existing state-of-the-art sense distribution learning method based on HDP topic modelling. This is mostly achieved by replacing HDP with the more efficient HCA topic modelling algorithm in order to create HCA-WSI, which is over an order of magnitude faster than HDP-WSI and more robust. We then apply HCA-WSI across the vocabularies of several languages to create LexSemTm, which is a multilingual sense frequency resource of unprecedented size. Of note, LexSemTm contains sense frequencies for approximately 88% of polysemous lemmas in Princeton WordNet, compared to only 39% for SemCor, and the quality of data in each is shown to be roughly equivalent. Finally, we extend our sense distribution learning methodology to multiword expressions (MWEs), which to the best of our knowledge is a novel task (as is applying any kind of general-purpose WSD methods to MWEs).
We demonstrate that sense distribution learning for MWEs is comparable to that for simplex lemmas in all important respects, and we expand LexSemTm with MWE sense frequency data.
  • Item
    Improving the utility of social media with Natural Language Processing
    Han, Bo (2014)
    Social media has been an attractive target for many natural language processing (NLP) tasks and applications in recent years. However, the unprecedented volume of data and the non-standard language register cause problems for off-the-shelf NLP tools. This thesis investigates the broad question of how NLP-based text processing can improve the utility (i.e., the effectiveness and efficiency) of social media data. In particular, text normalisation and geolocation prediction are closely examined in the context of Twitter text processing. Text normalisation is the task of restoring non-standard words to their standard forms. For instance, “earthquick” and “2morrw” should be transformed into “earthquake” and “tomorrow”, respectively. Non-standard words often cause problems for existing tools trained on edited text sources such as newswire text. By applying text normalisation to reduce unknown non-standard words, the accuracy of NLP tools and downstream applications is expected to increase. In this thesis, I explore and develop lexical normalisation methods for Twitter text. I shift the focus of text normalisation from a cascaded token-based approach to a type-based approach using a combined lexicon, based on the analysis of existing and developed text normalisation methods. The type-based method achieved the state-of-the-art end-to-end normalisation accuracy at the time of publication, i.e., 0.847 precision and 0.630 recall on a benchmark dataset. Furthermore, it is simple, lightweight and easily integrable, which makes it particularly well suited to large-scale data processing. Additionally, the effectiveness of the proposed normalisation method is shown in non-English text normalisation and other NLP tasks and applications. Geolocation prediction estimates a user’s primary location based on the text of their posts. It enables location-based data partitioning, which is crucial to a range of tasks and applications such as local event detection.
The partitioned location data can improve both the efficiency and the effectiveness of NLP tools and applications. In this thesis, I identify and explore several factors that affect the accuracy of text-based geolocation prediction in a unified framework. In particular, an extensive range of feature selection methods is compared to determine the optimised feature set for the geolocation prediction model. The results suggest feature selection is an effective method for improving the prediction accuracy regardless of geolocation model and location partitioning. Additionally, I examine the influence of other factors including non-geotagged data, user metadata, tweeting language, temporal influence, user geolocatability, and geolocation prediction confidence. The proposed stacking-based prediction model achieved 40.6% city-level accuracy and 40km median error distance for English Twitter users on a recent benchmark dataset. These investigations provide practical insights into the design of a text-based normalisation system, as well as the basis for further research on this task. Overall, the exploration of these two text processing tasks enhances the utility of social media data for relevant NLP tasks and downstream applications. The developed method and experimental results have immediate impact on future social media research.
  • Item
    Improving the efficiency and capabilities of document structuring
    Marshall, Robert (2007)
    Natural language generation (NLG), the problem of creating human-readable documents by computer, is one of the major fields of research in computational linguistics. The task of creating a document is extremely common in many fields of activity. Accordingly, there are many potential applications for NLG: almost any document creation task could potentially be automated by an NLG system. Advanced forms of NLG could also be used to generate a document in multiple languages, or as an output interface for other programs, which might ordinarily produce a less-manageable collection of data. They may also be able to create documents tailored to the needs of individual users. This thesis deals with document structure, a recent theory which describes those aspects of a document’s layout which affect its meaning. As well as its theoretical interest, it is a useful intermediate representation in the process of NLG. There is a well-defined process for generating a document structure using constraint programming. We show how this process can be made considerably more efficient. This in turn allows us to extend the document structuring task to allow for summarisation and finer control of the document layout. This thesis is organised as follows. Firstly, we review the necessary background material in both natural language processing and constraint programming.
  • Item
    Scaling conditional random fields for natural language processing
    Cohn, Trevor A (2007-01)
    This thesis deals with the use of Conditional Random Fields (CRFs; Lafferty et al. (2001)) for Natural Language Processing (NLP). CRFs are probabilistic models for sequence labelling which are particularly well suited to NLP. They have many compelling advantages over other popular models such as Hidden Markov Models and Maximum Entropy Markov Models (Rabiner, 1990; McCallum et al., 2001), and have been applied to a number of NLP tasks with considerable success (e.g., Sha and Pereira (2003) and Smith et al. (2005)). Despite their apparent success, CRFs suffer from two main failings. Firstly, they often over-fit the training sample. This is a consequence of their considerable expressive power, and can be limited by a prior over the model parameters (Sha and Pereira, 2003; Peng and McCallum, 2004). Their second failing is that the standard methods for CRF training are often very slow, sometimes requiring weeks of processing time. This efficiency problem is largely ignored in current literature, although in practice the cost of training prevents the application of CRFs to many new, more complex tasks, and also prevents the use of densely connected graphs, which would allow for much richer feature sets. (For complete abstract open document)
  • Item
    Automatic identification of locative expressions from informal text
    Liu, Fei (2013)
    Informal place descriptions that are rich in locative expressions can be found in various contexts. The ability to extract locative expressions from such informal place descriptions is at the centre of improving the quality of services, such as interpreting geographical queries and emergency calls. While much attention has been focused on the identification of formal place references (e.g., Rathmines Road) from natural language, people tend to make heavy use of informal place references (e.g., my bedroom). This research addresses the problem by developing a model that is able to automatically identify locative expressions from informal text. Moreover, we study which aspects are helpful in the identification task. Utilising an existing manually annotated corpus, we re-annotate locative expressions and use them as the gold standard. With the gold standard in place, we take a machine learning approach to the identification task with well-reasoned features based on observation and intuition. Further, we study the impact of various feature setups on the performance of the model and provide analyses of the experimental results. With the best performing feature setup, the model is able to achieve a significant increase in performance over the baseline systems.
  • Item
    Improving the utility of topic models: an uncut gem does not sparkle
    Lau, Jey Han (2013)
    This thesis concerns a type of statistical model known as a topic model. Topic modelling learns abstract “topics” in a collection of documents, and by “topic” we mean an idea, theme or subject. For example, we may have an article that discusses space exploration, or a book about crime. Space exploration and crime, these two subjects, are the “topics” that we are talking about. As one might imagine, topic modelling has a direct application in digital libraries, as it automates the learning and categorisation of topics in books and articles. The merit of topic modelling, however, is that its machinery is not limited to processing just words but symbols in general. As such, topic modelling has seen applications in other areas outside text processing such as biomedical research for inferring protein families. Most applications, however, are small scale and experimental and much of the impact is still contained in academic research. The overarching theme of the thesis is thus to improve the utility of topic modelling. We achieve this in two ways: (1) by improving a few aspects of topic modelling to make it more accessible and usable by users; and (2) by proposing novel applications of topic modelling to real-world problems. In the first step, we look into improving the document preprocessing methodology that creates the input for topic models. We also experiment extensively to improve the visualisation of topics, one of the main outputs of topic models, to increase its usability for human users. In the second step, we apply topic modelling in lexicography-oriented work, to learn and detect new meanings that have emerged in words, and in the social media space, to identify popular social trends. Both were novel applications and delivered promising results, demonstrating the strength and wide applicability of topic models.
  • Item
    Collective document classification using explicit and implicit inter-document relationships
    Burford, Clinton (2013)
    Information systems are transforming the ways in which people generate, store and share information. One consequence of this change is a massive increase in the quantity of digital content the average person needs to deal with. A large part of the information systems challenge is about finding intelligent ways to help users locate and analyse this information. One tool that is available to build systems to address this challenge is automatic document classification. A document classifier is a statistical model for predicting a label for an input document that is represented as a set of features. The potential usefulness of such a generalised system for categorising documents based on their contents is very great. There are direct applications for systems that can answer complex document categorisation questions like: Is this product review generally positive or negative? Document classification systems can also become critical parts of most complex systems that need input documents to be selected based on complex criteria. This thesis addresses the question of how document classifiers can exploit information about the relationships between documents being classified. Normally, document classifiers work on a single document at a time: once the classifier has been trained from a set of labelled examples, it can then be used to label single input documents as required. Collective document classifiers learn a classifier that can be applied to a group of related documents. The inter-document relationships in the group are used to improve labelling performance beyond what is possible when considering documents in isolation. Work on collective document classifiers is based on the observation that some types of documents have features which are either ambiguous or not present in training data, but which have the special characteristic of indicating relationships between the labels of documents. 
Most often, an inter-document relationship indicates that two documents have the same label, but it may also indicate that they have different labels. In either case, classifiers gain an advantage if they can consider inter-document features. Inter-document features can be explicit, as when a document cites or quotes another, or implicit, as when documents exist in semantically related groups in which stylistic, structural or semantic similarities are informative, or when they are related by a spatial or temporal structure. In the first part of this thesis I survey the state of the art in collective document classification and explore approaches for adding collective behaviour to standard document classifiers. I present an experimental evaluation of these techniques for use with explicit inter-document relationships. In the second part I develop techniques for extracting implicit inter-document relationships. In total, the work in this thesis assesses and extends the capabilities of collective document classifiers. Its contribution is in four main parts: (1) I introduce an approach that gives better than state-of-the-art performance for collective classification of political debate transcripts; (2) I provide a comparative overview of collective document classification techniques to assist practitioners in choosing an algorithm for collective document classification tasks; (3) I demonstrate effective and novel approaches for generating collective classifiers from standard classifiers; and (4) I introduce a technique for inferring inter-document relationships based on matching phrases and show that these relationships can be used to improve overall document classification performance.