Computing and Information Systems - Theses

Search Results

Now showing 1 - 10 of 13
  • Item
    Towards Robust Representation of Natural Language Processing
    Li, Yitong (2019)
    There are many challenges in building robust natural language applications. Machine learning based methods require large volumes of annotated text data, and variation in that text leads to three problems: (1) language is highly variable, exhibiting both lexical and syntactic variation, and robust models must be able to handle it; (2) text corpora are heterogeneous, often making language systems domain-brittle, so real-world language applications need solutions for domain adaptation and for training on corpora comprising multiple domains; and (3) many language applications are biased towards the demographics of the authors of the documents they are trained on and lack model fairness, and this demographic bias also raises privacy issues when a model is made available to others. In this thesis, I aim to build robust natural language models that tackle these problems, focusing on deep learning approaches, which have shown great success in language processing via representation learning. I pose three basic research questions: how to learn representations that are robust to language variation, to domain variation, and to demographic variables. Each question is tackled using different approaches, including data augmentation, adversarial learning, and variational inference. For robustness to language variation, I study lexical and syntactic variation: a regularisation method is proposed to tackle lexical variation, and a data augmentation method is proposed to build robust models, drawing on a range of language generation methods from both linguistic and machine learning perspectives. For domain robustness, I focus on multi-domain learning and investigate both domain supervised and domain unsupervised learning, where domain labels may or may not be available; two types of models, based on adversarial learning and on latent domain gating, are proposed to build robust models for heterogeneous text. For robustness to demographics, I show that demographic bias in the training corpus leads to model fairness problems with respect to the demographics of the authors, as well as privacy issues under inference attacks; adversarial learning is adopted to mitigate bias in representation learning and so improve model fairness and privacy preservation. To evaluate generalisation and robustness, both in-domain and out-of-domain experiments are conducted on two classes of language tasks: text classification and part-of-speech (POS) tagging. For multi-domain learning, experiments on multi-domain language identification and multi-domain sentiment classification simulate both domain supervised and domain unsupervised settings. Model fairness is evaluated across different demographic attributes, and inference attacks are applied to test model privacy. The experiments demonstrate the advantages and the robustness of the proposed methods. Finally, I discuss the relations between the different forms of robustness, including their commonalities and differences; the limitations of this thesis, together with potential methods for addressing these shortcomings in future work; and opportunities to generalise the proposed methods to other language tasks. Above all, these methods of learning robust representations can contribute towards progress in natural language processing.
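    The adversarial component of this programme can be made concrete with a gradient-reversal adversary, one standard realisation of adversarial representation learning. The sketch below is illustrative only: module names, dimensions and the toy data are assumptions, not the thesis's exact model.
    ```python
    import torch
    from torch import nn
    from torch.autograd import Function

    class GradReverse(Function):
        """Identity on the forward pass; flips (and scales) gradients on the backward pass."""
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU())  # toy text encoder
    task_head = nn.Linear(128, 2)   # main task, e.g. sentiment
    adv_head = nn.Linear(128, 2)    # adversary predicting a demographic attribute

    def combined_loss(x, y_task, y_demog, lambd=1.0):
        h = encoder(x)
        task_loss = nn.functional.cross_entropy(task_head(h), y_task)
        # the reversed gradient trains the encoder to *defeat* the adversary,
        # stripping demographic signal from the representation h
        adv_loss = nn.functional.cross_entropy(adv_head(GradReverse.apply(h, lambd)), y_demog)
        return task_loss + adv_loss

    loss = combined_loss(torch.randn(8, 300), torch.randint(0, 2, (8,)), torch.randint(0, 2, (8,)))
    loss.backward()
    ```
    Minimising this combined loss trains the task head normally while driving the encoder towards representations from which the adversary cannot recover the demographic attribute.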
  • Item
    Memory-augmented neural networks for better discourse understanding
    Liu, Fei (2019)
    Discourse understanding, due to the multi-sentence nature of discourse, requires consideration of larger contexts, capturing long-range dependencies, and modelling the interactions of entities. While conventional models are unable to keep information stably over long timescales, memory-augmented models are better able to store and access knowledge, making them well suited for discourse. In this thesis, we introduce a number of methods for improving memory-augmented models to better understand discourse, validating the utility of memory and establishing a firm base for future studies to build upon.
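    The core operation such memory-augmented models rely on is content-based attention over a set of memory slots (for example, one slot per discourse entity). A minimal sketch of a memory read follows; the slot count, dimensions and random data are illustrative assumptions, not the thesis's architecture.
    ```python
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def memory_read(query, keys, values):
        """Soft lookup: attend over memory slots, return a weighted mix of their values."""
        weights = softmax(keys @ query)   # relevance of each slot to the query
        return weights @ values

    rng = np.random.default_rng(0)
    M, d = 4, 16                          # 4 memory slots, e.g. one per tracked entity
    keys, values = rng.normal(size=(M, d)), rng.normal(size=(M, d))
    state = memory_read(rng.normal(size=d), keys, values)
    ```
    Because the slots persist across sentences, reading from them gives the model stable access to information introduced far earlier in the discourse.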
  • Item
    On the use of prior and external knowledge in neural sequence models
    Hoang, Cong Duy Vu (2019)
    Neural sequence models have recently achieved great success across various natural language processing tasks. In practice, they require massive amounts of annotated training data to reach their desired performance; however, such data is not always available for the languages, domains or tasks at hand. Prior and external knowledge provides additional contextual information that can improve modelling performance and compensate for the lack of large training sets, particularly in low-resource situations. In this thesis, we investigate the usefulness of prior and external knowledge for improving neural sequence models. We propose the use of various kinds of prior and external knowledge and present different approaches for integrating them into both the training and inference phases of neural sequence models. The main contributions of this thesis fall into two major parts. The first part concerns training and modelling of neural sequence models. Here we investigate different situations, particularly low-resource settings, in which prior and external knowledge, such as side information, linguistic factors and monolingual data, is shown to substantially improve the performance of neural sequence models. In addition, we introduce a new means of incorporating prior and external knowledge based on the moment matching framework, which exploits such knowledge as global features of generated sequences in order to improve the overall quality of the desired output sequence. The second part concerns decoding of neural sequence models, for which we propose a novel decoding framework based on relaxed continuous optimisation that addresses one of the drawbacks of existing approximate decoding methods: their limited ability to incorporate global factors, due to intractable search. We hope that this thesis will shed light on the use of prior and external knowledge in neural sequence models, in both their training and decoding phases.
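    To illustrate the decoding idea: intractable discrete search is replaced by gradient-based optimisation of a continuous relaxation of the output sequence, into which a global factor can be folded directly. The toy objective below (random stand-in model scores, a repetition penalty as the global factor) is an assumption for illustration, not the thesis's actual model.
    ```python
    import torch

    T, V = 6, 50                                    # toy sequence length and vocabulary
    torch.manual_seed(0)
    model_scores = torch.randn(T, V)                # stand-in for per-position model scores
    logits = torch.zeros(T, V, requires_grad=True)  # continuous relaxation of the output
    opt = torch.optim.Adam([logits], lr=0.1)

    for _ in range(200):
        q = torch.softmax(logits, dim=-1)           # soft (relaxed) token choices
        local = (q * model_scores).sum()            # local, per-position score
        repeat_pen = (q.sum(0) ** 2).sum()          # a *global* factor over the whole sequence
        loss = -(local - 0.1 * repeat_pen)
        opt.zero_grad(); loss.backward(); opt.step()

    decoded = torch.softmax(logits, dim=-1).argmax(dim=-1)  # project back to discrete tokens
    ```
    The global penalty is evaluated over the entire relaxed sequence at once, which is exactly what greedy or beam search cannot easily accommodate.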
  • Item
    Compositional morphology through deep learning
    Vylomova, Ekaterina (2018)
    Most human languages have sophisticated morphological systems. In order to build successful models of language processing, we need to focus on morphology, the internal structure of words. In this thesis, we study two morphological processes: inflection (word change rules, e.g. run -- runs) and derivation (word formation rules, e.g. run -- runner). We first evaluate the ability of contemporary models trained under the distributional hypothesis, which states that a word's meaning can be expressed by the context in which it appears, to capture these types of morphology. Our study reveals that inflections are predicted with high accuracy, whereas derivations are more challenging due to the irregularity of their meaning change. We then demonstrate that supplying the model with character-level information improves predictions and makes more efficient use of language resources, especially in morphologically rich languages. We then address the question of to what extent, and which, information about word properties (such as gender, case, number) can be predicted entirely from a word's sentential context. To this end, we introduce the novel task of contextual inflection prediction. Our experiments on predicting morphological features and the corresponding word form from sentential context show that the task is challenging, and that as morphological complexity increases, performance drops significantly. We find that some morphological categories (e.g., verbal tense) are inherent and typically cannot be predicted from context, while others (e.g., adjective number and gender) are contextual and inferred from agreement. Compared to morphological inflection tasks, where morphological features are explicitly provided and the system has to predict only the form, accuracy on this task is much lower. Finally, we turn to word formation: derivation. Our experiments show that derivations are less regular and systematic, and we study how indicative a sentential context is of the type of meaning change. Our results suggest that even though inflections are more productive and regular than derivations, the latter also present cases of high regularity in meaning and form change, but often require extra information, such as etymology, word frequency, and more fine-grained annotation, to be predicted with high accuracy.
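    Under the distributional hypothesis, a semantically regular inflection shows up as a roughly constant offset between word vectors, which is one intuition for why inflection is easier to predict than derivation. A toy illustration follows, with vectors constructed by hand so that the offset holds exactly; real experiments of this kind use learned distributional embeddings.
    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    stems = {w: rng.normal(size=8) for w in ["run", "walk", "talk"]}
    PLURAL = rng.normal(size=8)                    # toy offset shared by one inflection
    vecs = dict(stems)
    vecs.update({w + "s": v + PLURAL for w, v in stems.items()})

    def nearest(target, vecs, exclude):
        """Most cosine-similar word to a target vector, ignoring the query words."""
        sims = {w: v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
                for w, v in vecs.items() if w not in exclude}
        return max(sims, key=sims.get)

    # run : runs :: walk : ?
    print(nearest(vecs["runs"] - vecs["run"] + vecs["walk"], vecs, {"run", "runs", "walk"}))
    # -> "walks"
    ```
    Derivations like run -- runner break this picture precisely because their meaning change, and hence the offset, is irregular.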
  • Item
    Analysing the interplay of location, language and links utilising geotagged Twitter content
    Rahimi, Afshin (2018)
    Language use and interactions on social media are geographically biased. In this work we exploit this bias in predictive models of user geolocation and lexical dialectology. User geolocation is an important component of applications such as personalised search and recommender systems. We propose text-based and network-based geolocation models, and compare them over benchmark datasets, yielding state-of-the-art performance. We also propose hybrid and joint text-and-network geolocation models that improve upon text-only or network-only models, and show that the joint models achieve reasonable performance in minimal-supervision scenarios, as often arise in real-world datasets. Finally, we propose the use of continuous representations of location, which enables regression modelling of both geolocation and lexical dialectology. We show that our data-driven lexical dialectology model provides qualitative insights into the study of geographical lexical variation.
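    Treating location as a continuous target is what makes regression modelling possible. A minimal sketch with bag-of-words features and ridge regression over (latitude, longitude) pairs follows; the four toy tweets and coordinates are invented for illustration, and the thesis's models and datasets are of course far richer.
    ```python
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge

    tweets = ["yinz headed dahntahn later", "wicked good chowdah tonight",
              "howdy yall fixin to rodeo", "hella foggy by the bay"]
    coords = np.array([[40.44, -80.00], [42.36, -71.06],
                       [29.76, -95.37], [37.77, -122.42]])  # (lat, lon) per user

    X = TfidfVectorizer().fit_transform(tweets)
    geo = Ridge(alpha=1.0).fit(X, coords)   # multi-output regression on coordinates
    print(geo.predict(X[:1]))               # predicted (lat, lon) for the first user
    ```
    Because the output is continuous, the same machinery supports lexical dialectology: regressing word usage against location recovers geographically varying vocabulary.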
  • Item
    Crowdsourcing lexical semantic judgements from bilingual dictionary users
    Fothergill, Richard James (2017)
    Words can take on many meanings, and collecting and identifying example usages representative of their full variety is a bottleneck in the study of lexical semantics using statistical approaches. To perform supervised word sense disambiguation (WSD), or to evaluate knowledge-based methods, a corpus of texts annotated with senses from a dictionary may be constructed by paid experts; however, the cost usually prohibits more than a small sample of words and senses being represented in the corpus. Crowdsourcing methods promise to acquire data more cheaply, albeit with greater challenges for quality control. Most crowdsourcing to date has incentivised participation through payment or through gamification of the resource construction task. However, with paid crowdsourcing the cost of human labour scales linearly with the output size, and while game-playing volunteers may be free, gamification studies must compete with a multi-billion-dollar games industry for players. In this thesis we develop and evaluate resources for computational semantics, working towards a crowdsourcing method that extracts information from naturally occurring human activity. A number of software products exist for glossing Japanese text with dictionary entries for English-speaking students; however, the most popular ones tend either to present an overwhelming amount of information, covering every sense of every word, or to hide too much information, risking the removal of senses with particular relevance to a specific text. By offering a glossing application with interactive features for exploring word senses, we create an opportunity to crowdsource human judgements about word senses and to record human interaction with semantic NLP.
  • Item
    Coreference resolution for biomedical pathway data
    Choi, Miji Jooyoung (2017)
    The study of biological pathways is a major activity in the life sciences. Biological pathways support the understanding and interpretation of many different kinds of biological mechanisms, such as metabolism, signalling between cells, regulation of gene expression, and production of cells; defects in a pathway may result in disease. Biological pathways are therefore used to support diagnosis of disease, more effective drug prescription, and personalised treatments. Even though many pathway resources provide useful, manually curated information, a great deal of relevant information concerning such pathways is scattered across the vast biomedical literature. With the growth in the volume of that literature, many natural language processing methods for automatic information extraction have been studied, but a variety of challenges remain, such as complex or hidden representations arising from the use of coreference expressions in text. Expressions such as it, they, or the gene are frequently used by authors to avoid repeating the names of entities, or complex descriptions, that have previously been introduced in the same text. This thesis addresses three research goals: (1) examining whether an existing general-domain coreference resolution approach can be adapted to the biomedical domain; (2) investigating a heuristic strategy for coreference resolution in the biomedical literature; and (3) examining how coreference resolution can improve biological pathway data, from the perspectives both of information extraction and of the evaluation of existing pathway resources. In this thesis, we propose a new categorical framework that provides detailed analysis of the performance of coreference resolution systems, based on analysis of the syntactic and semantic characteristics of coreference relations in the biomedical domain. The framework not only identifies weaknesses of existing approaches, but also provides insights into strategies for further improvement. We propose a biomedical domain-specific approach to coreference resolution that combines sets of syntactically and semantically motivated rules according to coreference type. Finally, we demonstrate through case studies that coreference resolution is a valuable process for pathway information discovery. Our results show that an approach incorporating a coreference resolution process significantly improves information extraction performance.
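    The flavour of rules dispatched by coreference type can be sketched very simply: pronouns are filtered by number agreement, while definite noun phrases are matched by semantic type against prior mentions. The mentions, types and rule set below are toy assumptions for illustration, not the thesis's actual rules.
    ```python
    # prior mentions in textual order: (surface form, semantic type, number)
    mentions = [("BRCA1", "gene", "sg"), ("p53 and MDM2", "gene", "pl"),
                ("the MAPK pathway", "pathway", "sg")]

    def resolve(anaphor, mentions):
        """Tiny rule set, dispatched on the type of coreference expression."""
        if anaphor in {"it", "this protein"}:      # singular pronoun / demonstrative
            cands = [m for m in mentions if m[2] == "sg"]
        elif anaphor in {"they", "these genes"}:   # plural pronoun / demonstrative
            cands = [m for m in mentions if m[2] == "pl"]
        elif anaphor.startswith("the "):           # definite NP: match semantic type
            head = anaphor.split()[-1]             # e.g. "gene", "pathway"
            cands = [m for m in mentions if m[1] == head]
        else:
            cands = []
        return cands[-1][0] if cands else None     # most recent compatible mention

    print(resolve("they", mentions))         # -> "p53 and MDM2"
    print(resolve("the pathway", mentions))  # -> "the MAPK pathway"
    ```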
  • Item
    Improving the utility of social media with Natural Language Processing
    Han, Bo (2014)
    Social media has been an attractive target for many natural language processing (NLP) tasks and applications in recent years. However, the unprecedented volume of data and the non-standard language register cause problems for off-the-shelf NLP tools. This thesis investigates the broad question of how NLP-based text processing can improve the utility (i.e., the effectiveness and efficiency) of social media data. In particular, text normalisation and geolocation prediction are closely examined in the context of Twitter text processing. Text normalisation is the task of restoring non-standard words to their standard forms; for instance, earthquick and 2morrw should be transformed into "earthquake" and "tomorrow", respectively. Non-standard words often cause problems for existing tools trained on edited text sources such as newswire. By applying text normalisation to reduce the number of unknown non-standard words, the accuracy of NLP tools and downstream applications is expected to increase. In this thesis, I explore and develop lexical normalisation methods for Twitter text. Based on analysis of existing and newly developed normalisation methods, I shift the focus from a cascaded token-based approach to a type-based approach using a combined lexicon. The type-based method achieved state-of-the-art end-to-end normalisation accuracy at the time of publication, i.e., 0.847 precision and 0.630 recall on a benchmark dataset. Furthermore, it is simple, lightweight and easily integrable, making it particularly well suited to large-scale data processing. The effectiveness of the proposed normalisation method is also demonstrated on non-English text normalisation and on other NLP tasks and applications. Geolocation prediction estimates a user's primary location based on the text of their posts. It enables location-based data partitioning, which is crucial to a range of tasks and applications such as local event detection, and the partitioned location data can improve both the efficiency and the effectiveness of NLP tools and applications. In this thesis, I identify and explore several factors that affect the accuracy of text-based geolocation prediction within a unified framework. In particular, an extensive range of feature selection methods is compared to determine an optimised feature set for the geolocation prediction model. The results suggest that feature selection is an effective means of improving prediction accuracy regardless of the geolocation model and location partitioning. Additionally, I examine the influence of other factors, including non-geotagged data, user metadata, tweeting language, temporal effects, user geolocatability, and geolocation prediction confidence. The proposed stacking-based prediction model achieved 40.6% city-level accuracy and a 40 km median error distance for English Twitter users on a recent benchmark dataset. These investigations provide practical insights into the design of text-based geolocation prediction systems, as well as a basis for further research on this task. Overall, the exploration of these two text processing tasks enhances the utility of social media data for relevant NLP tasks and downstream applications, and the developed methods and experimental results offer immediate benefits for future social media research.
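    The appeal of the type-based method is its simplicity: at its core it is a lexicon lookup mapping non-standard types to standard forms, applied uniformly wherever the type occurs. A stripped-down sketch follows, reusing the abstract's own examples; the three-entry lexicon stands in for the much larger combined lexicon the thesis constructs.
    ```python
    # toy combined normalisation lexicon: non-standard type -> standard form
    LEXICON = {"earthquick": "earthquake", "2morrw": "tomorrow", "u": "you"}

    def normalise(tweet):
        """Type-based lookup: in-vocabulary tokens pass through unchanged."""
        return " ".join(LEXICON.get(tok, tok) for tok in tweet.lower().split())

    print(normalise("there was an earthquick , see u 2morrw"))
    # -> "there was an earthquake , see you tomorrow"
    ```
    Because the mapping is per type rather than per token-in-context, it can be precomputed once and applied at streaming speed, which is what makes the approach suited to large-scale processing.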
  • Item
    Improving the utility of topic models: an uncut gem does not sparkle
    Lau, Jey Han (2013)
    This thesis concerns a type of statistical model known as a topic model. Topic modelling learns abstract "topics" in a collection of documents, where by "topic" we mean an idea, theme or subject. For example, we may have an article that discusses space exploration, or a book about crime; space exploration and crime are the "topics" in question. As one might imagine, topic modelling has a direct application in digital libraries, as it automates the learning and categorisation of topics in books and articles. A further merit of topic modelling is that its machinery is not limited to processing words but handles symbols in general; as such, topic modelling has seen applications outside text processing, such as inferring protein families in biomedical research. Most applications, however, are small-scale and experimental, and much of the impact remains confined to academic research. The overarching theme of the thesis is therefore to improve the utility of topic modelling. We achieve this in two ways: (1) by improving several aspects of topic modelling to make it more accessible and usable; and (2) by proposing novel applications of topic modelling to real-world problems. For the first, we improve the preprocessing methodology of the documents that serve as input to topic models, and we experiment extensively with improving the visualisation of topics, one of the main outputs of topic models, to increase their usability for human users. For the second, we apply topic modelling to lexicography-oriented work, learning and detecting new meanings that have emerged in words, and to the social media space, identifying popular social trends. Both are novel applications and delivered promising results, demonstrating the strength and wide applicability of topic models.
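    For concreteness, the kind of model in question can be run in a few lines; here a latent Dirichlet allocation model over toy documents echoing the abstract's two example subjects. The corpus and parameter choices are illustrative assumptions only.
    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["the rocket launch advanced space exploration",
            "astronauts orbit the moon on a space mission",
            "police charged the suspect with armed robbery",
            "the court heard evidence in the crime trial"]

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    vocab = vec.get_feature_names_out()

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    for k, weights in enumerate(lda.components_):   # per-topic word weights
        top = weights.argsort()[-4:][::-1]
        print(f"topic {k}:", [vocab[i] for i in top])
    ```
    Nothing in the model knows these symbols are English words, which is exactly the generality the abstract appeals to.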
  • Item
    Collective document classification using explicit and implicit inter-document relationships
    Burford, Clinton (2013)
    Information systems are transforming the ways in which people generate, store and share information. One consequence of this change is a massive increase in the quantity of digital content the average person needs to deal with, and a large part of the information systems challenge is finding intelligent ways to help users locate and analyse this information. One tool available for building systems to address this challenge is automatic document classification. A document classifier is a statistical model for predicting a label for an input document represented as a set of features. Such a generalised means of categorising documents by their contents is potentially very useful: there are direct applications for systems that can answer document categorisation questions like "Is this product review generally positive or negative?", and document classification systems are critical components of many complex systems that need input documents selected against complex criteria. This thesis addresses the question of how document classifiers can exploit information about the relationships between the documents being classified. Normally, document classifiers work on a single document at a time: once trained from a set of labelled examples, the classifier labels single input documents as required. Collective document classifiers instead learn a classifier that can be applied to a group of related documents, using the inter-document relationships in the group to improve labelling performance beyond what is possible when considering documents in isolation. Work on collective document classifiers is based on the observation that some types of documents have features which are either ambiguous or absent from the training data, but which have the special characteristic of indicating relationships between the labels of documents. Most often, an inter-document relationship indicates that two documents have the same label, but it may also indicate that they have different labels; in either case, classifiers gain an advantage if they can consider inter-document features. These features can be explicit, as when one document cites or quotes another, or implicit, as when documents exist in semantically related groups in which stylistic, structural or semantic similarities are informative, or when they are related by a spatial or temporal structure. In the first part of this thesis I survey the state of the art in collective document classification and explore approaches for adding collective behaviour to standard document classifiers, presenting an experimental evaluation of these techniques for use with explicit inter-document relationships. In the second part I develop techniques for extracting implicit inter-document relationships. In total, the work in this thesis assesses and extends the capabilities of collective document classifiers. Its contribution has four main parts: (1) I introduce an approach that outperforms the state of the art for collective classification of political debate transcripts; (2) I provide a comparative overview of collective document classification techniques to assist practitioners in choosing an algorithm for collective document classification tasks; (3) I demonstrate effective and novel approaches for generating collective classifiers from standard classifiers; and (4) I introduce a technique for inferring inter-document relationships based on matching phrases, and show that these relationships can be used to improve overall document classification performance.
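    One simple way to turn a standard classifier into a collective one, in the spirit of the approaches surveyed here, is to interpolate each document's base prediction with its neighbours' current predictions over "same label" links and iterate to a fixed point. The numbers and link matrix below are toy assumptions, not a specific algorithm or dataset from the thesis.
    ```python
    import numpy as np

    base = np.array([0.9, 0.6, 0.4, 0.2])   # P(positive) from a per-document classifier
    A = np.array([[0, 1, 1, 0],             # inter-document links (e.g. quotation,
                  [1, 0, 0, 0],             #  matching phrases) suggesting same label
                  [1, 0, 0, 1],
                  [0, 0, 1, 0]], float)

    probs = base.copy()
    for _ in range(20):                      # iterate until predictions stabilise
        neigh = A @ probs / np.maximum(A.sum(axis=1), 1.0)
        probs = 0.5 * base + 0.5 * neigh     # balance local evidence and neighbours

    print(probs > 0.5)                       # collective labels
    ```
    Documents whose own features are ambiguous (here the 0.4 and 0.6 cases) are pulled towards the labels of the documents they are linked to, which is the advantage collective classification is after.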