Computing and Information Systems - Theses

Permanent URI for this collection

Search Results

Now showing 1 - 2 of 2
  • Item
    Thumbnail Image
    Towards Robust Representation of Natural Language Processing
    Li, Yitong ( 2019)
    There are many challenges in building robust natural language applications. Machine learning based methods require large volumes of annotated text data, and variations over text can lead to problems, namely: (1) language can be highly variable and expressed with different variations, such as lexical and syntactic. Robust models should be able to handle these variations. (2) A text corpus is heterogeneous, often making language systems domain-brittle. Solutions for domain adaptation and training with corpora comprised of multiple domains are required for language applications in the real world. (3) Many language applications tend to be biased to the demographic of the authors of documents the system is trained on, and lack model fairness. Demographic bias also causes privacy issues when a model is made available to others. In this thesis, I aim to build robust natural language models to tackle these problems, focusing on deep learning approaches which have shown great success in language processing via representation learning. I pose three basic research questions: how to learn representations that are robust to language variation, robust to domain variation, and robust to demographic variables. Each of these research questions is tackled using different approaches, including data augmentation, adversarial learning, and variational inference. For learning robust representations to language variation, I study lexical variation and syntactic variation. To be specific, a regularisation method is proposed to tackle lexical variation, and a data augmentation method is proposed to build robust models, using a range of language generation methods from both linguistic and machine learning perspectives. For domain robustness, I focus on multi-domain learning and investigate domain supervised and unsupervised learning, where domain labels may or may not be available. Two types of models are proposed, via adversarial learning and latent domain gating, to build robust models for heterogeneous text. For robustness to demographics, I show that demographic bias in the training corpus leads to model fairness problems with respect to the demographic of the authors, as well as privacy issues under inference attacks. Adversarial learning is adopted to mitigate bias in representation learning, to improve model fairness and privacy-preservation. To demonstrate the proposed approaches, a range of tasks are considered, including text classification and POS tagging. To evaluate the generalisation and robustness, both in-domain and out-of-domain experiments are conducted with two classes of language tasks: text classification and part-of-speech tagging. For multi-domain learning, multi-domain language identification and multi-domain sentiment classification are conducted, and I simulate domain supervised learning and domain unsupervised learning to evaluate domain robustness. I evaluate model fairness with different demographic attributes and apply inference attacks to test model privacy. The experiments show the advantages and the robustness of the proposed methods. Finally, I discuss the relations between the different forms of robustness, including their commonalities and differences. The limitations of this thesis are discussed in detail, including potential methods to address these shortcomings in future work, and potential opportunities to generalise the proposed methods to other language tasks. Above all, these methods of learning robust representations can contribute towards progress in natural language processing.
  • Item
    Thumbnail Image
    Analysing the interplay of location, language and links utilising geotagged Twitter content
    Afshin, Rahimi ( 2018)
    Language use and interactions on social media are geographically biased. In this work we utilise this bias in predictive models of user geolocation and lexical dialectology. User geolocation is an important component of applications such as personalised search and recommendation systems. We propose text-based and network-based geolocation models, and compare them over benchmark datasets yielding state-of-the- art performance. We also propose hybrid and joint text and network geolocation models that improve upon text or network only models and show that the joint models are able to achieve reasonable performance in minimal supervision scenarios, as often happens in real world datasets. Finally, we also propose the use of continuous representations of location, which enables regression modelling of geolocation and lexical dialectology. We show that our proposed data-driven lexical dialectology model provides qualitative insights in studying geographical lexical variation.