|dc.description.abstract||There are many challenges in building robust natural language applications.
Machine learning-based methods require large volumes of annotated text data, and variation in text can lead to problems, namely:
(1) Language is highly variable, exhibiting many forms of variation, such as lexical and syntactic variation.
Robust models should be able to handle these variations.
(2) Text corpora are heterogeneous, often making language systems domain-brittle.
Solutions for domain adaptation and for training on corpora comprising multiple domains are required for real-world language applications.
(3) Many language applications are biased towards the demographics of the authors of the documents they are trained on, and lack model fairness.
Demographic bias also causes privacy issues when a model is made available to others.
In this thesis, I aim to build robust natural language models to tackle these problems, focusing on deep learning approaches, which have shown great success in language processing via representation learning.
I pose three basic research questions:
how to learn representations that are robust to language variation, robust to domain variation, and robust to demographic variables.
Each of these research questions is tackled using different approaches, including data augmentation, adversarial learning, and variational inference.
To learn representations robust to language variation, I study lexical and syntactic variation.
Specifically, a regularisation method is proposed to tackle lexical variation, and a data augmentation method is proposed to build robust models, drawing on a range of language generation techniques from both linguistic and machine learning perspectives.
For domain robustness, I focus on multi-domain learning and investigate both domain-supervised and domain-unsupervised learning, i.e., settings where domain labels are or are not available.
Two types of models are proposed, via adversarial learning and latent domain gating, to build robust models for heterogeneous text.
For robustness to demographics, I show that demographic bias in the training corpus leads to model fairness problems with respect to the demographic of the authors, as well as privacy issues under inference attacks.
Adversarial learning is adopted to mitigate bias in representation learning, improving model fairness and privacy preservation.
To demonstrate the proposed approaches and evaluate their generalisation and robustness, both in-domain and out-of-domain experiments are conducted on two classes of language tasks: text classification and part-of-speech (POS) tagging.
For multi-domain learning, experiments on multi-domain language identification and multi-domain sentiment classification are conducted, and I simulate both domain-supervised and domain-unsupervised learning to evaluate domain robustness.
I evaluate model fairness across different demographic attributes and apply inference attacks to test model privacy.
The experiments demonstrate the advantages and robustness of the proposed methods.
Finally, I discuss the relations between the different forms of robustness, including their commonalities and differences.
The limitations of this thesis are discussed in detail, including potential methods to address these shortcomings in future work, and opportunities to generalise the proposed methods to other language tasks.
Overall, these methods for learning robust representations can contribute towards progress in natural language processing.||