Show simple item record

dc.contributor.author: Li, Yitong
dc.date.accessioned: 2020-06-30T02:15:19Z
dc.date.available: 2020-06-30T02:15:19Z
dc.date.issued: 2019
dc.identifier.uri: http://hdl.handle.net/11343/240992
dc.description: © 2019 Yitong Li
dc.description.abstract: There are many challenges in building robust natural language applications. Machine-learning-based methods require large volumes of annotated text data, and variation in text can lead to problems, namely: (1) language is highly variable, exhibiting, for example, lexical and syntactic variation, which robust models should be able to handle; (2) text corpora are heterogeneous, often making language systems domain-brittle, so real-world language applications require solutions for domain adaptation and for training over corpora comprising multiple domains; and (3) many language applications are biased towards the demographics of the authors of the documents they are trained on, and so lack model fairness; demographic bias also raises privacy issues when a model is made available to others. In this thesis, I aim to build robust natural language models to tackle these problems, focusing on deep learning approaches, which have shown great success in language processing via representation learning. I pose three basic research questions: how to learn representations that are robust to language variation, to domain variation, and to demographic variables. Each research question is tackled using different approaches, including data augmentation, adversarial learning, and variational inference. For robustness to language variation, I study lexical and syntactic variation: specifically, a regularisation method is proposed to tackle lexical variation, and a data augmentation method is proposed to build robust models, using a range of language generation methods motivated by both linguistic and machine learning perspectives. For domain robustness, I focus on multi-domain learning and investigate both domain-supervised and domain-unsupervised learning, where domain labels may or may not be available.
Two types of models are proposed, based on adversarial learning and latent domain gating, to build robust models over heterogeneous text. For robustness to demographics, I show that demographic bias in the training corpus leads to fairness problems with respect to the demographics of authors, as well as privacy issues under inference attacks. Adversarial learning is adopted to mitigate bias in representation learning, improving model fairness and privacy preservation. To demonstrate the proposed approaches and evaluate their generalisation and robustness, both in-domain and out-of-domain experiments are conducted on two classes of language tasks: text classification and part-of-speech (POS) tagging. For multi-domain learning, experiments are conducted on multi-domain language identification and multi-domain sentiment classification, and I simulate domain-supervised and domain-unsupervised learning to evaluate domain robustness. Model fairness is evaluated with respect to different demographic attributes, and inference attacks are applied to test model privacy. The experiments demonstrate the advantages and robustness of the proposed methods. Finally, I discuss the relationships between the different forms of robustness, including their commonalities and differences. The limitations of this thesis are discussed in detail, including potential methods for addressing these shortcomings in future work, and opportunities to generalise the proposed methods to other language tasks. Above all, these methods of learning robust representations can contribute towards progress in natural language processing.
dc.rights: Terms and Conditions: Copyright in works deposited in Minerva Access is retained by the copyright owner. The work may not be altered without permission from the copyright owner. Readers may only download, print and save electronic copies of whole works for their own personal non-commercial use. Any use that exceeds these limits requires permission from the copyright owner. Attribution is essential when quoting or paraphrasing from these works.
dc.subject: natural language processing
dc.subject: deep learning
dc.subject: robust representation
dc.subject: data augmentation
dc.subject: language variation
dc.subject: domain adaptation
dc.subject: multi-domain learning
dc.subject: adversarial learning
dc.subject: demographic variables
dc.subject: fairness
dc.subject: variational inference
dc.subject: model privacy
dc.subject: privacy-preserving
dc.title: Towards Robust Representation of Natural Language Processing
dc.type: PhD thesis
melbourne.affiliation.department: Computing and Information Systems
melbourne.affiliation.faculty: Engineering
melbourne.thesis.supervisorname: Timothy Baldwin
melbourne.contributor.author: Li, Yitong
melbourne.thesis.supervisorothername: Trevor Cohn
melbourne.tes.fieldofresearch1: 080107 Natural Language Processing
melbourne.tes.fieldofresearch2: 080199 Artificial Intelligence and Image Processing not elsewhere classified
melbourne.tes.confirmed: true
melbourne.accessrights: Open Access

