Computing and Information Systems - Theses

Search Results

Now showing 1 - 4 of 4
  • Item
    Towards Robust Representation of Natural Language Processing
    Li, Yitong (2019)
    There are many challenges in building robust natural language applications. Machine learning based methods require large volumes of annotated text data, and variation in text causes several problems: (1) language is highly variable, with the same meaning expressed in different ways, both lexically and syntactically, and robust models should be able to handle this variation; (2) text corpora are heterogeneous, which often makes language systems domain-brittle, so real-world language applications require solutions for domain adaptation and for training on corpora comprising multiple domains; (3) many language applications are biased towards the demographic of the authors of the documents the system is trained on, and so lack model fairness; demographic bias also causes privacy issues when a model is made available to others. In this thesis, I aim to build robust natural language models to tackle these problems, focusing on deep learning approaches, which have shown great success in language processing via representation learning. I pose three basic research questions: how to learn representations that are robust to language variation, robust to domain variation, and robust to demographic variables. Each question is tackled using different approaches, including data augmentation, adversarial learning, and variational inference. To learn representations robust to language variation, I study lexical and syntactic variation: a regularisation method is proposed to tackle lexical variation, and a data augmentation method is proposed to build robust models, using a range of language generation methods from both linguistic and machine learning perspectives. For domain robustness, I focus on multi-domain learning and investigate both domain-supervised and domain-unsupervised learning, where domain labels may or may not be available; two types of models, based on adversarial learning and latent domain gating, are proposed to build robust models for heterogeneous text. For robustness to demographics, I show that demographic bias in the training corpus leads to fairness problems with respect to the demographic of the authors, as well as privacy issues under inference attacks; adversarial learning is adopted to mitigate bias in representation learning and to improve model fairness and privacy preservation. To demonstrate the proposed approaches and to evaluate generalisation and robustness, both in-domain and out-of-domain experiments are conducted on two classes of language tasks: text classification and part-of-speech (POS) tagging. For multi-domain learning, I conduct multi-domain language identification and multi-domain sentiment classification, simulating both domain-supervised and domain-unsupervised settings to evaluate domain robustness. I evaluate model fairness with respect to different demographic attributes and apply inference attacks to test model privacy. The experiments show the advantages and the robustness of the proposed methods. Finally, I discuss the relations between the different forms of robustness, including their commonalities and differences, as well as the limitations of this thesis, potential methods to address these shortcomings in future work, and opportunities to generalise the proposed methods to other language tasks. Above all, these methods for learning robust representations can contribute towards progress in natural language processing.
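    As a rough illustration of the adversarial learning idea described above, the sketch below is a minimal, hypothetical PyTorch example (names, sizes and the bag-of-words encoder are my own assumptions, not the thesis code): a gradient-reversal layer lets the main classifier learn its task while an adversary that tries to predict a protected demographic attribute pushes the shared encoder to hide that attribute.

        import torch
        import torch.nn as nn

        class GradReverse(torch.autograd.Function):
            # Identity on the forward pass; reverses (and scales) gradients on backward.
            @staticmethod
            def forward(ctx, x, lambd):
                ctx.lambd = lambd
                return x.view_as(x)

            @staticmethod
            def backward(ctx, grad_output):
                return -ctx.lambd * grad_output, None

        class AdversarialTextClassifier(nn.Module):
            # Hypothetical sizes; the encoder stands in for any text encoder.
            def __init__(self, vocab_size=10000, emb_dim=128, hidden=128,
                         num_labels=2, num_protected=4, lambd=1.0):
                super().__init__()
                self.lambd = lambd
                self.embed = nn.EmbeddingBag(vocab_size, emb_dim)    # simple bag-of-words encoder
                self.encoder = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
                self.task_head = nn.Linear(hidden, num_labels)       # main task, e.g. sentiment
                self.adv_head = nn.Linear(hidden, num_protected)     # adversary, e.g. author demographic

            def forward(self, token_ids, offsets):
                h = self.encoder(self.embed(token_ids, offsets))
                task_logits = self.task_head(h)
                # The adversary is trained to recover the protected attribute, but the
                # reversed gradient drives the encoder to remove that information.
                adv_logits = self.adv_head(GradReverse.apply(h, self.lambd))
                return task_logits, adv_logits

    Training would minimise the sum of the task and adversary cross-entropy losses; the gradient reversal makes the encoder maximise the adversary's loss, which is one common way to trade off task accuracy against fairness and privacy.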
  • Item
    Compositional morphology through deep learning
    Vylomova, Ekaterina (2018)
    Most human languages have sophisticated morphological systems. In order to build successful models of language processing, we need to focus on morphology, the internal structure of words. In this thesis, we study two morphological processes: inflection (word change rules, e.g. run -- runs) and derivation (word formation rules, e.g. run -- runner). We first evaluate how well contemporary models trained under the distributional hypothesis, which states that a word's meaning can be expressed by the contexts in which it appears, capture these two types of morphology. Our study reveals that inflections are predicted with high accuracy, whereas derivations are more challenging due to the irregularity of meaning change. We then demonstrate that supplying the model with character-level information improves predictions and makes more efficient use of language resources, especially in morphologically rich languages. We then ask which word properties (such as gender, case and number) can be predicted entirely from a word's sentential context, and to what extent. To this end, we introduce a novel task of contextual inflection prediction. Our experiments on predicting morphological features and the corresponding word form from sentential context show that the task is challenging, and that performance drops significantly as morphological complexity increases. We find that some morphological categories (e.g., verbal tense) are inherent and typically cannot be predicted from context, while others (e.g., adjective number and gender) are contextual and can be inferred from agreement. Compared to morphological inflection tasks, where morphological features are explicitly provided and the system has to predict only the form, accuracy on this task is much lower. Finally, we turn to word formation, i.e. derivation. Our experiments show that derivations are less regular and systematic, and we study how indicative a sentential context is of the type of meaning change. Our results suggest that even though inflections are more productive and regular than derivations, the latter also present cases of high regularity of meaning and form change, but often require extra information, such as etymology, word frequency, and more fine-grained annotation, in order to be predicted at high accuracy.
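    As a rough sketch of how character-level information can be supplied to such models, the hypothetical PyTorch snippet below (module name and dimensions are invented for illustration, not taken from the thesis) represents a word by concatenating its word embedding with the final states of a character-level BiLSTM, one common way to expose affixes to the model in morphologically rich languages.

        import torch
        import torch.nn as nn

        class WordCharEncoder(nn.Module):
            # Word embedding plus a character-level BiLSTM summary of the same word.
            def __init__(self, vocab_size=20000, n_chars=100, word_dim=100,
                         char_dim=32, char_hidden=50):
                super().__init__()
                self.word_emb = nn.Embedding(vocab_size, word_dim)
                self.char_emb = nn.Embedding(n_chars, char_dim)
                self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                         bidirectional=True, batch_first=True)

            def forward(self, word_ids, char_ids):
                # word_ids: (batch,); char_ids: (batch, max_word_len) of character indices
                w = self.word_emb(word_ids)
                _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
                chars = torch.cat([h_n[0], h_n[1]], dim=-1)    # final forward and backward states
                return torch.cat([w, chars], dim=-1)           # (batch, word_dim + 2 * char_hidden)

    The combined vector could then feed an inflection or derivation predictor; the character channel captures sub-word regularities (e.g. run -> runs, run -> runner) that whole-word embeddings miss.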
  • Item
    Autoregressive generative models and multi-task learning with convolutional neural networks
    Schimbinschi, Florin (2018)
    At a high level, sequence modelling problems are those in which a model aims to predict the next element of a sequence from neighbouring items. Common applications include time-series forecasting, language modelling, machine translation and, more recently, adversarial learning. One main characteristic of such models is the assumption that there is an underlying, learnable structure behind the data generation process, as is the case for language. The models used therefore have to go beyond traditional linear or discrete hidden-state models. Convolutional Neural Networks (CNNs) are the de facto state of the art in computer vision. Conversely, for sequence modelling and multi-task learning (MTL) problems, the most common choice is Recurrent Neural Networks (RNNs). In this thesis I show that causal CNNs can be used successfully and efficiently for a broad range of sequence modelling and multi-task learning problems. This is supported by applying CNNs to two very different domains, which highlight their flexibility and performance: 1) traffic forecasting under highly dynamic road conditions, with non-stationary data, normal granularity (sampling rate) and a high spatial volume of related tasks; 2) learning musical instrument synthesisers, with stationary data of very high granularity (raw waveforms at a high sampling rate), and thus a high temporal volume, plus conditional side information. In the first case, the challenge is to leverage the complex interactions between tasks while keeping the streaming (online) forecasting process tractable and robust to faults and changes (adding or removing tasks). In the second case, the problem is closely related to language modelling, although much more difficult since, unlike words, multiple musical notes can be played at the same time. With the rise of the Internet of Things (IoT) and Big Data, new challenges arise. The four V's of Big Data (Volume, Velocity, Variety and Veracity) are studied in the first part of this thesis, in the context of multi-task learning for spatio-temporal (ST) prediction problems. Traditionally, such problems are addressed with static, non-modular linear models that do not leverage Big Data. I discuss what the four V's imply for multi-task ST problems and show how CNNs can be set up as efficient classifiers for such problems, provided the quantization is properly designed for non-stationary data. While the first part is predominantly data-centric, focused on aspects such as Volume (is it useful?) and Veracity (how to deal with missing data?), the second part of the thesis addresses the Velocity and Variety challenges. I also show that, even for prediction problems set up as regression, causal CNNs are the best-performing model compared to state-of-the-art algorithms such as SVRs and more traditional methods such as ARIMA. I introduce TRU-VAR (Topologically Regularized Universal Vector AutoRegression), a robust and versatile real-time multi-task forecasting framework that leverages domain-specific knowledge (task topology), Variety (task diversity) and Velocity (online training). Finally, the last part of this thesis focuses on generative CNN models. The main contribution is the SynthNet architecture, the first capable of learning musical instrument synthesisers end-to-end. The architecture is derived by following a parsimonious approach (reducing complexity) and via an in-depth analysis of the learned representations of the baseline architectures. I show that the 2D projections of each layer's Gram activations can correspond to resonating frequencies, which give each musical instrument its timbre. SynthNet trains much faster, and its generation accuracy is much higher than that of the baselines; the generated waveforms are almost identical to the ground truth. This has implications in other domains where the goal is to generate data with properties similar to those of the data generation process (e.g. adversarial examples). In summary, this thesis makes contributions to multi-task spatio-temporal time-series problems with causal CNNs (set up as both classification and regression) and to generative CNN models. The achievements of this thesis are supported by publications which contain an extensive set of experiments and theoretical foundations.
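    The causal convolution idea at the heart of this work can be sketched briefly. The snippet below is a minimal, hypothetical PyTorch example (it is not SynthNet or TRU-VAR): inputs are left-padded so that each output at time t depends only on inputs at times <= t, and dilations grow the receptive field exponentially with depth.

        import torch
        import torch.nn as nn

        class CausalConv1d(nn.Module):
            # 1D convolution that never looks at future timesteps.
            def __init__(self, in_ch, out_ch, kernel_size=2, dilation=1):
                super().__init__()
                self.pad = (kernel_size - 1) * dilation      # pad on the left only
                self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

            def forward(self, x):                            # x: (batch, channels, time)
                return self.conv(nn.functional.pad(x, (self.pad, 0)))

        class CausalStack(nn.Module):
            # Stack of dilated causal convolutions for autoregressive prediction.
            def __init__(self, channels=32, layers=6):
                super().__init__()
                self.blocks = nn.ModuleList(
                    CausalConv1d(channels, channels, dilation=2 ** i) for i in range(layers))
                self.out = nn.Conv1d(channels, channels, 1)

            def forward(self, x):
                for block in self.blocks:
                    x = torch.relu(block(x)) + x             # residual connection
                return self.out(x)                           # one next-step prediction per timestep

    With kernel size 2 and six doubling dilations the receptive field spans 64 timesteps, yet the whole sequence is processed in parallel, which is the main efficiency argument for causal CNNs over RNNs.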
  • Item
    Natural language processing for resource-poor languages
    Duong, Long (2017)
    Natural language processing (NLP) aims, broadly speaking, to teach computers to understand human language. This is hard, as the computer must comprehend many facets of language, such as semantics, syntax, pragmatics and phonology, which are difficult to characterize formally, let alone encode as computer instructions. It is even harder for so-called low-resource languages, where annotated resources are very limited. There are approximately 7,000 languages in the world, but only a small fraction of these (20 languages) are considered high-resource languages. Low-resource languages are in dire need of tools and resources to overcome the resource barrier, so that advances in NLP can deliver more widespread benefits. Despite the lack of annotated data, there are unannotated resources which might benefit low-resource languages, including parallel data, bilingual lexical resources and clues from related languages. However, how to effectively incorporate these resources to improve the performance of low-resource NLP is an open research question, and the target of this thesis. Of these 7,000 languages, half do not have a writing system, and many are falling out of use. It is estimated that by the end of this century half of the world's languages will be extinct, so it is necessary to extend current NLP techniques to unwritten languages in order to process and document them before they are gone forever. Transfer learning provides an important opportunity for low-resource NLP, whereby annotation is transferred from a resource-rich source language to a resource-poor target language. In this thesis, we successfully apply transfer learning to many low-resource NLP tasks in both semi-supervised and unsupervised settings. We show that only a small amount of annotated text in the target language is sufficient to achieve a large performance improvement by incorporating a resource-rich source language into the model. We report successful applications to low-resource part-of-speech tagging and dependency parsing, and observe improvement in both cascade training, where the model is trained in sequential order, and joint training. Where no annotated data is available, we instead propose unsupervised transfer learning techniques that take advantage of crosslingual word embeddings. We propose crosslingual syntactic word embeddings, where words in both the source and target languages are mapped to a shared low-dimensional space based on syntactic context, without using any additional resources such as parallel text, and observe consistent improvements when using these embeddings for unsupervised dependency parsing. Where bilingual resources such as bilingual dictionaries are available, we perform better lexical transfer learning via crosslingual word embeddings, achieving competitive results on both unsupervised crosslingual document classification and dependency parsing. In the extremely low-resource scenario of unwritten languages, we experiment with a neural attentional model that learns directly from speech in the unwritten language and translations in a higher-resource language; our preliminary experiments demonstrate the feasibility of the task. Although we focus on part-of-speech tagging, dependency parsing and unwritten language processing, the proposed methods are general and can be extended to other low-resource NLP tasks. We have considered different levels of resource requirements, including semi-supervised learning, unsupervised learning and our preliminary attempt at unwritten language processing. We conclude that (1) processing low-resource languages is hard but can be made possible using transfer learning, (2) even for low-resource languages there are complementary resources that can compensate for the lack of annotated data, and (3) unwritten languages are very common and should be included in any low-resource natural language processing system, bringing many new challenges for effective modelling.
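    One ingredient mentioned above, mapping source- and target-language word embeddings into a shared space with the help of a small bilingual dictionary, can be sketched in a few lines. The snippet below is a generic orthogonal Procrustes alignment on toy data, offered only as an illustration of the idea rather than the specific method used in the thesis.

        import numpy as np

        def align_embeddings(src_vecs, tgt_vecs):
            # Orthogonal Procrustes: find the rotation W minimising ||src @ W - tgt||,
            # given embeddings of translation pairs from a small bilingual dictionary.
            u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
            return u @ vt

        # Toy example: 5 dictionary pairs with 4-dimensional embeddings.
        rng = np.random.default_rng(0)
        src = rng.normal(size=(5, 4))                  # source-language vectors for the 5 dictionary words
        true_rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
        tgt = src @ true_rotation                      # pretend the target space is a rotated copy

        W = align_embeddings(src, tgt)
        print(np.allclose(src @ W, tgt, atol=1e-6))    # True: source words now live in the target space

    Once the two spaces are aligned, nearest-neighbour search in the shared space supports lexical transfer, for example projecting source-language vectors onto target-language words for tasks such as crosslingual document classification.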