Computing and Information Systems - Theses

Search Results

Now showing 1 - 10 of 17
  • Item
    A multi-faceted approach to document quality assessment
    Shen, Aili ( 2020)
    Document quality assessment, due to its complexity and subjectivity, requires considering information from multiple sources and aspects to capture quality indicators. Grammaticality, readability, stylistics, structure, correctness, and expertise depth reflect the quality of documents from different aspects, with varying importance across different domains. Automatic quality assessment has obvious benefits in terms of time saving and tractability in contexts where the volume of documents is large. In the case of dynamic documents (possibly with multiple authors), such as Wikipedia articles, it is particularly pertinent, as any edit potentially has implications for the quality label of that document. In this thesis, we focus on improving the performance of document quality assessment systems and on measuring the uncertainty of their predictions. This thesis addresses four research questions: (1) How can we capture visual features not present in the document text, such as images and visual layout, to enhance representations learned from text content? (2) How can we make use of hand-crafted features widely adopted in traditional machine learning approaches in the context of neural networks, to generate a more accurate document quality assessment system? (3) How can we model the inherent subjectivity of quality assessment in evaluating the performance of quality assessment systems? and (4) Can a quality assessment system detect whether there are intruder sentences in documents and identify the span of any such intruder sentences, given that they interrupt the coherence of documents, thereby lowering their quality? To address the first research question, we propose to use Inception V3 (Szegedy et al., 2016), a widely used visual model in computer vision, to capture visual features from visual renderings of documents, based on the observation that visual renderings capture features, such as images and layout, that are not present in the document text. Inception V3 compares favourably to text-based models over the Wikipedia and academic paper reviewing datasets. We further propose a joint model to predict document quality by combining visual and textual features. We observe further improvements over both the Wikipedia and academic paper reviewing datasets, indicating complementarity between visual and textual features, and the general applicability of our proposed method. Next, we propose two methods to enhance the capacity of neural models in predicting the quality of documents by utilising hand-crafted features. In the first method, we concatenate hand-crafted features with high-level representations learned by neural models, on the assumption that the learned features may not capture all the information carried by these hand-crafted features. The second method, by contrast, utilises hand-crafted features to guide neural model learning by explicitly attending to feature indicators when learning the relationship between the input and target variables, rather than simply concatenating them. Experimental results demonstrate the superiority of our proposed methods over baselines. To imitate people's disagreement over the inherently subjective task of document quality assessment, we propose to measure the uncertainty in document quality predictions. We investigate two methods: Gaussian processes (GPs) (Rasmussen and Williams, 2006) and random forests (RFs) (Breiman, 2001), which provide not only a prediction of the document quality but also the uncertainty over their predictions.
We also propose an asymmetric cost that takes prediction uncertainty into account, and use it to measure the performance of the two methods in scenarios where decision-making based on model predictions can incur different costs. Lastly, we propose a new task of detecting whether there is an intruder sentence in a document, generated by replacing an original sentence with a similar sentence from a second document. Existing datasets in coherence detection are not suitable for our task, as they are either too small to train current data-hungry models on or do not specify the span of the incoherent text. To benchmark model performance on this task, we construct a large-scale dataset consisting of documents from English Wikipedia and CNN news articles. Experimental results show that pre-trained language models which incorporate larger document contexts in pretraining perform remarkably well in-domain, but suffer a substantial drop cross-domain. In follow-up analysis based on human annotations, we observe substantial divergences from human intuitions, pointing to limitations in the models' ability to capture document coherence. Further results over a linguistic probe dataset show that pre-trained models fail to identify some linguistic characteristics that affect document coherence, suggesting room for improvement before they can truly capture document coherence, and motivating the construction of a dataset with intruder text at the intra-sentential level.
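As an illustration of the uncertainty-aware evaluation idea described above, the following sketch shows how a random forest's per-tree spread can serve as prediction uncertainty and feed into an asymmetric cost. This is not the thesis implementation: the synthetic data, the forest-based uncertainty estimate, and the cost weights are illustrative assumptions.

```python
# Minimal sketch: random-forest quality prediction with an uncertainty
# estimate, scored by a placeholder asymmetric cost.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.uniform(0, 1, size=200)
X_test, y_test = rng.normal(size=(50, 10)), rng.uniform(0, 1, size=50)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Per-tree predictions give both a point estimate and a spread ("uncertainty").
per_tree = np.stack([t.predict(X_test) for t in rf.estimators_])  # (n_trees, n_docs)
mean_pred, uncertainty = per_tree.mean(axis=0), per_tree.std(axis=0)

def asymmetric_cost(y_true, y_pred, unc, over_w=2.0, under_w=0.5):
    """Charge over-estimation of quality more heavily than under-estimation,
    discounting errors on predictions the model itself flags as uncertain.
    The weights and the discount are illustrative, not the thesis formulation."""
    err = y_pred - y_true
    w = np.where(err > 0, over_w, under_w)
    return np.mean(w * np.abs(err) / (1.0 + unc))

print("asymmetric cost:", asymmetric_cost(y_test, mean_pred, uncertainty))
```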
  • Item
    Towards Robust Representation of Natural Language Processing
    Li, Yitong ( 2019)
    There are many challenges in building robust natural language applications. Machine learning-based methods require large volumes of annotated text data, and variation in text can lead to problems, namely: (1) Language is highly variable and can be expressed with many variations, such as lexical and syntactic variation. Robust models should be able to handle these variations. (2) A text corpus is heterogeneous, often making language systems domain-brittle. Solutions for domain adaptation and for training with corpora comprised of multiple domains are required for language applications in the real world. (3) Many language applications tend to be biased towards the demographic of the authors of the documents the system is trained on, and lack model fairness. Demographic bias also causes privacy issues when a model is made available to others. In this thesis, I aim to build robust natural language models to tackle these problems, focusing on deep learning approaches which have shown great success in language processing via representation learning. I pose three basic research questions: how to learn representations that are robust to language variation, robust to domain variation, and robust to demographic variables. Each of these research questions is tackled using different approaches, including data augmentation, adversarial learning, and variational inference. For learning representations robust to language variation, I study lexical variation and syntactic variation. To be specific, a regularisation method is proposed to tackle lexical variation, and a data augmentation method is proposed to build robust models, using a range of language generation methods from both linguistic and machine learning perspectives. For domain robustness, I focus on multi-domain learning and investigate domain supervised and unsupervised learning, where domain labels may or may not be available. Two types of models are proposed, via adversarial learning and latent domain gating, to build robust models for heterogeneous text. For robustness to demographics, I show that demographic bias in the training corpus leads to model fairness problems with respect to the demographic of the authors, as well as privacy issues under inference attacks. Adversarial learning is adopted to mitigate bias in representation learning, to improve model fairness and privacy preservation. To demonstrate the proposed approaches and evaluate their generalisation and robustness, both in-domain and out-of-domain experiments are conducted on two classes of language tasks: text classification and part-of-speech tagging. For multi-domain learning, multi-domain language identification and multi-domain sentiment classification are conducted, and I simulate domain supervised learning and domain unsupervised learning to evaluate domain robustness. I evaluate model fairness with different demographic attributes and apply inference attacks to test model privacy. The experiments show the advantages and the robustness of the proposed methods. Finally, I discuss the relations between the different forms of robustness, including their commonalities and differences. The limitations of this thesis are discussed in detail, including potential methods to address these shortcomings in future work, and potential opportunities to generalise the proposed methods to other language tasks.
Above all, these methods of learning robust representations can contribute towards progress in natural language processing.
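The adversarial learning ingredient mentioned above can be pictured with a gradient reversal layer, a standard construction for discouraging a representation from encoding a protected attribute. The sketch below is a minimal PyTorch illustration under assumed layer sizes and random data, not the thesis code.

```python
# Minimal sketch of adversarial representation learning via gradient reversal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient flowing back into the encoder.
        return -ctx.lambd * grad_output, None

encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU())
task_head = nn.Linear(128, 2)   # main task, e.g. sentiment
attr_head = nn.Linear(128, 2)   # protected attribute, e.g. a demographic variable

x = torch.randn(32, 300)                    # placeholder text representations
y_task = torch.randint(0, 2, (32,))
y_attr = torch.randint(0, 2, (32,))

h = encoder(x)
loss = nn.functional.cross_entropy(task_head(h), y_task) + \
       nn.functional.cross_entropy(attr_head(GradReverse.apply(h, 1.0)), y_attr)
loss.backward()  # encoder gradients push *against* predicting the attribute
```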
  • Item
    Memory-augmented neural networks for better discourse understanding
    Liu, Fei ( 2019)
    Discourse understanding, due to the multi-sentence nature of discourse, requires consideration of larger contexts, capturing long-range dependencies, and modelling the interactions of entities. While conventional models are unable to keep information stably over long timescales, memory-augmented models are better capable of storing and accessing knowledge, making them well-suited for discourse. In this thesis, we introduce a number of methods for improving memory-augmented models to better understand discourse, validating the utility of memory and establishing a firm base for future studies to build upon.
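    A single read step of a memory-augmented model can be sketched as attention over memory slots (for example, one slot per entity or sentence). The toy example below uses assumed dimensions and random inputs purely to show the basic operation; it is not taken from the thesis.

```python
# Minimal sketch of a memory read: attend over slots with a query vector.
import torch
import torch.nn.functional as F

num_slots, dim = 8, 64
memory = torch.randn(num_slots, dim)   # one slot per entity/sentence
query = torch.randn(dim)               # current reading state

attn = F.softmax(memory @ query / dim ** 0.5, dim=0)  # relevance of each slot
read = attn @ memory                                   # weighted memory summary
print(read.shape)  # torch.Size([64])
```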
  • Item
    On the use of prior and external knowledge in neural sequence models
    Hoang, Cong Duy Vu ( 2019)
    Neural sequence models have recently achieved great success across various natural language processing tasks. In practice, neural sequence models require massive amounts of annotated training data to reach their desired performance; however, such data will not always be available across the languages, domains, or tasks at hand. Prior and external knowledge provides additional contextual information, potentially improving modelling performance as well as compensating for the lack of large training data, particularly in low-resource situations. In this thesis, we investigate the usefulness of utilising prior and external knowledge for improving neural sequence models. We propose the use of various kinds of prior and external knowledge and present different approaches for integrating them into both the training and inference phases of neural sequence models. The main contributions of this thesis are summarised in two major parts. The first part is on Training and Modelling for neural sequence models. In this part, we investigate different situations (particularly low-resource settings) in which prior and external knowledge, such as side information, linguistic factors, and monolingual data, is shown to have great benefits for improving the performance of neural sequence models. In addition, we introduce a new means of incorporating prior and external knowledge based on the moment matching framework. This framework exploits prior and external knowledge as global features of generated sequences in neural sequence models in order to improve the overall quality of the desired output sequence. The second part is about Decoding of neural sequence models, in which we propose a novel decoding framework with relaxed continuous optimisation in order to address one of the drawbacks of existing approximate decoding methods, namely the limited ability to incorporate global factors due to intractable search. We hope that this thesis, constituted by the two major parts above, will shed light on the use of prior and external knowledge in neural sequence models, both in their training and decoding phases.
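    To make the moment matching idea more concrete, the sketch below shows one generic way such a penalty could be formed: an L2 distance between global feature statistics of generated and reference sequences, added to the training loss. The feature extraction and the exact penalty form here are illustrative assumptions, not the thesis formulation.

```python
# Generic illustration of a moment matching penalty over global sequence features.
import torch

def moment_matching_penalty(gen_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
    """L2 distance between mean global feature vectors (first moments).

    gen_feats, ref_feats: (batch, num_features) features extracted from
    generated and reference sequences (e.g. length or lexical indicators).
    """
    return torch.sum((gen_feats.mean(dim=0) - ref_feats.mean(dim=0)) ** 2)

penalty = moment_matching_penalty(torch.rand(16, 5), torch.rand(16, 5))
print(penalty)
```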
  • Item
    Compositional morphology through deep learning
    Vylomova, Ekaterina ( 2018)
    Most human languages have sophisticated morphological systems. In order to build successful models of language processing, we need to focus on morphology, the internal structure of words. In this thesis, we study two morphological processes: inflection (word change rules, e.g. run -- runs) and derivation (word formation rules, e.g. run -- runner). We first evaluate the ability of contemporary models that are trained using the distributional hypothesis, which states that a word's meaning can be expressed by the context in which it appears, to capture these types of morphology. Our study reveals that inflections are predicted at high accuracy whereas derivations are more challenging due to the irregularity of meaning change. We then demonstrate that supplying the model with character-level information improves predictions and makes more efficient use of language resources, especially in morphologically rich languages. We then address the question of which information about word properties (such as gender, case, and number), and to what extent, can be predicted entirely from a word's sentential context. To this end, we introduce a novel task of contextual inflection prediction. Our experiments on prediction of morphological features and a corresponding word form from sentential context show that the task is challenging, and as morphological complexity increases, performance significantly drops. We find that some morphological categories (e.g., verbal tense) are inherent and typically cannot be predicted from context, while others (e.g., adjective number and gender) are contextual and inferred from agreement. Compared to morphological inflection tasks, where morphological features are explicitly provided and the system has to predict only the form, accuracy on this task is much lower. Finally, we turn to word formation, namely derivation. Experiments with derivations show that they are less regular and systematic. We study how much a sentential context is indicative of a meaning change type. Our results suggest that even though inflections are more productive and regular than derivations, the latter also present cases of high regularity of meaning and form change, but often require extra information such as etymology, word frequency, and more fine-grained annotation in order to be predicted at high accuracy.
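    The distributional prediction of inflected forms can be illustrated with the classic vector-offset construction. The toy three-dimensional embeddings below are made up purely to show the mechanics; real experiments use large pretrained vectors.

```python
# Toy illustration: predict an inflected form via a vector offset and nearest-neighbour lookup.
import numpy as np

emb = {                      # made-up embeddings for illustration only
    "run":    np.array([0.9, 0.1, 0.0]),
    "runs":   np.array([0.9, 0.1, 0.5]),
    "runner": np.array([0.3, 0.8, 0.1]),
    "walk":   np.array([0.8, 0.2, 0.0]),
    "walks":  np.array([0.8, 0.2, 0.5]),
}

def nearest(vec, exclude):
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], vec))

# v(runs) is predicted as v(run) + (v(walks) - v(walk)).
predicted = emb["run"] + (emb["walks"] - emb["walk"])
print(nearest(predicted, exclude={"run", "walk", "walks"}))  # -> "runs"
```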
  • Item
    Analysing the interplay of location, language and links utilising geotagged Twitter content
    Rahimi, Afshin ( 2018)
    Language use and interactions on social media are geographically biased. In this work we utilise this bias in predictive models of user geolocation and lexical dialectology. User geolocation is an important component of applications such as personalised search and recommendation systems. We propose text-based and network-based geolocation models, and compare them over benchmark datasets, yielding state-of-the-art performance. We also propose hybrid and joint text and network geolocation models that improve upon text-only or network-only models, and show that the joint models are able to achieve reasonable performance in minimal supervision scenarios, as often happens in real-world datasets. Finally, we also propose the use of continuous representations of location, which enables regression modelling of geolocation and lexical dialectology. We show that our proposed data-driven lexical dialectology model provides qualitative insights into the study of geographical lexical variation.
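    The regression view of geolocation over continuous coordinates can be sketched with standard tools, as below. The three inline "tweets" and their coordinates are invented placeholders; the thesis works with large geotagged Twitter corpora and more sophisticated models.

```python
# Minimal sketch: text-based geolocation as regression over (lat, lon).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

tweets = ["footy at the mcg tonight", "cable car up to twin peaks", "deep dish pizza downtown"]
coords = [(-37.81, 144.96), (37.77, -122.42), (41.88, -87.63)]  # (lat, lon), illustrative

vec = TfidfVectorizer()
X = vec.fit_transform(tweets)
model = MultiOutputRegressor(Ridge()).fit(X, coords)

print(model.predict(vec.transform(["watching footy in melbourne"])))
```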
  • Item
    Crowdsourcing lexical semantic judgements from bilingual dictionary users
    Fothergill, Richard James ( 2017)
    Words can take on many meanings, and collecting and identifying example usages representative of the full variety of meanings words can take is a bottleneck to the study of lexical semantics using statistical approaches. To perform supervised word sense disambiguation (WSD), or to evaluate knowledge-based methods, a corpus of texts annotated with senses from a dictionary may be constructed by paid experts. However, the cost usually prohibits more than a small sample of words and senses being represented in the corpus. Crowdsourcing methods promise to acquire data more cheaply, albeit with a greater challenge for quality control. Most crowdsourcing to date has incentivised participation in the form of a payment or by gamification of the resource construction task. However, with paid crowdsourcing the cost of human labour scales linearly with the output size, and while game playing volunteers may be free, gamification studies must compete with a multi-billion dollar games industry for players. In this thesis we develop and evaluate resources for computational semantics, working towards a crowdsourcing method that extracts information from naturally occurring human activities. A number of software products exist for glossing Japanese text with entries from a dictionary for English speaking students. However, the most popular ones have a tendency to either present an overwhelming amount of information containing every sense of every word or else hide too much information and risk removing senses with particular relevance to a specific text. By offering a glossing application with interactive features for exploring word senses, we create an opportunity to crowdsource human judgements about word senses and record human interaction with semantic NLP.
  • Item
    Supervised algorithms for complex relation extraction
    Khirbat, Gitansh ( 2017)
    Binary relation extraction is an essential component of information extraction systems, wherein the aim is to extract meaningful relations that might exist between a pair of entities within a sentence. Binary relation extraction systems have witnessed significant improvement over the past three decades, ranging from rule-based systems to statistical natural language techniques including supervised, semi-supervised and unsupervised machine learning approaches. Modern question answering and summarization systems have motivated the need for extracting complex relations, wherein the number of related entities is more than two. Complex relation extraction (CRE) systems are highly domain-specific and often rely on traditional binary relation extraction techniques employed in a pipeline fashion, and are thus susceptible to processing-induced error propagation. In this thesis, we investigate and develop approaches to extract complex relations directly from natural language text. In particular, we deviate from the traditional disintegration of complex relations into constituent binary relations and propose the use of the shortest dependency parse spanning the n related entities as an alternative that facilitates direct CRE. We investigate this approach through a comprehensive study of supervised learning algorithms, with a special focus on training support vector machines, convolutional neural networks and deep learning ensemble algorithms. Research in the domain of CRE is stymied by a paucity of annotated data. To facilitate future exploration, we create two new datasets to evaluate our proposed CRE approaches on a pilot biographical fact extraction task. An evaluation of results on new and standard datasets concludes that using the shortest-path dependency parse in a supervised setting enables direct CRE with improved accuracy, beating current state-of-the-art CRE systems. We further show the application of CRE to achieve state-of-the-art performance for directly extracting events without the need to disintegrate them into event trigger and event argument extraction processes.
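    The dependency structure spanning multiple entities can be extracted with off-the-shelf tools. The sketch below, assuming spaCy with the en_core_web_sm model and networkx are installed and using an invented example sentence with hand-picked entity token positions, illustrates the kind of shortest-path structure the thesis proposes as input for direct CRE.

```python
# Minimal sketch: shortest dependency path(s) connecting several entity tokens.
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Gitansh studied information extraction at the University of Melbourne in 2017.")

# Build an undirected graph over tokens using dependency arcs.
g = nx.Graph([(tok.i, child.i) for tok in doc for child in tok.children])

entities = [0, 8, 10]  # illustrative token indices of Gitansh, Melbourne, 2017
path_tokens = set()
for i in entities[1:]:
    path_tokens.update(nx.shortest_path(g, source=entities[0], target=i))

print([doc[i].text for i in sorted(path_tokens)])
```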
  • Item
    Coreference resolution for biomedical pathway data
    Choi, Miji Jooyoung ( 2017)
    The study of biological pathways is a major activity in the life sciences. Biological pathways provide understanding and interpretation of many different kinds of biological mechanisms, such as metabolism, signalling between cells, regulation of gene expression, and production of cells. If there are defects in a pathway, the result may be a disease. Thus, biological pathways are used to support diagnosis of disease, more effective drug prescription, or personalised treatments. Even though there are many pathway resources providing useful information discovered through manual effort, a great deal of relevant information concerning such pathways is scattered through the vast biomedical literature. With the growth in the volume of the biomedical literature, many natural language processing methods for automatic information extraction have been studied, but a variety of challenges still exist, such as complex or hidden representations due to the use of coreference expressions in texts. Linguistic expressions such as it, they, or the gene are frequently used by authors to avoid repeating the names of entities or repeating complex descriptions that have previously been introduced in the same text. This thesis addresses three research goals: (1) examining whether an existing coreference resolution approach in the general domain can be adapted to the biomedical domain; (2) investigating a heuristic strategy for coreference resolution in the biomedical literature; and (3) examining how coreference resolution can improve biological pathway data from the perspectives of information extraction, and of evaluation of existing pathway resources. In this thesis, we propose a new categorical framework that provides a detailed analysis of the performance of coreference resolution systems, based on analysis of syntactic and semantic characteristics of coreference relations in the biomedical domain. The framework can not only identify weaknesses of existing approaches, but also provide insights into strategies for further improvement. We propose an approach to biomedical domain-specific coreference resolution that combines a set of syntactically and semantically motivated rules in terms of coreference type. Finally, we demonstrate through case studies that coreference resolution is a valuable process for pathway information discovery. Our results show that an approach incorporating a coreference resolution process significantly improves information extraction performance.
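    One kind of syntactically motivated heuristic, resolving a pronoun to the nearest preceding mention that agrees in grammatical number, can be sketched as follows. The rule, the mention list, and the number lookup are simplified illustrations rather than the thesis's actual rule set.

```python
# Toy illustration of a number-agreement heuristic for pronoun resolution.
PLURAL_PRONOUNS = {"they", "them", "these", "those"}
SINGULAR_PRONOUNS = {"it", "this", "that"}

def resolve(pronoun, preceding_mentions):
    """preceding_mentions: list of (text, is_plural) pairs, nearest mention last."""
    want_plural = pronoun.lower() in PLURAL_PRONOUNS
    for text, is_plural in reversed(preceding_mentions):
        if is_plural == want_plural:
            return text
    return None

mentions = [("the TP53 gene", False), ("cyclin-dependent kinases", True)]
print(resolve("it", mentions))    # -> "the TP53 gene"
print(resolve("they", mentions))  # -> "cyclin-dependent kinases"
```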
  • Item
    Unsupervised all-words sense distribution learning
    Bennett, Andrew ( 2016)
    There has recently been significant interest in unsupervised methods for learning word sense distributions, or most frequent sense information, in particular for applications where sense distinctions are needed. In addition to their direct application to word sense disambiguation (WSD), particularly where domain adaptation is required, these methods have successfully been applied to diverse problems such as novel sense detection or lexical simplification. Furthermore, they could be used to supplement or replace existing sources of sense frequencies, such as SemCor, which have many significant flaws. However, a major gap in the past work on sense distribution learning is that it has never been optimised for large-scale application to the entire vocabulary of a language, as would be required to replace sense frequency resources such as SemCor. In this thesis, we develop an unsupervised method for all-words sense distribution learning, which is suitable for language-wide application. We first optimise and extend HDP-WSI, an existing state-of-the-art sense distribution learning method based on HDP topic modelling. This is mostly achieved by replacing HDP with the more efficient HCA topic modelling algorithm in order to create HCA-WSI, which is over an order of magnitude faster than HDP-WSI and more robust. We then apply HCA-WSI across the vocabularies of several languages to create LexSemTm, which is a multilingual sense frequency resource of unprecedented size. Of note, LexSemTm contains sense frequencies for approximately 88% of polysemous lemmas in Princeton WordNet, compared to only 39% for SemCor, and the quality of data in each is shown to be roughly equivalent. Finally, we extend our sense distribution learning methodology to multiword expressions (MWEs), which to the best of our knowledge is a novel task (as is applying any kind of general-purpose WSD method to MWEs). We demonstrate that sense distribution learning for MWEs is comparable to that for simplex lemmas in all important respects, and we expand LexSemTm with MWE sense frequency data.
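    The topic-to-sense mapping step underlying this style of sense distribution learning can be roughly illustrated as follows. This assumes the NLTK WordNet data is available; the topics, their weights, and the gloss-overlap scoring are simplified assumptions for illustration, not the HDP-WSI/HCA-WSI implementation.

```python
# Rough sketch: map induced topics to WordNet senses via gloss overlap,
# then weight by topic prevalence to obtain a sense distribution.
from collections import Counter
from nltk.corpus import wordnet as wn

lemma = "bank"
topics = {  # top words and prevalence of topics induced from usages of "bank" (illustrative)
    ("money", "deposit", "loan", "account"): 0.7,
    ("river", "water", "slope", "land"): 0.3,
}

sense_scores = Counter()
for top_words, weight in topics.items():
    for synset in wn.synsets(lemma):
        overlap = len(set(top_words) & set(synset.definition().lower().split()))
        sense_scores[synset.name()] += weight * overlap

total = sum(sense_scores.values()) or 1.0
sense_dist = {s: v / total for s, v in sense_scores.items()}
print(sorted(sense_dist.items(), key=lambda kv: -kv[1])[:3])
```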