Computing and Information Systems - Theses



Now showing 1 - 6 of 6
  • Item
    Detection and Analysis of Climate Change Scepticism
    Bhatia, Shraey (2024-01)
    Climate change, predominantly driven by human activities, poses a threat through effects such as rising sea levels, melting ice caps, extreme droughts, and species extinction. The IPCC's 5th and 6th reports highlight the urgency of limiting global warming, with the latter projecting a concerning 1.5°C rise by 2040. Despite scientific consensus, the digital sphere is inundated with content that fuels scepticism, often sponsored by specific lobby groups. These articles, grouped under the umbrella term climate change scepticism (CCS), weave together misinformation, propaganda, hoaxes and sensationalism, undermining collective climate action. This thesis aims to offer strategies to address this misleading narrative. We probe CCS along four dimensions: (1) understanding the underlying themes in the data, (2) detecting CCS articles, (3) understanding and detecting the framing and neutralization tactics used to construct CCS narratives, and (4) fact-checking the veracity of claims and elucidating the reasons for potential inaccuracies. A notable challenge across these tasks is the limited availability of data, and throughout the thesis we leverage advances in natural language processing (NLP) to mitigate this. Pre-trained language models (PLMs) and their scaled counterparts, large language models (LLMs), have revolutionized our capacity to comprehend and generate text that mirrors human language. These models, adept at learning real-world knowledge and semantics from extensive datasets, prove extraordinarily effective across a diverse range of language tasks.
Topic models distil document collections into key themes, represented by groups of words or "topics", without the need for human labelling or any a priori notion of the collection's content. In essence, they offer a means of exposing the underlying themes in the documents. Each document typically aligns with one or several themes, but capturing the essence of the collection's context remains a challenge. In this thesis, we introduce methods that enhance the quality of topic outputs to better mirror the context of document collections.
For the detection of CCS articles, no dataset was available in this domain. We bridge this gap by scraping and compiling a dataset of articles known to exhibit climate change scepticism. By extending the training of PLMs on this dataset, we enhance their ability to discern the stylistic and linguistic elements of CCS, which allows the models not only to distinguish between CCS and non-CCS articles but also to highlight misleading spans indicative of scepticism.
To delve deeper into the intricacies of CCS narratives, we analyze their argumentative framing. Drawing on theories of framing and neutralization, we cast the analysis as a multi-task classification problem, propose an annotation task, and collect human judgements. Given that data collection is resource-intensive, we leverage unlabelled data in a semi-supervised setting, achieving substantial performance gains.
Finally, we turn to the task of explanation generation to detail the reasons behind a claim's inaccuracies. Using LLMs in a retrieval-augmented approach, we connect the LLM to an external knowledge source, such as peer-reviewed papers, via a retriever. The retriever fetches pertinent "facts" related to the claim, enabling the LLM to both verify and explain the claim grounded in these facts. LLMs are prone to generating ungrounded information, commonly referred to as "hallucinations". We investigate approaches to detect such inaccuracies, then introduce methods to reduce these hallucinations, and finally employ LLM-based evaluations to assess the quality of the produced content.
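A minimal sketch of the retrieval-augmented verification loop the abstract describes: retrieve passages relevant to a claim, then prompt an LLM to verify and explain the claim grounded in those passages. The evidence corpus, the injected `call_llm` helper, and the prompt wording are illustrative assumptions, not the thesis's actual pipeline.

```python
# Retrieval-augmented claim verification, sketched with TF-IDF retrieval.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

EVIDENCE = [
    "IPCC AR6 projects that warming is likely to reach 1.5 degrees C between 2030 and 2052.",
    "Multiple independent records show global mean sea level rise accelerating since the 1990s.",
    "Attribution studies link the majority of observed warming to anthropogenic emissions.",
]

def retrieve(claim: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank evidence passages by TF-IDF cosine similarity to the claim."""
    vec = TfidfVectorizer().fit(corpus + [claim])
    sims = cosine_similarity(vec.transform([claim]), vec.transform(corpus))[0]
    ranked = sorted(zip(sims, corpus), reverse=True)
    return [passage for _, passage in ranked[:k]]

def verify(claim: str, call_llm) -> str:
    """`call_llm` is a hypothetical text-in/text-out function wrapping any LLM."""
    facts = "\n".join(f"- {p}" for p in retrieve(claim, EVIDENCE))
    prompt = (
        f"Facts:\n{facts}\n\nClaim: {claim}\n"
        "Label the claim SUPPORTED or REFUTED and explain, citing only the facts above."
    )
    return call_llm(prompt)
```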
  • Item
    Lexical Semantics of the Long Tail
    Wada, Takashi (2023-12)
    Natural language data is characterised by a variety of long-tail instances. For instance, whilst there is an abundance of web text for major languages such as English, there is a dearth of data for a great number of minor languages. Furthermore, the corpus data within each language usually consists of a small number of high-frequency words and a plethora of long-tail expressions that are not commonly used in text, such as scientific jargon and multiword expressions. These long-tail instances generally draw little attention from the research community, which largely focuses on a handful of resource-rich languages and on models' overall performance on specific tasks, a measure that is in many cases not heavily influenced by long-tail instances. In this thesis, we aim to shed light on long-tail instances in language and explore NLP models that represent their lexical semantics effectively. In particular, we focus on three types of long-tail instances: extremely low-resource languages, rare words, and multiword expressions.
Firstly, for extremely low-resource languages, we propose a new cross-lingual word embedding model that works well with very limited data, and show its effectiveness on the task of aligning semantically equivalent words between high- and low-resource languages. For evaluation, we conduct experiments involving three endangered languages, namely Yongning Na, Shipibo-Konibo and Griko, and demonstrate that our model performs well on real-world language data.
Secondly, with regard to rare words, we first investigate how well recent embedding models capture lexical semantics in general through lexical substitution, where, given a target word in context, a model is tasked with retrieving its synonymous words. To this end, we propose a new lexical substitution method that effectively makes use of existing embedding models, and show that it performs very well on English and Italian, especially for retrieving low-frequency substitutes. We also reveal two limitations of current embedding models: (1) they are highly affected by morphophonetic and morphosyntactic biases, such as article–noun agreement in English and Italian; and (2) they often represent rare words poorly when the words are segmented into multiple subwords. To address the second limitation, we propose a new method that performs very well in predicting synonyms of rare words, and demonstrate its effectiveness on lexical substitution and simplification.
Lastly, to represent multiword expressions (MWEs) effectively, we propose a new method that paraphrases MWEs with more literal expressions that are easier to understand, e.g. "swan song" with "final performance". Compared to previous approaches that resort to human-crafted resources such as dictionaries, our model is fully unsupervised and relies on monolingual data only, making it applicable to resource-poor languages. For evaluation, we perform experiments in two high-resource languages (English and Portuguese) and one low-resource language (Galician), and demonstrate that our model generates high-quality paraphrases of MWEs in all three languages, and helps pre-trained sentence embedding models encode sentences that contain MWEs by paraphrasing them with literal expressions.
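A minimal sketch of the lexical-substitution setting described above: given a target word in context, rank candidate substitutes with a masked language model. This is a generic fill-mask baseline under assumed inputs (the sentence and the model name are illustrative), not the new substitution method proposed in the thesis.

```python
# Generic masked-LM baseline for in-context lexical substitution.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def substitutes(sentence: str, target: str, top_k: int = 10) -> list[tuple[str, float]]:
    """Mask the target word and return the model's top-k in-context replacements."""
    masked = sentence.replace(target, fill_mask.tokenizer.mask_token, 1)
    preds = fill_mask(masked, top_k=top_k)
    # Drop trivial predictions that simply restore the original target word.
    return [(p["token_str"], p["score"]) for p in preds if p["token_str"] != target]

print(substitutes("The committee reached a unanimous decision.", "decision"))
```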
  • Item
    Generalization Lessons from Biomedical Relation Extraction using Pretrained Transformer Models
    Elangovan, Aparna (2023-12)
    Curating structured knowledge for storage in biomedical knowledge databases requires human experts to annotate relationships, making the maintenance of these databases expensive and difficult to scale to the large quantities of information presented in scientific publications, and it is challenging to ensure that the information is comprehensive and up-to-date. Hence, we investigate the generalization capabilities of state-of-the-art natural language processing (NLP) techniques to automate relation extraction and aid human curation. In NLP, deep learning architectures, in particular pretrained transformer models with millions of parameters, have been dominating leaderboards on public benchmark datasets, with state-of-the-art (SOTA) results usually achieved by fine-tuning pretrained transformer models on the target dataset. In our research, we investigate the generalizability of such SOTA models, i.e. fine-tuned pretrained transformer models, in biomedical relation extraction for real-world applications, where performance expectations need to hold beyond the official test sets. While our experiments focus on the current SOTA models, our findings have broader implications for the generalization of NLP models and their performance evaluation.
We ask the following research questions:
1. How generalizable are fine-tuned pretrained transformer models in biomedical relation extraction?
2. What factors lead to poor generalizability despite high test set performance of fine-tuned pretrained transformer models?
3. How can we improve qualitative aspects of the training data to improve the real-world generalization performance of fine-tuned pretrained transformer models?
The contributions are:
1) We identify a large performance gap, relative to the test set, when a SOTA fine-tuned pretrained transformer model is applied at large scale. This substantial generalization gap has neither been verified nor reported in prior large-scale biomedical relation extraction studies.
2) We identify that high similarity between training and test sets, even with random splits, can result in inflated performance measurements. We suggest stratifying the test set based on its similarity to the training set to provide a more effective interpretation of the results and to separate the memorization and generalization capabilities of a model. Furthermore, we find that fine-tuned pretrained transformer models appear to rely on spurious correlations present in both training and test sets, obtaining inflated test set performance.
3) We find that, for a given quantity of training data, qualitative aspects can boost performance when fine-tuning pretrained transformers. More specifically, incorporating training samples that are quite similar to one another but have different ground truth labels (which we call human-adversarials) in low to moderate proportions can boost generalization performance by up to 20 points for fine-tuned pretrained transformer models such as BERT, BioBERT and RoBERTa. On the other hand, training samples that are quite similar to one another and share the same ground truth label (which we call human-affables) can degrade generalization performance. We thus demonstrate that merely aiming for larger quantities of training data is not sufficient to improve generalization.
4) As a result of findings 1 and 2, we propose to the NLP community that confirming linguistic capabilities as the cause of performance gains, even within the context of the test set, is crucial to generalization, adapting generalization principles from clinical studies. We advocate for effective test sets and evaluation strategies, including adapting concepts such as randomized controlled trials from clinical studies to NLP in order to establish causation, as our experiments demonstrate that a test set constructed using the standard practice of random splits may not be sufficient to measure the generalization capabilities of a model.
Overall, in this thesis we closely examine model generalization and aim to strengthen how machine learning models are evaluated. While we do so in the context of biomedical relation extraction, where generalizability is critical, our findings are applicable to the evaluation of machine learning models across NLP.
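A minimal sketch of the evaluation idea in contribution 2: stratify test examples by their maximum similarity to the training set and report accuracy per stratum, so memorisation-like and generalisation-like behaviour can be read separately. TF-IDF nearest-neighbour similarity and the bin edges are illustrative assumptions, not the thesis's exact protocol.

```python
# Report accuracy bucketed by each test example's similarity to the training set.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def accuracy_by_similarity(train_texts, test_texts, y_true, y_pred,
                           bins=(0.0, 0.3, 0.6, 1.01)):
    vec = TfidfVectorizer().fit(train_texts)
    sims = cosine_similarity(vec.transform(test_texts), vec.transform(train_texts))
    max_sim = sims.max(axis=1)  # nearest-neighbour similarity per test example
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    report = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (max_sim >= lo) & (max_sim < hi)
        if mask.any():
            report[f"[{lo:.1f}, {hi:.1f})"] = float((y_true[mask] == y_pred[mask]).mean())
    return report
```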
  • Item
    Assessing and Improving Fairness in Models of Human Language
    Han, Xudong (2023-06)
    Models of human language often learn and amplify dataset biases, leading to discrimination such as opportunity inequality in job applications. In this thesis, we aim to assess and improve fairness in natural language processing (NLP). Overall, we contribute to the field of fairness in NLP by offering guidance on designing fairness metrics to assess fairness, mitigating bias in text representations and training datasets to improve fairness, and developing tools to benchmark fairness research.
The first key contribution of the thesis is assessment: how to quantify fairness. While several metrics have been proposed to evaluate fairness based on different assumptions about its nature, there is little work on clarifying what those assumptions are, or on guiding the selection of evaluation metrics from a fundamental understanding of fairness. To address this, we propose a generalized aggregation framework that makes explicit the assumptions about fairness underlying a set of existing fairness metrics. Based on these assumptions, we provide recommendations to standardize further research on fairness. We also present two novel metrics to quantitatively measure performance–fairness trade-offs, which are shown to be essential for systematic model selection and comparison.
We then propose to improve fairness by mitigating bias in training datasets, hypothesizing that models trained over debiased datasets should make fairer predictions. To test this, we first adapt long-tail learning methods to dataset bias mitigation and show that our novel balanced training method substantially improves fairness. Together with other existing debiasing methods, the dataset debiasing method is then employed in conjunction with dataset distillation in our proposed framework, resulting in better fairness at reduced training cost.
NLP models are also shown to unintentionally encode protected information even if such attributes are not provided as explicit inputs during training (protected information leakage). We propose novel adversarial learning methods that mitigate leakage by removing protected information from text representations. In particular, we employ an ensemble training algorithm to improve the stability of adversarial learning, and introduce orthogonality constraints between adversaries to increase representational fairness. Not only does the proposed method improve fairness over standard adversarial debiasing, it also improves performance–fairness trade-offs and stability simultaneously. A key assumption made by most debiasing methods is that the protected attributes of training instances are annotated in the dataset, which often does not hold in real-world applications. To address this, we decouple the training of the discriminators from the main task model in adversarial debiasing, and find that a small number of protected-labelled instances is sufficient for our method to achieve results comparable to standard adversarial learning. We further improve on this method with a meta-algorithm that identifies regions of the hidden space that consistently under- or over-perform, deriving a clustering based on protected information leakage and training behaviour. Extensive experiments show that this cluster information can be used with existing supervised debiasing methods to mitigate bias without observing protected labels.
Despite these advances in assessing and improving fairness, reproducing and evaluating the resulting methods can be difficult. The final contribution of this thesis is therefore FairLib, a unified framework for assessing and improving fairness. FairLib is an open-source Python library that can be used to assess benchmark datasets, assess and improve fairness with a wide range of built-in methods, and compare and visualize results. One immediate benefit of FairLib is the ability to systematically evaluate debiasing methods, which have generally only been examined on a few benchmark datasets under narrow distributions. Because dataset bias plays such an important role in fairness, we additionally generalize the dataset debiasing methods to simulate different data conditions within FairLib, conducting a systematic evaluation and establishing benchmarks for debiasing methods.
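A minimal sketch of adversarial debiasing as described above: a main classifier is trained on task labels while an adversary tries to predict the protected attribute from the shared representation, and a gradient-reversal layer pushes the encoder to remove that information. This shows a single adversary only; the ensemble training and orthogonality constraints from the thesis are omitted, and all dimensions are illustrative.

```python
# Single-adversary debiasing with gradient reversal (PyTorch sketch).
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient so the encoder *hurts* the adversary.
        return -ctx.lambd * grad_output, None

class DebiasedClassifier(nn.Module):
    def __init__(self, in_dim=768, hid=256, n_labels=2, n_protected=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        self.task_head = nn.Linear(hid, n_labels)      # main task prediction
        self.adversary = nn.Linear(hid, n_protected)   # predicts the protected attribute

    def forward(self, x):
        h = self.encoder(x)
        return self.task_head(h), self.adversary(GradReverse.apply(h))

model = DebiasedClassifier()
task_logits, adv_logits = model(torch.randn(8, 768))
# Total loss = task loss + adversary loss; reversal makes the encoder maximise
# the adversary's loss while the adversary itself minimises it.
```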
  • Item
    Unsupervised all-words sense distribution learning
    Bennett, Andrew (2016)
    There has recently been significant interest in unsupervised methods for learning word sense distributions, or most frequent sense information, in particular for applications where sense distinctions are needed. In addition to their direct application to word sense disambiguation (WSD), particularly where domain adaptation is required, these methods have been successfully applied to diverse problems such as novel sense detection and lexical simplification. Furthermore, they could be used to supplement or replace existing sources of sense frequencies, such as SemCor, which have many significant flaws. However, a major gap in past work on sense distribution learning is that it has never been optimised for large-scale application to the entire vocabulary of a language, as would be required to replace sense frequency resources such as SemCor. In this thesis, we develop an unsupervised method for all-words sense distribution learning that is suitable for language-wide application. We first optimise and extend HDP-WSI, an existing state-of-the-art sense distribution learning method based on HDP topic modelling. This is achieved chiefly by replacing HDP with the more efficient HCA topic modelling algorithm to create HCA-WSI, which is over an order of magnitude faster than HDP-WSI and more robust. We then apply HCA-WSI across the vocabularies of several languages to create LexSemTm, a multilingual sense frequency resource of unprecedented size. Of note, LexSemTm contains sense frequencies for approximately 88% of polysemous lemmas in Princeton WordNet, compared to only 39% for SemCor, and the quality of the data in each is shown to be roughly equivalent. Finally, we extend our sense distribution learning methodology to multiword expressions (MWEs), which to the best of our knowledge is a novel task (as is applying any kind of general-purpose WSD method to MWEs). We demonstrate that sense distribution learning for MWEs is comparable to that for simplex lemmas in all important respects, and we expand LexSemTm with MWE sense frequency data.
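A minimal sketch of the topic-modelling route to sense distributions described above: induce topics over usages of a target lemma, align each topic to WordNet senses, and aggregate topic prevalence into a sense distribution. LDA stands in for the HDP/HCA models used in the thesis, and the gloss-overlap alignment and smoothing are simplifying assumptions rather than the actual topic-to-sense matching procedure.

```python
# Toy sense distribution learning via topic modelling over a lemma's usages.
from collections import Counter
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def sense_distribution(lemma: str, usages: list[str], n_topics: int = 5) -> dict[str, float]:
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(usages)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    vocab = vec.get_feature_names_out()
    topic_weight = lda.transform(X).sum(axis=0)  # overall prevalence of each topic
    scores = Counter()
    for t, word_dist in enumerate(lda.components_):
        top_words = {vocab[i] for i in word_dist.argsort()[-20:]}
        for synset in wn.synsets(lemma):
            overlap = len(top_words & set(synset.definition().split()))
            scores[synset.name()] += topic_weight[t] * (overlap + 1)  # +1 smoothing
    total = sum(scores.values())
    return {sense: score / total for sense, score in scores.items()}
```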
  • Item
    Structured classification for multilingual natural language processing
    Blunsom, Philip (2007-06)
    This thesis investigates the application of structured sequence classification models to multilingual natural language processing (NLP). Many tasks tackled by NLP can be framed as classification, where we seek to assign a label to a particular piece of text, be it a word, sentence or document. Often, however, the labels we would like to assign exhibit complex internal structure, such as labelling a sentence with its parse tree, and there may be an exponential number of them to choose from. Structured classification seeks to exploit the structure of the labels to allow both generalisation across labels that differ only slightly and tractable search over all possible labels. In this thesis we focus on the application of conditional random field (CRF) models (Lafferty et al., 2001). These models assign an undirected graphical structure to the labels of the classification task and leverage dynamic programming algorithms to efficiently identify the optimal label for a given input. We develop a range of models for two multilingual NLP applications: word alignment for statistical machine translation (SMT), and multilingual supertagging for highly lexicalised grammars.
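A minimal sketch of the dynamic programming at the heart of CRF-style structured classification: given per-position label scores (emissions) and label-to-label transition scores, Viterbi search recovers the highest-scoring label sequence without enumerating the exponentially many candidates. The scores here are random placeholders rather than features learned by the models in the thesis.

```python
# Viterbi decoding over a linear-chain label structure.
import numpy as np

def viterbi(emissions: np.ndarray, transitions: np.ndarray) -> list[int]:
    """emissions: (seq_len, n_labels) scores; transitions: (n_labels, n_labels)."""
    seq_len, n_labels = emissions.shape
    score = emissions[0].copy()
    backpointers = np.zeros((seq_len, n_labels), dtype=int)
    for t in range(1, seq_len):
        # best previous label for every current label, in one vectorised step
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backpointers[t, best[-1]]))
    return best[::-1]

rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(6, 4)), rng.normal(size=(4, 4))))
```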