Computing and Information Systems - Theses

Search Results

  • Item
    Linear and Non-linear Exact Predict+Optimize Models for Mixed Integer Programming
    Guler, Ali Ugur ( 2023-11)
    Data-driven decision making is rapidly becoming a significant part of decision-making processes. On many occasions, the coefficients of a decision problem are not known and must be predicted. Machine learning has been widely successful for prediction tasks; traditionally, however, predicting coefficients and solving the decision problem have been treated as independent tasks. Predict+Optimize integrates the prediction and optimization parts of data-driven decision making by training models using the optimization objective (known as regret). This brings two important challenges: minimizing a non-differentiable and often discrete loss function, and, for complex combinatorial problems, repeatedly solving a time-intensive (NP-hard) problem. A widespread Predict+Optimize approach is to train predictive models using a convex surrogate of the regret, which can introduce approximation errors and degrade the performance of a Predict+Optimize model. In this thesis, we propose both linear (DnL) and non-linear (ReLU-DnL) exact Predict+Optimize models that learn through the non-differentiable regret directly, without surrogate functions, building on the idea of transition points. Our DnL and ReLU-DnL models are more effective at minimizing regret than surrogate-based models, can be trained significantly faster than existing transition-point-based models, and are flexible enough to apply to a wider range of optimization problems.
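The regret loss described above can be illustrated with a toy sketch: pick the single best item using predicted coefficients, then measure the shortfall in the true objective. This is a minimal illustration of the loss, not the thesis's models; the function names and values are invented.

```python
# Hypothetical sketch of the regret loss in Predict+Optimize
# (toy problem: select the one item with the highest value).

def solve(values):
    """Oracle: pick the single item with the highest (predicted or true) value."""
    return max(range(len(values)), key=lambda i: values[i])

def regret(true_values, predicted_values):
    """Objective loss of optimizing with predictions instead of true coefficients."""
    best = true_values[solve(true_values)]           # optimal objective value
    achieved = true_values[solve(predicted_values)]  # objective of prediction-driven decision
    return best - achieved                           # non-negative; zero when decisions are equally good

true_v = [3.0, 5.0, 4.0]
pred_v = [6.0, 2.0, 4.5]       # misprediction: item 0 is chosen instead of item 1
print(regret(true_v, pred_v))  # 5.0 - 3.0 = 2.0
```

Because `solve` involves a discrete argmax, `regret` is piecewise constant in the predictions — which is exactly why it is non-differentiable and why transition points matter.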
  • Item
    On the Use of Progressive Matrix Problems in Understanding Abstraction and Generalisation in Vision Systems
    Spratley, Steven Roger ( 2024-02)
    Abstract reasoning is a hallmark of generally-intelligent agents, and is the primary aptitude tested for by progressive matrix problems (PMPs), long held to be a reliable indicator of cognitive ability. In the last five years, PMPs have been applied to the creation and evaluation of deep-learnt computer vision systems, with the goal of better modelling such reasoning abilities. While this is a promising direction, it is nascent and has experienced several shortcomings, the most severe being an ironic lack of awareness of the brittleness to out-of-distribution data exhibited by the deep-learning paradigm; a brittleness that PMP datasets were created to help address. This has resulted in systems taking "shortcuts" over datasets, exploiting non-robust features, often without the immediate knowledge of the research community. This thesis furthers the effective use of PMPs in this space by deepening the understanding and appreciation of key themes including abstraction, analogical reasoning, generalisation, and inference. It expounds upon why such themes are of crucial importance to the future of computer vision — indeed, to all artificial intelligence research — and contributes model architectures, PMP datasets, methodological developments, and broad interdisciplinary discussion, all working towards their promotion and evaluation. In doing so, this work advances the development of vision systems that can more robustly demonstrate the ability to reason in novel environments.
  • Item
    Machine Learning Models for Vaccine Development and Immunotherapy
    Moreira da Silva, Bruna ( 2023-11)
    Therapeutic antibodies offer exceptional specificity and affinity to detect and eliminate antigens, making them valuable as therapeutics and in diagnostics. Antigen recognition and neutralisation are based on efficient binding to epitopes, the antigen regions recognised by antibodies that elicit an immune response. The identification and mapping of epitopes, however, remain dependent on resource-intensive experimental techniques that do not scale adequately given the vast search space and diversity of antigens. Epitope identification and prioritisation is a cornerstone of immunotherapies, antibody design, and vaccine development. Computational approaches, driven in the past decade chiefly by machine learning algorithms, have made consistent progress in improving in silico epitope prediction at scale. Yet low predictive power and data sets skewed towards specific pathogens can still be observed. This thesis focused on better exploiting publicly available experimental antibody-antigen data, improving the modelling and identification of distinguishing epitope features that derive meaningful biological insights. On this basis, I curated high-quality data from multiple resources, resulting in large-scale and non-redundant epitope data sets. In addition, I proposed novel featurisation techniques grounded in graph-based approaches to model and discriminate epitopes from the remainder of the antigen surface, which were demonstrated to differentiate the two classes. Finally, I leveraged machine learning algorithms and data analysis to build more predictive and explainable models, which have been translated and made available as easy-to-use web servers with Application Programming Interfaces for programmatic access and integration into bioinformatics pipelines.
By exploring these advanced computational methods, this thesis significantly contributes to improving the prediction of B-cell epitopes, leading to a better understanding of antibody targets, which I believe will facilitate the ongoing development of therapeutics and diagnostics.
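The graph-based featurisation mentioned above can be sketched in miniature: residues become nodes, spatial proximity becomes edges, and simple graph statistics (here, node degree) become features. The coordinates, cutoff, and feature choice below are invented for illustration and do not reproduce the thesis's actual method.

```python
# Toy illustration of a graph-based surface featurisation
# (hypothetical cutoff and coordinates; degree used as a stand-in feature).

def contact_graph(coords, cutoff=2.0):
    """Build an adjacency list linking residues whose centroids lie within `cutoff`."""
    n = len(coords)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            dist = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])) ** 0.5
            if dist <= cutoff:
                adj[i].append(j)
                adj[j].append(i)
    return adj

# Four residue centroids on a line, 1.0 apart; neighbours within 2.0 units.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
adj = contact_graph(coords)
degrees = {i: len(nb) for i, nb in adj.items()}
print(degrees)  # {0: 2, 1: 3, 2: 3, 3: 2}
```

Per-node features of this kind can then feed a classifier that labels each surface residue as epitope or non-epitope.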
  • Item
    Detection and Analysis of Climate Change Scepticism
    Bhatia, Shraey ( 2024-01)
    Climate change, predominantly driven by human activities, poses a threat through effects such as rising sea levels, melting ice caps, extreme droughts, and species extinction. The IPCC's 5th and 6th reports highlight the urgency of limiting global warming, with the latter projecting a concerning 1.5°C rise by 2040. Despite scientific consensus, the digital sphere is inundated with content that fuels scepticism, often sponsored by specific lobby groups. These articles, under the umbrella term of climate change scepticism (CCS), weave a blend of misinformation, propaganda, hoaxes and sensationalism, undermining collective climate action. This thesis aims to offer strategies to address this misleading narrative. We probe CCS through four dimensions: (1) understanding the underlying themes in the data, (2) detecting CCS articles, (3) understanding and detecting the framing and neutralization tactics used to construct CCS narratives, and (4) fact-checking the veracity of claims and elucidating the reasons for potential inaccuracies. A notable challenge across these tasks is the limited availability of data; throughout this thesis, we leverage advancements in natural language processing (NLP) to mitigate this. Pre-trained language models (PLMs) and their scaled counterparts, large language models (LLMs), have revolutionized our capacity to comprehend and generate text that mirrors human language. These models, adept at learning real-world knowledge and semantics from extensive datasets, prove extraordinarily effective over a diverse range of language tasks. Topic models distil document collections into key themes, represented by groups of words or "topics", without the need for human labelling or any a priori notion of the collection's content.
In essence, they offer a means of exposing the underlying themes in the documents. Each document typically aligns with one or several themes, but capturing the essence of the collection's context remains a challenge. In this thesis, we introduce methods that enhance the quality of topic outputs to better mirror the context of document collections. For the detection of CCS articles, no dataset was available in this domain. We bridge this gap by scraping and compiling a dataset of articles known to exhibit climate change scepticism. By extending the training of PLMs on this dataset, we enhance their ability to discern the stylistic and linguistic elements of CCS, which allows the models not only to distinguish between CCS and non-CCS articles but also to highlight misleading spans indicative of scepticism. To delve deeper into the intricacies of CCS narratives, we analyze their argumentative framing, employing techniques of framing and neutralization and casting the problem as multi-task classification. We propose an annotation task and collect human judgements. Given that data collection is resource-intensive, we leverage unlabelled data in a semi-supervised setting, achieving substantial performance gains. Finally, we tackle the task of explanation generation to detail the reasons behind a claim's inaccuracies. Using LLMs in a retrieval-augmented approach, we connect the LLM to an external knowledge source, such as peer-reviewed papers, via a retriever. This retriever fetches pertinent "facts" related to the claim, enabling the LLM to both verify and explain the claim grounded in these facts. LLMs are prone to generating ungrounded information, commonly referred to as "hallucinations". We investigate approaches to detect such inaccuracies, introduce methods to reduce these hallucinations, and employ LLM-based evaluations to assess the quality of the produced content.
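The retrieval step of a retrieval-augmented fact-checking pipeline, as described above, can be sketched with a toy scorer: rank candidate evidence by overlap with the claim and pass the top hits to the model as grounding. The token-overlap scoring and example texts below are stand-ins, not the thesis's actual retriever.

```python
# Minimal sketch of the retrieval step in retrieval-augmented fact-checking
# (hypothetical token-overlap scorer standing in for a learned retriever).

def tokenize(text):
    return set(text.lower().split())

def retrieve(claim, facts, k=1):
    """Rank candidate evidence by token overlap with the claim and return
    the top k passages, to be placed in the LLM prompt as grounding."""
    scored = sorted(facts, key=lambda f: len(tokenize(claim) & tokenize(f)), reverse=True)
    return scored[:k]

facts = [
    "Global mean temperature has risen about 1.1 C since pre-industrial times",
    "Arctic sea ice extent shows a long-term declining trend",
]
claim = "global temperature has not risen since pre-industrial times"
print(retrieve(claim, facts))  # the temperature fact is the best-matching evidence
```

In a full pipeline, the retrieved passages constrain the LLM's verdict and explanation, which is what reduces ungrounded "hallucinated" output.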
  • Item
    Developing Delirium Prediction Models in Clinical Settings: Integrating Machine Learning and Natural Language Processing with Patient Data, Clinical Notes, and Antipsychotics Data
    Amjad, Sobia ( 2023-12)
    This thesis details the effort to improve delirium identification. Delirium is an acute confusional state affecting 11% to 42% of hospitalised patients and 60-80% of elderly patients. It is a serious neuropsychiatric syndrome that is frequently underdiagnosed, resulting in extended hospital stays, higher mortality rates and increased health care costs. Furthermore, because it is a cognitive condition, there is no clear-cut definition, leading to a high degree of overlap with other neuropsychiatric disorders. Electronic health record (EHR) data has significantly advanced health research, offering automated tools for predictive analysis. The objective is to leverage EHRs to develop machine learning (ML) models, including those employing natural language processing (NLP) techniques, for delirium prediction. The ML model assesses whether a patient is likely to receive an ICD-10 code for delirium within 24 to 48 hours of hospital presentation, encouraging timely interventions for at-risk patients in the wards. Data collected from tertiary-level hospitals is used to explore the key features that contribute to the performance of predictive models. The study focuses on optimising early delirium prediction, thereby refining the diagnostic process for delirium and enhancing both the timeliness and accuracy of results. The first study, "Machine Learning-Based Delirium Prediction in Hospitalized Patients Using Routinely Available Administrative and Clinical Data," developed models using data from two hospitals, demonstrating the generalizability of our predictive models. This study worked with data from around 200,000 patients. The challenge was that delirium-positive cases were rare, at less than 5%. This directed our research towards using the language in clinical notes to identify delirium. The second study, "Advancing Delirium Classification: A Clinical Notes-based Natural Language Processing-Supported Machine Learning Model," applied NLP to classify delirium at the clinical-notes level.
This method allowed us to explore delirium-suggestive words within the notes, enhancing classification accuracy beyond traditional dictionary-based approaches. After applying these methods to structured clinical data and notes, we extended them to patient-level delirium prediction. We also incorporated antipsychotic data into our prediction model in the third study, entitled "EHR-Based Delirium Prediction: A Unified Data-driven Model with Clinical Data, Notes, and Antipsychotics". Just as we faced the challenge of rare delirium cases in our first project, here we dealt with limited data on antipsychotics and notes. Despite these challenges, we navigated the complex nature of healthcare data and developed a model that leverages all data available from the initial 24 to 48 hours of hospital presentation, which delivered promising results. Our study also looked carefully at how to use drug-related information without letting it bias our delirium predictions. Classification techniques included logistic regression, extreme gradient boosting, support vector machines, and random forests. The logistic regression model is discussed in the main studies due to its common use in medical research and its acceptance within the broader artificial intelligence community. We employed visualizations to explain predictions and highlight the transparency of our predictive models, and presented visual graphs for each study to explain the models and demonstrate their reliability for delirium prediction, in order to gain the trust of healthcare professionals.
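The rarity of positive cases noted above (delirium in under 5% of patients) is typically handled by reweighting classes before fitting a model such as logistic regression. The sketch below shows the common "balanced" weighting heuristic as one illustrative option; it is not necessarily the scheme used in the thesis.

```python
# Hypothetical illustration of class weighting for rare positives
# (the "balanced" heuristic: weight inversely proportional to class frequency).

from collections import Counter

def balanced_class_weights(labels):
    """Weight each class as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

labels = [1] * 5 + [0] * 95  # ~5% positive, mirroring the rarity in the data
weights = balanced_class_weights(labels)
print(weights)  # positives weighted ~19x more than negatives in the loss
```

These per-class weights would then scale each sample's contribution to the training loss, so the few delirium-positive cases are not swamped by the majority class.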
  • Item
    Planning and Goal Recognition in Humans and Machines
    Zhang, Chenyuan ( 2023-12)
    The rapid advancement of artificial intelligence, exemplified by systems such as AlphaGo and large language models, has great potential to contribute to the development of human-like intelligence. However, fundamental differences exist between the underlying mechanisms of these systems and those of biological organisms. For instance, humans can achieve impressive performance with limited data and computing resources, while existing algorithms often require significant amounts of data and computing power for real-time operations. One of the reasons for this disparity is the human ability to plan in a model-based sense, making computational models that can capture human planning behavior valuable to bridge the gap between existing AI systems and human-like intelligence. This thesis explores the effectiveness of planning algorithms in modeling human behavior. Existing literature often overlooks timing information, and I develop a novel tree-based model that aims to capture both human action selection and human reaction times. The thesis also introduces a timing-sensitive goal recognition framework that incorporates timing information, and uses this framework to model human goal inference. My findings indicate that a Bayesian framework that incorporates a prior based on goal difficulty and a likelihood derived from an online planner accurately predicts human goal inference. This thesis underscores the promise of planning algorithms in mimicking human behavior and their utility in human-robot collaborations. More generally, it suggests that planning algorithms have an important role to play in advancing human-like intelligence.
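The Bayesian framework described above combines a difficulty-based prior over goals with a planner-derived likelihood of the observed actions. A minimal sketch of that inference rule, with invented numbers standing in for the thesis's actual prior and planner outputs:

```python
# Sketch of Bayesian goal inference: P(goal | actions) ∝ P(actions | goal) * P(goal).

def posterior(priors, likelihoods):
    """Combine prior and likelihood per goal, then normalise over all goals."""
    unnorm = {g: priors[g] * likelihoods[g] for g in priors}
    z = sum(unnorm.values())
    return {g: p / z for g, p in unnorm.items()}

# Harder goals receive lower prior mass (assumed difficulty-based prior);
# the planner scores how well the observed actions fit each goal (assumed values).
priors = {"A": 0.5, "B": 0.3, "C": 0.2}
likelihoods = {"A": 0.1, "B": 0.6, "C": 0.3}
post = posterior(priors, likelihoods)
print(max(post, key=post.get))  # B: a middling prior overcome by a strong likelihood
```

In the thesis's setting, the likelihood term would come from an online planner scoring observed actions (and their timing) under each candidate goal.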
  • Item
    Lexical Semantics of the Long Tail
    Wada, Takashi ( 2023-12)
    Natural language data is characterised by a variety of long-tail instances. For instance, whilst there exists an abundance of text data on the web for major languages such as English, there is a dearth of data for a great number of minor languages. Furthermore, the corpus data in each language usually consists of a small number of high-frequency words and a plethora of long-tail expressions that are not commonly used in text, such as scientific jargon and multiword expressions. Generally, these long-tail instances draw little attention from the research community, largely because it tends to focus on a handful of resource-rich languages and on models' overall performance on a specific task, which is in many cases not heavily influenced by the long-tail instances in text. In this thesis, we aim to shed light on the long-tail instances in language and explore NLP models that represent their lexical semantics effectively. In particular, we focus on three types of long-tail instances: extremely low-resource languages, rare words, and multiword expressions. Firstly, for extremely low-resource languages, we propose a new cross-lingual word embedding model that works well with very limited data, and show its effectiveness on the task of aligning semantically equivalent words between high- and low-resource languages. For evaluation, we conduct experiments involving three endangered languages, namely Yongning Na, Shipibo-Konibo and Griko, and demonstrate that our model performs well on real-world language data. Secondly, with regard to rare words, we first investigate how well recent embedding models capture lexical semantics in general on lexical substitution, where, given a target word in context, a model is tasked with retrieving its synonymous words.
To this end, we propose a new lexical substitution method that effectively makes use of existing embedding models, and show that it performs very well on English and Italian, especially for retrieving low-frequency substitutes. We also reveal two limitations of current embedding models: (1) they are highly affected by morphophonetic and morphosyntactic biases, such as article–noun agreement in English and Italian; and (2) they often represent rare words poorly when the words are segmented into multiple subwords. To address the second limitation, we propose a new method that performs very well in predicting synonyms of rare words, and demonstrate its effectiveness on lexical substitution and simplification. Lastly, to represent multiword expressions (MWEs) effectively, we propose a new method that paraphrases MWEs with more literal expressions that are easier to understand, e.g. swan song with final performance. Compared to previous approaches that resort to human-crafted resources such as dictionaries, our model is fully unsupervised and relies on monolingual data only, making it applicable to resource-poor languages. For evaluation, we perform experiments in two high-resource languages (English and Portuguese) and one low-resource language (Galician), and demonstrate that our model generates high-quality paraphrases of MWEs in all three languages and helps pre-trained sentence embedding models encode sentences containing MWEs by paraphrasing them with literal expressions.
  • Item
    Computational modeling of the epidemiological dynamics of the skin pathogens Group A Streptococcus and Sarcoptes scabiei
    Tellioglu, Nefel ( 2023-11)
    Sarcoptes scabiei is a skin pathogen that causes substantial health burdens in humans. An estimated 455 million people are affected by scabies annually, resulting in an estimated 3.8 million disability-adjusted life years each year. Scratching from scabies can lead to further bacterial skin infections, including Group A Streptococcus (GAS) infections, which increase the burden of scabies. GAS infections can lead to severe health conditions such as acute rheumatic fever and rheumatic heart disease. Each year, around 18 million people worldwide suffer from severe GAS-related diseases, resulting in 500,000 deaths. Sarcoptes scabiei and Group A Streptococcus are endemic in many underprivileged populations, such as Indigenous communities in Australia. A number of factors are likely to play a role in the high burden of skin pathogens in these settings, including heterogeneities in the pathogen population (pathogens having multiple strains with varying characteristics) and in the host population (populations with varying disease prevalence and transmission rates). While these factors make it difficult to manage disease burden, computational models can help us understand transmission mechanisms and control the health burden. In this thesis, I focus on Sarcoptes scabiei and GAS, aiming to understand the underlying transmission mechanisms of these skin pathogens and to provide insights into the efficacy of community-specific control strategies using computational modelling. I address three key research questions in which I investigate the impact of pathogen and host heterogeneities on disease transmission and identify effective control strategies. Controlling the spread of pathogens with multiple strains can be challenging due to strain interactions. It is uncertain what role strain interactions play in the persistence of high strain diversity in endemic settings, and what this implies for future interventions.
As my first research question, I focused on "What role do within-host dynamics play in maintaining high diversity of pathogen strains?". I developed an individual-based model with a synthetic population representing the characteristics of Indigenous communities in Australia. I discovered that within-host competition among strains can impact epidemiological dynamics. My findings revealed that within-host and between-host competition each reduce strain diversity when operating independently; however, when they function together, they can significantly increase strain diversity. My model suggested that an intervention that reduces the transmission of all strains has the potential to later increase the level of pathogen diversity, complicating the efficacy of further interventions. In addition, I discussed how this modelling framework can be adapted to investigate the impact of GAS strain interactions on population-level dynamics. To apply mass drug administrations (MDAs) in the areas that need them most, it is essential to estimate the prevalence of scabies at the community level. Currently, there is no standardised approach to estimating scabies prevalence. Given that prevalence and transmission mechanisms differ among communities, there is a need to thoroughly understand how sampling procedures aiming to assess scabies prevalence interact with the underlying epidemiology. As my second research question, I focused on "Which sampling strategies - individual, household, or school-based - are most effective for estimating the prevalence of scabies in a population?". I developed another computational model and explored the effectiveness of sampling methods for estimating the prevalence of scabies in remote Indigenous communities in Australia. I found that when there is an underlying household-specific heterogeneity in scabies prevalence, the household sampling approach introduces more variance than simple random sampling.
I concluded that while the simple random sampling approach appears more effective than other sampling methods for estimating scabies prevalence, the efficacy of surveillance strategies depends on how prevalence is distributed within the community. In addition, I built a table for use in future surveillance studies to estimate the required sampling percentage based on population size, an accuracy threshold, and a priori knowledge of scabies prevalence. To reduce the scabies burden in communities with high endemic levels of scabies, three to five annual mass drug administration (MDA) rounds are recommended by the experts convened by the World Health Organization (WHO). Because current guidelines are based only on expert opinion, the WHO recommends quantitative evaluations to assess the likely efficacy of MDA recommendations. As my third research question, I focused on "Which mass drug administration strategy is most effective for controlling scabies?". I developed an individual-based model to evaluate the efficacy of MDA strategies in decreasing the burden of scabies in Liberia. I found that while MDAs can be effective in the short and medium term, prevalence rises over longer time periods until it reaches pre-MDA levels. The modelling results also indicated that low levels of scabies prevalence can be sustained long-term when MDAs are combined with behavioural and systemic changes, such as improvements in education and access to the health care system, that shorten the time to effective scabies treatment. In this thesis, I conclude that understanding the complex dynamics of skin pathogens remains a challenging problem because of the heterogeneities in host and pathogen populations. While this thesis provides practical results for controlling skin pathogens, it also highlights the need to develop pathogen-specific and community-specific models to reduce the burden of skin pathogens.
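The finding that household sampling is noisier than simple random sampling under household-clustered prevalence can be reproduced with a toy Monte Carlo simulation. The population structure, prevalence rates, and sample sizes below are invented for illustration; only the qualitative comparison mirrors the thesis.

```python
# Toy simulation: with household-clustered prevalence, sampling whole households
# yields higher-variance prevalence estimates than simple random sampling (SRS).

import random

random.seed(0)
# 100 households of 5 people; 10 "high-burden" households have 80% prevalence, the rest 2%.
people = []
for h in range(100):
    rate = 0.8 if h < 10 else 0.02
    people += [(h, random.random() < rate) for _ in range(5)]

def srs_estimate(n=100):
    """Prevalence estimate from a simple random sample of n individuals."""
    sample = random.sample(people, n)
    return sum(case for _, case in sample) / n

def household_estimate(n_households=20):
    """Prevalence estimate from sampling whole households (same 100 people on average)."""
    hh = random.sample(range(100), n_households)
    sample = [p for p in people if p[0] in hh]
    return sum(case for _, case in sample) / len(sample)

def variance(estimator, reps=500):
    ests = [estimator() for _ in range(reps)]
    mean = sum(ests) / reps
    return sum((e - mean) ** 2 for e in ests) / reps

print(variance(srs_estimate) < variance(household_estimate))  # expected: True
```

The inflation comes from the cluster design effect: the estimate swings with how many high-burden households happen to be drawn, which SRS averages away.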
  • Item
    Generalization Lessons from Biomedical Relation Extraction using Pretrained Transformer Models
    Elangovan, Aparna ( 2023-12)
    Curating structured knowledge for storage in biomedical knowledge databases requires human experts to annotate relationships, making maintenance of these databases expensive and difficult to scale to the large quantities of information presented in scientific publications; it is challenging to ensure that the information is comprehensive and up-to-date. Hence, we investigate the generalization capabilities of state-of-the-art natural language processing (NLP) techniques to automate relation extraction and aid human curation. In NLP, deep learning-based architectures, in particular pretrained transformer models whose millions of parameters enable them to achieve state-of-the-art (SOTA) performance, have been dominating leaderboards on public benchmark datasets, with top results usually achieved by fine-tuning pretrained transformer models on the target dataset's task. In our research, we investigate the generalizability of such SOTA models, i.e. fine-tuned pretrained transformer models, in biomedical relation extraction for real-world applications where performance expectations need to hold beyond the official test sets. While our experiments focus on current SOTA models, our findings have broader implications for the generalization of NLP models and their performance evaluation. We ask the following research questions: 1. How generalizable are fine-tuned pretrained transformer models in biomedical relation extraction? 2. What factors lead to poor generalizability despite high test-set performance of fine-tuned pretrained transformer models? 3. How can we improve qualitative aspects of the training data to improve the real-world generalization performance of fine-tuned pretrained transformer models? The contributions are: 1) We identify a large performance gap, relative to the test set, when a SOTA fine-tuned pretrained transformer model is applied at large scale.
This substantial generalization gap has been neither verified nor reported in prior large-scale biomedical relation extraction studies. 2) We identify that high similarity between training and test sets, even with random splits, can result in inflated performance measurements. We suggest stratifying the test set by similarity to the training set to provide a more effective interpretation of the results and to disentangle the memorization and generalization capabilities of a model. Furthermore, we find that fine-tuned pretrained transformer models appear to rely on spurious correlations present in both training and test sets, obtaining inflated test-set performance. 3) We also find that, for a given quantity of training data, qualitative aspects can boost performance when fine-tuning pretrained transformers. More specifically, incorporating training samples that are quite similar to one another but have different ground-truth labels (we call them human-adversarials) in low to moderate proportions can boost generalization performance by up to 20 points for fine-tuned pretrained transformer models such as BERT, BioBERT and RoBERTa. On the other hand, training samples that are quite similar to one another and have the same ground-truth labels (we call them human-affables) can potentially degrade generalization performance. We thus demonstrate that merely aiming for larger quantities of training data is not sufficient to improve generalization. 4) As a result of findings 1 and 2, we propose to the NLP community that confirming linguistic capabilities as the cause of performance gains, even within the context of the test set, is crucial to generalization, adapting generalization principles from clinical studies.
We thus advocate for effective test sets and evaluation strategies, including adapting concepts such as randomized controlled trials from clinical studies to NLP to establish causation, as our experiments demonstrate how a test set constructed using the standard practice of random splits may not be sufficient to measure the generalization capabilities of a model. Overall, in this thesis, we closely examine model generalization and aim to strengthen how machine learning models are evaluated. While we do so in the context of biomedical relation extraction, where generalizability is critical, our findings are applicable to machine learning model evaluation across NLP.
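The proposed stratification of a test set by its similarity to the training set can be sketched with a simple stand-in similarity measure. Jaccard token overlap and the 0.5 threshold below are illustrative choices, not the thesis's actual metric.

```python
# Minimal sketch of stratifying test examples by maximum similarity to the
# training set, to separate memorization-prone from generalization-probing items.

def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def stratify(test_set, train_set, threshold=0.5):
    """Bin each test item as 'near' (high overlap with some training example,
    so performance may reflect memorization) or 'far' (probes generalization)."""
    near, far = [], []
    for t in test_set:
        sim = max(jaccard(t, tr) for tr in train_set)
        (near if sim >= threshold else far).append(t)
    return near, far

train = ["protein A binds protein B", "drug X inhibits kinase Y"]
test = ["protein A binds protein B strongly", "gene Z regulates pathway W"]
near, far = stratify(test, train)
print(len(near), len(far))  # 1 1
```

Reporting model accuracy separately on the "near" and "far" strata is what exposes the inflated scores that a single aggregate test-set number can hide.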