Computing and Information Systems - Theses
-
Automatic Quality Assurance for Biological Databases
Chen, Jiyu (2023-08)

Modern biological research relies on gene-centric biological databases, particularly gene ontology annotation resources. These resources provide functional annotations of genes using the standardized Gene Ontology vocabulary. However, quality issues within gene ontology annotations pose challenges to biological research tasks and lead to cascading errors in databases. The existing manual quality assurance approach is labor-intensive and cannot efficiently scale to handle the sheer volume of gene ontology annotations. To address this, I investigate automating quality assurance for gene ontology annotation and propose a solution through the development of a prototype automatic inconsistency detection system. In response to the research questions, a typology of inconsistencies in gene ontology annotation is defined (RQ1). This is followed by the establishment of a novel automation framework that employs natural language processing algorithms in a human-in-the-loop setting to detect typed semantic inconsistencies (RQ2). Multiple approaches for integrating background knowledge to improve detection are developed (RQ3). Strategies for interpreting detected results are proposed, with the potential to serve as standards for evaluating automatic database assurance systems covering broad quality issues (RQ4). The study's contributions extend beyond gene ontology annotation, holding broad implications for quality assurance in diverse biological domains, including protein function prediction, disease association studies, and pathway analysis.
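The flavour of such a typed inconsistency check can be sketched in a few lines. This is an illustrative reconstruction, not the thesis's detection system: the GO term IDs and the disjointness table below are hypothetical examples of one inconsistency type.

```python
# Minimal sketch of one inconsistency type: a gene annotated with two GO
# terms treated as mutually exclusive. The pair below (nucleus vs
# extracellular region) and the table itself are illustrative assumptions.
DISJOINT = {("GO:0005634", "GO:0005576")}

def find_disjoint_annotations(annotations):
    """annotations: mapping of gene symbol -> set of GO term IDs."""
    issues = []
    for gene, terms in annotations.items():
        for a, b in DISJOINT:
            if a in terms and b in terms:
                issues.append((gene, a, b))
    return issues

print(find_disjoint_annotations({"BRCA1": {"GO:0005634", "GO:0005576"}}))
```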
-
Efficient interval reasoning for floating-point arithmetic
Nazecic-Andrlon, Mak (2023-06)

Reasoning about floating-point numbers is notoriously difficult, since rounding arithmetic operations destroys or weakens their algebraic properties (e.g. associativity). As such, available interval propagation methods often exhibit slow convergence even on simple problems. In this work, we develop new theoretical results allowing us to efficiently compute optimal bounding intervals (optimal in the sense that any subinterval omits at least one solution) for the arguments of floating-point sums, products, and quotients. This allows us to directly solve equations of the form f(x, y) = z, where the variables are bounded by intervals and f is rounded addition, subtraction, multiplication, or division, and thus solve the convergence problem.
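The kind of propagation involved can be illustrated with a minimal sketch for rounded addition, assuming round-to-nearest and ignoring tie-breaking at exact midpoints. It computes a sound, outward-rounded enclosure for x given fl(x + y) = z; the thesis's contribution is computing the optimal such intervals, which this sketch does not attempt.

```python
import math

def rounding_preimage(z: float):
    """Approximate real interval whose members round (to nearest) to the
    float z: bounded by the midpoints between z and its float neighbours."""
    lo = (math.nextafter(z, -math.inf) + z) / 2.0
    hi = (math.nextafter(z, math.inf) + z) / 2.0
    return lo, hi

def x_bounds_for_rounded_add(z: float, y_lo: float, y_hi: float):
    """Interval containing every x with fl(x + y) == z for some y in
    [y_lo, y_hi]; endpoints are rounded outward for soundness."""
    lo, hi = rounding_preimage(z)
    return (math.nextafter(lo - y_hi, -math.inf),
            math.nextafter(hi - y_lo, math.inf))

print(x_bounds_for_rounded_add(1.0, 0.25, 0.5))
```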
-
Learning Spatial Indices Efficiently
Liu, Guanli (2023-06)

Machine learning and database management systems have been extensively researched for many years. A database management system, while supporting data persistence and stable queries, often encounters efficiency issues when building indices for querying big data. In recent years, machine learning has gradually been adopted in databases to address efficiency issues such as knob tuning, index selection, cost estimation, and index building, with most of these problems solved using regression models and reinforcement learning. Using machine learning techniques for index building has led to a new type of index structure named the learned index, which has shown better query performance than traditional indices, e.g., B-trees and R-trees. However, the efficiency of building learned indices remains a key challenge, especially for large-scale datasets. The reason is that index building is based on the whole dataset, which requires a full dataset scan in each epoch, and building a learned index takes at least one epoch. Existing learned indices suffer from this issue, which hinders their application in database management systems.

In this thesis, we address this efficiency issue for index learning over a special type of data, spatial data, i.e., data associated with geographical location information, the volume of which is rapidly growing due to the prevalence of smart mobile devices, the Internet of Things, and 5G networks. We study four research problems on effective and efficient spatial data indexing using machine learning techniques.

The first problem is an empirical study of two learned spatial indices, RSMI and ZM, which support point, range, and kNN queries. These indices have been reported to offer better query performance than a traditional spatial index, the R-tree. However, there was no open-source code or extensive empirical basis supporting the testing and evaluation of these learned indices against traditional spatial indices on large real-world datasets. We address this by offering an implementation of these learned indices. Based on the implementation, we present a thorough empirical analysis of the advantages and disadvantages of learned spatial indices, highlighting their significant build times and motivating our studies on optimizing the build-time efficiency of learned spatial indices.

In the second research problem, we address the efficiency issue in learning spatial indices by proposing an index learning framework called ELSI. The key idea of ELSI is that learning from a smaller dataset can yield query performance similar to that obtained using the full input dataset. Experiments on real datasets with over 100 million points show that ELSI can reduce the build times of four different learned spatial indices consistently, and by up to two orders of magnitude, without jeopardizing query efficiency.

In the third research problem, we propose to pre-train index models offline and only fine-tune them online, to accelerate the building of learned (spatial) indices. The results show that this improves the build time of learned one-dimensional indices by 30.4% and improves lookup efficiency by up to 24.4% on real datasets and 22.5% on skewed synthetic datasets. When applied to spatial data, the technique speeds up learned spatial index building by two orders of magnitude, while lookup efficiency can also be increased by up to 13%.
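To ground the idea of a learned index, the sketch below fits a single linear model to the key-to-position mapping (an approximate CDF) and records the worst-case prediction error so that lookups only search a bounded range. This is a one-model illustration under simplifying assumptions; real learned indices such as RMI-style designs use hierarchies of models, and it is not the thesis's construction.

```python
import numpy as np

def build_learned_index(keys):
    """Fit key -> sorted position with one linear model and record the
    maximum prediction error, so lookups can do a bounded local search."""
    keys = np.sort(np.asarray(keys))
    pos = np.arange(len(keys))
    slope, intercept = np.polyfit(keys, pos, 1)
    err = int(np.max(np.abs(slope * keys + intercept - pos))) + 1
    return keys, slope, intercept, err

def lookup(index, key):
    keys, slope, intercept, err = index
    guess = int(slope * key + intercept)          # predicted position
    lo = max(0, guess - err)                      # the error bound limits
    hi = min(len(keys), guess + err + 1)          # the region searched
    off = lo + np.searchsorted(keys[lo:hi], key)
    return off if off < len(keys) and keys[off] == key else None

keys = np.random.randint(0, 10**6, size=10**4)
idx = build_learned_index(keys)
print(lookup(idx, int(keys[0])))  # position of the key in sorted order
```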
While learned spatial indices have shown strong query performance, their structure and query algorithms differ drastically from those of traditional indices, which are well supported by off-the-shelf database systems. To reduce the additional overhead of replacing traditional indices with learned spatial indices, in the fourth research problem we study applying learning-based techniques to optimize the structure of traditional spatial indices. We focus on indices based on space-filling curves (SFCs). SFCs are a classic technique for transforming multidimensional (e.g., spatial) data into one-dimensional values for indexing. The choice of SFC for indexing a particular dataset and query workload has a significant impact on the query performance of the resultant index. We propose algorithms that estimate the query performance of an SFC-based index without actually building the index, thus enabling efficient computation of a query-optimized SFC-based index. Experiments show that our cost estimation algorithms are over an order of magnitude faster than naive methods. The computed SFC indices outperform competing indices based on two classic types of SFCs, i.e., Z-curves and Hilbert curves, in nearly all settings considered.
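As a concrete reference point for SFC-based indexing, the following is the textbook bit-interleaving construction of Z-curve (Morton) values; it shows the mapping such an index orders data by, not the thesis's cost estimation algorithms.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Z-curve (Morton) value: interleave the bits of x and y so nearby
    2D points tend to be nearby in the 1D order."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return z

# Sorting by z_value gives the 1D order an SFC-based index stores.
points = [(3, 5), (10, 1), (3, 6)]
print(sorted(points, key=lambda p: z_value(*p)))
```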
-
Source-Free Transductive Transfer Learning for Structured Prediction
Kurniawan, Kemal Maulana (2023-07)

Current transfer learning approaches rest on two strong assumptions: the source domain data is available, and the target domain has labelled data. These assumptions are problematic when the source domain data is private and the target domain has no labelled data. We therefore consider the source-free unsupervised transfer setup, in which both assumptions are violated, across both languages and domains (genres).

To transfer structured prediction models in the source-free setting, we propose two methods: Parsimonious Parser Transfer (PPT), designed for single-source transfer of dependency parsers across languages, and PPTX, the multi-source version of PPT. Both methods outperform baselines. We then improve PPTX with logarithmic opinion pooling (PPTX-LOP) and find that it is an effective multi-source transfer method for structured prediction in general.

Next, we study whether our proposed source-free transfer methods provide improvements when pretrained language models (PTLMs) are employed. We first propose Parsimonious Transfer for Sequence Tagging (PTST), a variation of PPT designed for sequence tagging. We then evaluate PTST and PPTX-LOP on domain adaptation of semantic tasks using PTLMs, and show that for globally normalised models, PTST and PPTX-LOP improve precision and recall respectively.

Besides unlabelled data, the target domain may have models trained on various tasks (but not the task of interest). To investigate whether these models can be used to improve performance in source-free transfer, we propose two methods, and find that with one of them, leveraging these models can improve recall over direct transfer. Finally, we critically discuss and conclude the findings of this thesis, cover relevant subsequent work, and close with a discussion of limitations and future work.
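Logarithmic opinion pooling combines the source models' distributions via a weighted geometric mean. A minimal sketch of the generic operation, assuming flat label distributions rather than the structured outputs PPTX-LOP actually pools over:

```python
import numpy as np

def log_opinion_pool(dists, weights=None):
    """Weighted geometric mean of K distributions over the same label set,
    renormalised; dists has shape (K, num_labels)."""
    dists = np.asarray(dists)
    k = dists.shape[0]
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights)
    log_pool = w @ np.log(dists + 1e-12)   # weighted sum of log-probs
    pool = np.exp(log_pool - log_pool.max())
    return pool / pool.sum()

# Three source models' distributions over three labels for one decision:
print(log_opinion_pool([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]))
```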
-
Assessing and Improving Fairness in Models of Human Language
Han, Xudong (2023-06)

Models of human language often learn and amplify dataset biases, leading to discrimination such as opportunity inequality in job applications. In this thesis, we aim to assess and improve fairness in natural language processing (NLP). Overall, we contribute to the field of fairness in NLP by offering guidance on designing fairness metrics to assess fairness, mitigating bias in text representations and training datasets to improve fairness, and developing tools to benchmark fairness research.

The first key contribution of the thesis is assessment: how to quantify fairness. While several metrics have been proposed to evaluate fairness based on different assumptions about the nature of fairness, there is little work clarifying what those assumptions are, or guiding the selection of evaluation metrics from a fundamental understanding of fairness. To address this, we propose a generalized aggregation framework that makes explicit the assumptions about fairness underlying a set of existing fairness metrics. Based on these assumptions, we provide recommendations to standardize further research on fairness. We also present two novel metrics to quantitatively measure performance-fairness trade-offs, which are shown to be essential for systematic model selection and comparison.

Then we propose to improve fairness by mitigating bias in training datasets, hypothesizing that models trained over debiased datasets should make fairer predictions. To test this, we first adopt long-tail learning methods for dataset bias mitigation and show that our novel balanced training method substantially improves fairness. Together with other existing debiasing methods, the dataset debiasing method is then employed in conjunction with dataset distillation in our proposed framework, resulting in better fairness at reduced training cost.

NLP models are also shown to unintentionally encode protected information even when such attributes are not provided as explicit inputs during model training (protected information leakage). We propose novel adversarial learning methods to mitigate leakage by removing protected information from text representations. In particular, we employ an ensemble training algorithm to improve the stability of adversarial learning and introduce additional orthogonality constraints between adversaries to increase representational fairness. Not only does the proposed method improve fairness over standard adversarial debiasing, it also improves performance-fairness trade-offs and stability simultaneously.

A key assumption made by most debiasing methods is that the protected attributes of training instances are annotated in the dataset, which often does not hold in real-world applications. To address this, we decouple the training of the discriminators from the main task model in adversarial debiasing and find that a small number of protected-labelled instances is sufficient for our proposed method to achieve results comparable with standard adversarial learning. We further improve on this method with a meta-algorithm that identifies regions of the hidden space that consistently under- or over-perform, deriving a clustering based on protected information leakage and training behaviour.
Extensive experiments show that such cluster information can be used with existing supervised debiasing methods to mitigate bias without observing protected labels.

Despite these achievements in assessing and improving fairness, reproducing and evaluating the methods can be difficult. Therefore, the last contribution of this thesis is FairLib, a unified framework for assessing and improving fairness. FairLib is an open-source Python library that can be used to assess benchmark datasets, assess and improve fairness with a wide range of built-in methods, and compare and visualize results. One immediate benefit of FairLib is the systematic evaluation of debiasing methods, which have generally been examined over only a few benchmark datasets under narrow distributions. Because dataset bias plays such an important role in fairness, we additionally generalize the dataset debiasing methods to simulate different data conditions on top of FairLib, conducting a systematic evaluation and establishing benchmarks for debiasing methods.
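As a concrete example of the kind of metric such a framework generalizes over, the sketch below computes per-group true-positive-rate gaps and aggregates them by root mean square. Both the gap definition and the aggregation are common choices assumed here for illustration, not necessarily the thesis's metrics.

```python
import numpy as np

def tpr_gap(y_true, y_pred, groups):
    """Overall TPR plus the RMS gap between each group's TPR and the
    overall TPR (0 means equal opportunity across groups)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    pos = y_true == 1
    overall = (y_pred[pos] == 1).mean()
    gaps = [(y_pred[pos & (groups == g)] == 1).mean() - overall
            for g in np.unique(groups) if (pos & (groups == g)).any()]
    return overall, float(np.sqrt(np.mean(np.square(gaps))))

print(tpr_gap([1, 1, 1, 0, 1], [1, 0, 1, 0, 1], ["a", "a", "b", "b", "b"]))
```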
-
Capturing Uncertainty in Ensemble Models For Human-Machine Collaboration
Maadi, Mansoureh (2023-08)

This thesis studies capturing uncertainty in ensemble models, starting with machine-only models and then moving to human-machine collaboration from new perspectives. We identify two research gaps in previous studies on capturing uncertainty in ensemble models. First, in machine decision making, while several studies have introduced combining approaches that deal with inter-source uncertainty in ensemble models, capturing intra-source uncertainty remains to be addressed. Second, the collaboration of humans with machines in ensemble models introduces the new challenge of integrating human uncertainty within these models. Furthermore, addressing real-world decision making problems through human-machine collaboration demands dedicated research to investigate each collaborative setting thoroughly; each case study in human-machine collaboration for ensemble models therefore requires new approaches.

This thesis addresses the first research gap in the context of ensemble classifiers, in two settings. First, an interval modelling approach is developed to combine classifiers in a category of ensemble models so as to capture inter-source and intra-source uncertainties. Through the proposed model, the performance of the ensemble can be improved by capturing uncertainty in complicated binary classification problems. Second, the ensemble selection problem is studied for bagging, one of the most widely used ensemble models in the literature. While intra-source uncertainty has previously been overlooked as a selection criterion for classifiers in an ensemble, we show that it substantially enhances the performance of the selected ensemble. This study formulates ensemble selection for bagging as a bi-objective optimisation problem and presents an adaptive meta-heuristic algorithm to solve it. The findings highlight the significance of incorporating intra-source uncertainty into the classifier selection process, leading to improved ensemble model performance.

We then turn to the second research gap, on human-machine collaboration in ensemble models. Specifically, we present interval modelling approaches to capture uncertainty in ensemble models where both humans and machines make decisions. Through two case studies, we show how this collaboration and interval modelling enhance ensemble performance by capturing uncertainties from both humans and machines. In the first case study, synthetic data is used to show how human-machine collaboration and the capture of human and machine uncertainty can be conducted throughout ensemble decision making. In the second case study, a real image dataset for biofouling assessment is used to investigate the importance of human-machine collaboration in ensemble models for biofouling detection, with interval modelling employed to examine how capturing uncertainty affects ensemble performance.

In summary, this thesis advances the current state of ensemble models by presenting effective interval modelling techniques to capture uncertainty and by investigating human-machine collaboration within these models. These contributions aim to enhance ensemble performance, a critical step before deploying these models on real-world decision making problems.
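A minimal sketch of interval-valued combination: each ensemble member reports a probability interval for the positive class (its intra-source uncertainty), and a simple endpoint-averaging pool produces an aggregate interval and a decision. The pooling rule is one assumed choice among many, not the thesis's model.

```python
import numpy as np

def interval_vote(prob_intervals):
    """prob_intervals: (n_members, 2) array of [lo, hi] probabilities for
    class 1; aggregate by averaging endpoints, decide from the midpoint."""
    arr = np.asarray(prob_intervals, dtype=float)
    lo, hi = arr[:, 0].mean(), arr[:, 1].mean()
    return (lo, hi), int((lo + hi) / 2 >= 0.5)

# Three classifiers, each with an uncertain probability for class 1:
print(interval_vote([(0.55, 0.75), (0.40, 0.80), (0.60, 0.70)]))
```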
-
Efficient Algorithms for Safer Traffic Management in the Era of Connected Autonomous Vehicles
Muthugama, Thibbotuwawa Muthugamage Tharindu Lakmal (2023-05)

Safer traffic management is essential to reduce road crashes and their adverse consequences. Recent advancements in the connectivity and autonomy of the traffic ecosystem enable sophisticated techniques that optimize traffic conditions in real time, considering various goals. Although many studies adopt these technological advancements, they mainly optimize traffic efficiency by facilitating faster vehicle operations. However, to manage hazardous scenarios created by human error, technical failures, or environmental conditions, vehicles need proactive mechanisms that reduce the chance of a crash. One such proactive approach is maximizing the space and time between vehicles to facilitate hazard perception and graceful emergency interventions. However, such spacing optimization must be wary of travel time degradation so as to maintain traffic efficiency. The requirement to balance two objectives (traffic safety and efficiency) alongside vehicle motion and spacing considerations makes the problem significantly challenging. This thesis therefore investigates proactive road network-level safety optimization solutions for the Connected Autonomous Vehicles (CAVs) era.

Spatio-temporal data from vehicles help generate a safer traffic plan by determining optimum inter-vehicular spacing levels over time. We first present a novel temporal graph called the Platooning Graph (PG), which represents inter-vehicular spacing levels over time using graph edges and helps control vehicles. We call the PG equivalent to a safety-optimal motion plan the Safest PG (SPG) and propose a heuristic algorithm to approximate the SPG within its large solution space. The algorithm starts by representing overlaps/conflicts between vehicle routes and deciding the precedence of vehicles based on their arrival order. It then uses the vehicle precedence information with a car-following model to space out vehicles iteratively and derive the SPG. Using microscopic traffic simulations, we show the effectiveness of the SPG in optimizing safety with minimal impact on traffic efficiency.

The SPG captures the elements fundamental to our optimization problem and has reasonable processing time for smaller traffic volumes. However, it cannot sustain the more frequent computation required to handle larger traffic volumes in real-time scenarios. To handle considerable traffic volumes in real time, we present a framework called Conflict Zone Graph based Motion Planning (CZMP), which has three components. The first component, the Conflict Zone Graph (CZG), maintains comprehensive information that avoids the heavy-weight temporal information of the SPG. The second component plans delays for vehicles to distribute the available spatio-temporal space in the road network effectively among traffic. The delays modify the arrival times of vehicles at conflicting locations and yield a more safety-oriented yet efficient vehicle precedence order compared to the SPG's simple arrival-based precedence policy. The third component plans vehicle motions following the obtained delays and other information in the CZG. The motion planning algorithm uses a direct technique and avoids the iterative process of the SPG, reducing processing time significantly. Our framework modularizes each step of the workflow for better usability and extensibility.
Using microscopic traffic simulations, we show the effectiveness and efficiency of the CZMP framework for real-time handling of considerably higher traffic volumes. However, the centralized computing architecture and the complete dependence on online optimization procedures limit the scalability of the CZMP framework to large road networks. To handle traffic in large road networks in a scalable manner, we consider an Autonomous Intersection Management (AIM) approach, which coordinates CAV trajectories around each intersection. Although many AIM solutions exist in the literature, they are efficiency-focused and vulnerable to severe safety issues such as queue spillbacks. We therefore propose a Deep Reinforcement Learning (Deep-RL) based Safer AIM (S-AIM) solution, which optimizes traffic safety and flow concurrently in real time while handling queue spillbacks. Deep-RL lets S-AIM efficiently pick traffic control actions online based on a policy learnt by simulating many scenarios offline, reducing S-AIM's dependence on online procedures and improving scalability. S-AIM also uses a two-level Deep-RL agent architecture (intersection and vehicle levels) to further increase scalability. The intersection-level agents generate safer vehicle trajectory plans by combining a novel queue polling mechanism with a road-capacity-aware queue spillback handling mechanism and a cubic Bezier curve based Deep-RL policy. The vehicle-level Deep-RL agents control the actual vehicle trajectories to match those plans. As the final contribution of this thesis, we compare S-AIM against state-of-the-art solutions using realistic microscopic traffic simulations and show its effectiveness in achieving a high safety level with minimal impact on travel time.

The research outcomes of this thesis show that the traffic safety of road networks can be optimized with minimal impact on travel time by processing the spatio-temporal data available from CAVs. Our contributions further show that efficient algorithms can be developed for real-time optimization and for scenarios in large-scale road networks. Most importantly, the findings provide a fundamental theoretical background for similar use cases in transportation and a range of other application domains.
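For reference, a standard car-following rule of the kind the SPG algorithm could plug in is the Intelligent Driver Model (IDM); the thesis does not name its model, and the parameter values below are illustrative defaults.

```python
def idm_accel(v, v_lead, gap, v0=15.0, T=1.5, a=1.0, b=2.0, s0=2.0):
    """IDM acceleration (m/s^2): v and v_lead in m/s, gap in m.
    T is the desired time headway; larger T spaces vehicles further apart,
    which is the lever a safety-oriented plan would adjust."""
    s_star = s0 + v * T + v * (v - v_lead) / (2 * (a * b) ** 0.5)
    return a * (1 - (v / v0) ** 4 - (s_star / gap) ** 2)

# Follower at 12 m/s closing on a leader at 10 m/s, 20 m ahead: decelerate.
print(idm_accel(v=12.0, v_lead=10.0, gap=20.0))
```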
-
Designing Proactive Smart Speakers and Their Application in Voice Surveys
Wei, Jing (2023-05)

Smart speakers are a technology that has emerged in recent years. They provide an interface that allows users to request information or control smart devices via natural language. Most existing smart speakers exhibit a passive interaction mode: they must be activated by voice commands before acting upon user requests. This passive mode of interaction limits the potential application scenarios of smart speakers. If smart speakers could proactively approach users, they could actively engage them by offering suggestions, sending well-timed reminders, and collecting self-reports for reflection through conversation. This thesis therefore investigates the implementation of proactive smart speakers and their use in data collection through voice surveys. We build a proactive smart speaker prototype for this research and investigate the opportune moments for, and user perceptions of, proactive interactions. We further explore interaction errors and user obstacles with proactive smart speakers through an in-depth quantitative analysis of various in-situ interaction data. Drawing upon our own findings and prior work, we deploy voice surveys on smart speakers and evaluate the reliability and validity of different question types. Through building a proactive smart speaker prototype, evaluating user experience in the wild, and developing a voice survey application, this thesis presents findings from empirical data on proactive smart speakers and voice applications and discusses the corresponding design implications.
-
An empirical analysis of the systemic performance of decentralised cryptocurrency systems
Yang, Renlord Nicholas (2022-11)

Decentralised cryptocurrency networks have gained significant adoption over the years and attracted significant user interest. These networks operate peer-to-peer at the protocol level with minimal trust assumptions. Networks such as Bitcoin and Ethereum require consensus participants to maintain a full copy of the blockchain to derive the state needed to verify user transactions. Due to surging demand and popularity, these networks face scalability issues when processing transactions, as their throughput is bounded by the consensus participant that provides the least throughput. Central to the health of a decentralised cryptocurrency network are its security, its scalability, a decentralised state in which no central authority is relied upon, and minimal trust assumptions. Maintaining these tenets, however, depends on user participation. This thesis is centred on the hypothesis that user participation is hindered by barriers to entry imposed by the high compute resource requirements users face when bootstrapping their clients to participate in the network with minimal trust assumptions. To that end, our research investigated in detail where compute resources are spent and how client implementations can be improved to reduce compute resource use, or at least yield higher transaction verification throughput during the bootstrap process, so as to reduce the time users need to complete it.

First, we formally analysed how state synchronisation works under the hood and discussed the trade-offs of various implementation-specific state synchronisation optimisation techniques. In our measurement and analysis, we identified that performance decays as state data increases, which threatens the scalability of a network comprised of independent node operators. Our work measures and shows that although state and blockchain size grow at a linear pace, run-time performance degrades at an exponential scale as users verify towards the block tip. We concluded that under the current trust models, it is infeasible to apply system optimisations that scale well with the increasing workload incurred by the unbounded growth of the blockchain.

Secondly, we thoroughly studied the Ethereum gas mechanism. We collected detailed EVM traces for each transaction, with opcode execution times, measured on two disparate machines with defined specifications. We then identified that some EVM opcodes are mispriced, particularly those involving disk I/O, presenting potential denial-of-service vectors. Subsequently, we analysed the effects of I/O and the impact of the in-memory caching strategies employed by the reference client to mitigate I/O costs. We show that the current gas cost model favours more performant machines, which we identify as an ongoing threat to node diversity and decentralisation. We confirmed potential denial-of-service vectors resulting from mispriced EVM opcodes involving disk I/O. We also measured the effects of I/O on transaction costs, finding that they remain poorly captured by the current gas cost model and that their effects are worsening over time. We also found that under the current gas cost model, nodes with modest computational resources are disadvantaged compared to their better-resourced peers, which we identified as an ongoing threat to node diversity and network decentralisation.
In both of the aforementioned works, we also proposed improvements to client implementations that should lead to better performance outcomes. Finally, we present a model of stateful transaction verification and outline the requisite building blocks that make up the verification model. With this model, we propose a scheme for transaction verification that does not rely on a local state database, and we discuss the computational feasibility of replacing I/O-driven state queries against a local authenticated database with verifiable state supplied alongside transactions during verification.
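To make the gas discussion concrete, here is a minimal metering loop of the kind an EVM interpreter runs: each executed opcode is charged against the transaction's gas limit. The cost table is illustrative (roughly echoing post-Berlin values) and the trace format is hypothetical; the thesis's point is that fixed costs like these can diverge badly from measured execution time for I/O-heavy opcodes such as SLOAD.

```python
# Illustrative per-opcode gas costs; real schedules are more elaborate.
GAS_COST = {"ADD": 3, "MUL": 5, "SLOAD": 2100, "SSTORE": 20000}

def meter(trace, gas_limit):
    """Charge each opcode in an execution trace against the gas limit;
    running out of gas aborts execution, the EVM's DoS defence."""
    used = 0
    for op in trace:
        used += GAS_COST[op]
        if used > gas_limit:
            raise RuntimeError(f"out of gas at {op}: {used} > {gas_limit}")
    return used

print(meter(["ADD", "SLOAD", "MUL"], gas_limit=30_000))
```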
-
A Multi-Perspective Analysis of Personal Informatics Ecologies
Aladwan, Ahed (2023-05)

Computing visions promised invisible, engaging, and collective computing. These visions primarily describe a computing world where different artefacts work together in seamless and engaging ways, and work collectively to achieve users' goals. Yet users still face problems integrating and interconnecting different artefacts consistently. Such inconsistency prevents the proper exchange of information across these artefacts, hindering their ability to be successful partners in achieving users' goals. This thesis explores the problem of integration and interconnectivity across a system in which artefacts need to operate in collective ways, and aims to unpack how integration and interconnectivity operate within such a system: a personal informatics ecology for physical activity. The thesis investigates this ecology using a multi-perspective approach that examines it at the artefact, ecology, and data levels, utilising qualitative approaches such as content analysis, qualitative structural analysis, interpretative phenomenological analysis, and thematic analysis. Three studies employ these qualitative approaches to examine how users structure, experience, and interact with a personal informatics ecology for physical activity. This thesis makes four key contributions to the field of Human-Computer Interaction (HCI). First, it contributes an in-depth understanding of the user journey of structuring and using an ecology. Second, it contributes two conceptual frameworks, one at the artefact level and the second at the ecology and data levels. These conceptual frameworks describe key structural, experiential, and interactional aspects of users' engagement with the ecology, and form the basis for further developing a known ecological concept in HCI, the ecological layer. Third, it extends the definition of integration and interconnectivity to include an artefact's relation, role, and position with respect to other artefacts within the ecology. Fourth, it recommends unifying the different ecological terms under a single term: ecology. To examine ecologies, different analytical lenses can be employed, such as structural, informational, interactional, experiential, and socio-technical. These contributions respond to the challenge of integration and connectivity as an enabler of HCI computing visions.