Computing and Information Systems - Theses

Search Results

Now showing 1 - 10 of 30
  • Item
    Learning Spatial Indices Efficiently
    Liu, Guanli ( 2023-06)
    Machine learning and database management systems have both been researched extensively for many years. A database management system, while supporting data persistence and stable queries, can encounter efficiency issues when building indices for querying big data. In recent years, machine learning has gradually been adopted in databases to address efficiency issues such as knob tuning, index selection, cost estimation, and index building, with most of these problems tackled using regression models and reinforcement learning. Using machine learning techniques for index building has led to a new type of index structure, the learned index, which has shown better query performance than traditional indices, e.g., B-trees and R-trees. However, the efficiency of building learned indices remains a key challenge, especially for large-scale datasets. The reason is that index building is based on the whole dataset, which requires a full dataset scan in each epoch, and building a learned index takes at least one epoch. Existing learned indices suffer from this issue, which hinders their adoption in database management systems. In this thesis, we address this efficiency issue for index learning over a special type of data, spatial data, i.e., data associated with geographical location information, the volume of which is growing rapidly due to the prevalence of smart mobile devices, the Internet of Things, and 5G networks. We study four research problems on effective and efficient spatial data indexing using machine learning techniques. The first problem is an empirical study of two learned spatial indices, RSMI and ZM, which support point, range, and kNN queries. These indices have been reported to achieve better query performance than a traditional spatial index, the R-tree. However, there has been no open-source code or established benchmark for testing and evaluating these learned indices against traditional spatial indices on large real-world datasets.
    We address this issue by offering an implementation of these learned indices. Based on the implementation, we present a thorough empirical analysis of the advantages and disadvantages of learned spatial indices, highlighting their significant build times and motivating our studies into optimizing the build time efficiency of learned spatial indices. In the second research problem, we address the efficiency issue in learning spatial indices by proposing an index learning framework called ELSI. The key idea of ELSI is that learning from a smaller dataset can yield query performance similar to learning from the full input dataset. Experiments on real datasets with over 100 million points show that ELSI can reduce the build times of four different learned spatial indices consistently, and by up to two orders of magnitude, without jeopardizing query efficiency. In the third research problem, we propose to pre-train index models offline and only fine-tune them online for index learning, to accelerate the building of learned (spatial) indices. The results show that we improve the build time of learned one-dimensional indices by 30.4% and improve lookup efficiency by up to 24.4% on real datasets and 22.5% on skewed synthetic datasets. When this technique is applied to spatial data, it speeds up learned spatial index building by two orders of magnitude, while lookup efficiency can also be increased by up to 13%. While learned spatial indices have shown strong query performance, their structure and query algorithms differ drastically from those of traditional indices, which are well supported by off-the-shelf database systems. To reduce the overhead of replacing traditional indices with learned spatial indices, in the fourth research problem, we study applying learning-based techniques to optimize the structure of traditional spatial indices. We focus on indices based on space-filling curves (SFCs).
    SFCs are a classic technique for transforming multidimensional (e.g., spatial) data into one-dimensional values for data indexing. The choice of SFC for indexing a particular dataset and query workload has a significant impact on the query performance of the resultant index. We propose algorithms that estimate the query performance of an SFC-based index without actually building the index, thus enabling efficient computation of a query-optimized SFC-based index. Experiments show that our cost estimation algorithms are over an order of magnitude faster than naive methods. The computed SFC indices outperform competing indices based on two classic types of SFCs, i.e., Z-curves and Hilbert curves, in nearly all settings considered.
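The Z-curve mentioned above maps a multidimensional point to a one-dimensional key by interleaving the bits of its coordinates. A minimal illustrative sketch (not code from the thesis) for 2D points:

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Z-curve (Morton) key: interleave the bits of x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x supplies the even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y supplies the odd bit positions
    return key

# Sorting by Z-curve key tends to place spatially close points near
# each other in the one-dimensional order, which is what an SFC-based
# index exploits.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]
ordered = sorted(points, key=lambda p: z_order(*p))
```

The choice of which curve (Z, Hilbert, or a learned alternative) generates these keys is exactly the design decision the fourth research problem optimizes.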
  • Item
    Towards Robust Medical Machine Learning
    He, Jiabo ( 2022)
    Machine learning systems have been developed to address problems in numerous domains, and medical solutions in particular have been facilitated by machine learning approaches for decades. These approaches play important roles in automated disease diagnosis, medical image processing, and auxiliary surgical operation. Despite the highly efficient diagnosis enabled by machine learning approaches, these methods may not be robust to common challenges in practical scenarios, such as special yet crucial characteristics of medical data, annotation variations among multiple experts, noisy annotations, and multi-source datasets. Such problems impede machine learning methods from being applied accurately and safely to medical tasks. In this thesis, we introduce special yet important medical problems that have not previously been brought into the spotlight. We then provide a corresponding robust machine learning solution for each problem where existing machine learning methods degrade significantly. Specifically, the first problem is similarity analysis for time series with large discontinuities, which are common in surgical time series. We propose a robust distance measure for such time series, whose large discontinuities prevent existing algorithms from accurately measuring local characteristics. Second, surgical policies provided by different surgeons for the same patient/surgery may not be exactly the same. We therefore propose the reward-penalty Dice loss (RPDL) to learn non-unique surgical segmentation regions for deep vision networks. RPDL is robust to varying annotations for the same input, which enables models to learn comprehensively from multiple experts. Third, medical datasets may consist of limited examples and noisy annotations, making it challenging to train deep learning models.
    To address this challenge, we propose alpha-IoU, a family of power Intersection over Union (IoU) losses for bounding box (bbox) regression. We show that alpha-IoU losses are more robust to small datasets and noisy bboxes in lesion detection. Fourth, large-scale medical datasets are often collected from different institutions and annotated cooperatively by a number of experts. For this setting, we build a one-stage framework, SpineOne, for detecting degenerative discs and vertebrae from spinal MRIs, which performs keypoint localization and classification simultaneously. SpineOne is a detector robust to multi-source MRI slices of varying scale, number, and quality. All four proposed machine learning approaches outperform existing baselines by a noticeable margin on their respective medical tasks. In summary, four medical issues are thoroughly investigated in this thesis, i.e., distance measurement for surgical time series with large discontinuities, surgical region segmentation with a variety of clinician annotations, lesion detection with limited examples and noisy bboxes, and anatomical keypoint detection with multi-source medical data. Towards more robust medical machine learning, we propose one robust machine learning approach for each corresponding problem.
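The basic member of the alpha-IoU family raises the IoU to a power alpha, recovering the standard IoU loss at alpha = 1. A minimal sketch for axis-aligned boxes, omitting the penalty-term variants of the family and using alpha = 3 only as an illustrative default:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def alpha_iou_loss(box_a, box_b, alpha=3.0):
    """Basic power IoU loss: 1 - IoU**alpha (alpha = 1 gives the IoU loss)."""
    return 1.0 - iou(box_a, box_b) ** alpha
```

Raising IoU to a power alpha > 1 reweights the loss towards high-IoU (nearly correct) boxes, which is what gives the family its robustness to noisy bboxes.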
  • Item
    Learning to generalise through features
    Grebenyuk, Dmitry ( 2020)
    A Markov decision process (MDP) cannot be used for learning end-to-end control policies in reinforcement learning when the dimension of the feature vectors changes from one trial to the next. For example, this occurs in an environment where the number of blocks to manipulate can vary. Because we cannot learn a different policy for each number of blocks, we suggest framing the problem as a POMDP instead of an MDP. This allows us to construct a constant observation space over a dynamic state space. There are two ways to achieve such a construction. First, we can design a hand-crafted set of observations for a particular problem. However, that set cannot be readily transferred to another problem, and it often requires domain-dependent knowledge. Alternatively, a set of observations can be deduced from visual observations. This approach is universal, and it allows us to easily incorporate the geometry of the problem into the observations, which can be challenging to hard-code in the former method. In this thesis, we examine both of these methods. Our goal is to learn policies that generalise to new tasks. First, we show that a more general observation space can improve the performance of policies tested on tasks not seen during training. Second, we show that meaningful feature vectors can be obtained from visual observations. If properly regularised, these vectors can reflect the spatial structure of the state space and be used for planning. Using these vectors, we construct an auto-generated reward function with which working policies can be learned.
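One common hand-crafted way to obtain a constant observation space over a variable number of objects, of the kind discussed above, is to allocate a fixed number of slots plus a validity mask. This is a hypothetical sketch, not the thesis's design; the slot count and feature dimension are illustrative:

```python
def fixed_observation(blocks, max_blocks=8, feat_dim=3):
    """Map a variable-length list of per-block feature vectors to a
    constant-size observation: max_blocks slots of feat_dim features,
    followed by a 0/1 validity mask. Extra blocks are dropped; unused
    slots are zero-padded."""
    obs, mask = [], []
    for i in range(max_blocks):
        if i < len(blocks):
            obs.extend(blocks[i][:feat_dim])
            mask.append(1.0)
        else:
            obs.extend([0.0] * feat_dim)
            mask.append(0.0)
    return obs + mask  # length is always max_blocks * feat_dim + max_blocks

# The observation length is identical whether the scene has 1 block or 5.
obs_one = fixed_observation([[0.1, 0.2, 0.3]])
obs_five = fixed_observation([[0.1, 0.2, 0.3]] * 5)
```

The fixed length is what lets a single policy network be reused across trials with different numbers of blocks.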
  • Item
    Ontologies in neuroscience and their application in processing questions
    Eshghishargh, Aref ( 2019)
    Neuroscience is a vast, multi-dimensional and complex field of study, owing both to its medical importance and to unresolved issues regarding how the brain and the nervous system work. Its medical importance stems from the large number of brain disorders and their burden on people and society. Furthermore, scientists have been fascinated by the function and structure of the brain ever since it was discovered to be responsible for all our emotions, thoughts and behaviour. Ontologies are concepts whose origins go back to philosophy and its concern with the nature and relations of being. They have recently emerged as promising tools for assisting neuroscience research, providing additional data on a field of study. They connect each entity or element to others through descriptive relationships. Because of these characteristics, ontologies seem to suit the complex, multi-dimensional and still incomplete nature of neuroscience very well. The first study sheds light on the applications of ontologies in neuroscience. It incorporated a systematic literature review, methodically reviewing over 1000 research papers from eight databases and three journals. After scanning all documents, 208 were selected and a full-text analysis was performed on them. This study found eight major applications for ontologies in neuroscience, most of which consist of several subcategories. The analysis demonstrated not only the current applications of ontologies in neuroscience but also their potential future in this field. The second study set out to represent neuroscience questions and then classify them using ontologies. For this purpose, a question set was gathered from two research teams and analysed. This resulted in a set of dimensions that represent the questions. A question hierarchy was then formed based on these dimensions, and the questions were classified according to that hierarchy.
    Two different approaches were used for the classification: an ontology-based approach and a statistical approach. The ontology-based approach outperformed the statistical approach, yielding 15.73% better classification results. The last study was designed to tackle and resolve questions with the assistance of ontologies. It first proposed a set of templates that act as a translation mechanism, turning questions into machine-readable code. The templates were based on the question hierarchy presented in the previous study. Second, this study created an integrated collection of resources comprising two domain ontologies (NIFSTD and NeuroFMA) and a neuroimaging annotation application (FreeSurfer). Subsequently, the code created using the templates was executed over the integrated resource (knowledge base) to find the appropriate answer. While processing the questions, ontologies were also used for disambiguation. Finally, all parts created in this study, along with the question classification method from the previous study, were merged as modules of a question processing model. In conclusion, this thesis reviewed all current ontology applications in neuroscience in detail and demonstrated the extent to which they can assist scientists in classifying and resolving questions. The results show that applications of ontologies in neuroscience are diverse and cover a wide range; that they are steadily becoming more widely used in this field; and that they can be powerful semantic tools for performing different tasks in neuroscience.
  • Item
    Dauphin: A Programming Language for Statistical Signal Processing - from principles to practice
    Kyprianou, Ross ( 2018)
    This dissertation describes the design and implementation of a new programming language called Dauphin for the signal processing domain. Dauphin's focus is on the primitive concepts and algorithmic structures of signal processing. In this language, random variables and probability distributions are as fundamental and easy to use as the numeric types of other languages. The basic algorithms of signal processing --- estimation, detection, classification and so on --- become the standard function calls. Too much time is expended by researchers in re-writing these basic algorithms for each application. Dauphin allows you to code these algorithms directly, so they can be coded once and put into libraries for future use. Ultimately, Dauphin aims to extend the power of the researcher by allowing them to focus on the real problems and simplify the process of implementing their ideas. The first half of this dissertation describes Dauphin and the design issues of existing languages used for signal processing that motivated its development. It includes a general investigation into programming language design and the identification of specific design criteria that impact signal processing programming. These criteria directed the features in Dauphin that support writing signal processing algorithms. Of equal importance, the criteria also provide a means to compare, with some objectivity, the suitability of different languages for signal processing. Following the discussion on language design, Dauphin's features are described in detail, then details related to Dauphin's implementation are presented, including a description of Dauphin's semantics and type system. The second half of the dissertation presents practical applications of the Dauphin language, focussing on three broad areas associated with signal processing: classification, estimation and Monte Carlo methods. 
These non-trivial applications, combined with examples throughout the dissertation, demonstrate that Dauphin is simple and natural to use, easy to learn and has sufficient expressiveness for general programming in the signal processing domain.
  • Item
    The use of clinical decision support systems for the development of medical students’ diagnostic reasoning skills
    Khumrin, Piyapong ( 2018)
    Computer-aided learning systems (e-learning systems) can help medical students gain more experience with diagnostic reasoning and decision-making. Within this context, providing feedback that matches student needs (i.e. personalised feedback) is both critical and challenging. Prior research has shown that using a Clinical Decision Support System (CDSS) to assist doctors improves the effectiveness and efficiency of diagnostic and treatment processes. However, the application of CDSSs to the development of clinical reasoning in a clinical teaching environment is still limited. In this research, we developed a new diagnostic decision support system embedded in a learning tool called DrKnow. Students interact with twenty virtual patients, working through learning steps similar to bedside teaching to arrive at a final diagnosis. DrKnow's CDSS-based design monitors students' activities and provides personalised feedback to support their diagnostic decisions. We developed the expert knowledge within DrKnow based on machine learning models trained on 208 real-world clinical cases presenting with abdominal pain, to predict five diagnoses (appendicitis, gastroenteritis, urinary tract infection, ectopic pregnancy, and pelvic inflammatory disease). We assessed which of these models were likely to be most effective in terms of predictive accuracy and clinical appropriateness when model predictions were transformed into feedback. These models were leveraged to generate different kinds of feedback, provided both during the decision-making process (interim feedback) and at the end of each scenario (final feedback), based on the specific information students requested from the virtual patients and their active diagnostic hypotheses. Students used the tool to explore one or more common clinical presentations, assessing patient histories, selecting and evaluating appropriate investigations, and integrating these findings to select the most appropriate diagnosis.
    Based on the clinical information they request and prioritise, DrKnow presents key decision points and suggests three provisional diagnoses as they work through the virtual cases. Once students make a final diagnosis, DrKnow presents them with information about their overall diagnostic performance as well as recommendations for diagnosing similar cases. An analysis comparing students' decisions with those of DrKnow shows that DrKnow provided appropriate feedback, supporting students in selecting suitable differential diagnoses, and assessed students' diagnostic performance effectively. Although DrKnow still has some limitations, we argue that CDSS-based learning support for the development of diagnostic reasoning skills, as embodied by DrKnow, enables an effective learning process with positive student learning outcomes, while overcoming the resource challenges of bedside teaching supported by expert clinicians.
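DrKnow's actual models and clinical features are not reproduced here. The sketch below only illustrates, with invented likelihood values, how interim feedback of the kind described could re-rank provisional diagnoses as a student requests more findings (a naive-Bayes-style score update; the diagnosis and finding names are placeholders, not clinical guidance):

```python
import math

# Illustrative P(finding | diagnosis) values; entirely invented.
LIKELIHOODS = {
    "appendicitis":    {"rlq_pain": 0.8, "fever": 0.6},
    "gastroenteritis": {"rlq_pain": 0.2, "fever": 0.5},
    "uti":             {"rlq_pain": 0.1, "fever": 0.4},
}

def rank_diagnoses(findings, priors=None):
    """Rank diagnoses by log-posterior given the findings observed so far."""
    scores = {}
    for dx, lik in LIKELIHOODS.items():
        prior = (priors or {}).get(dx, 1.0 / len(LIKELIHOODS))
        score = math.log(prior)
        for f in findings:
            score += math.log(lik.get(f, 0.05))  # small default for unmodelled findings
        scores[dx] = score
    return sorted(scores, key=scores.get, reverse=True)

# Each time the student requests another finding, the provisional
# ranking shown as interim feedback is re-computed.
ranking = rank_diagnoses(["rlq_pain"])
```

The point of the sketch is the interaction pattern, not the model: interim feedback is a function of the findings requested so far.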
  • Item
    Machine learning with adversarial perturbations and noisy labels
    Ma, Xingjun ( 2018)
    Machine learning models such as traditional random forests (RFs) and modern deep neural networks (DNNs) have been used successfully to solve complex learning problems in many applications, such as speech recognition, image classification, face recognition, gaming agents and self-driving cars. For example, DNNs have demonstrated near or even above human-level performance in image classification tasks. Despite this success, these models are still vulnerable to noisy real-world situations where illegitimate or noisy data may corrupt learning. Studies have shown that by adding small, human-imperceptible (in the case of images) adversarial perturbations, normal samples can be perturbed into "adversarial examples", and DNNs can be made to misclassify adversarial examples with a high level of confidence. This raises security concerns when employing DNNs in security-sensitive applications such as fingerprint recognition, face verification and autonomous cars. Studies have also found that DNNs can overfit noisy (incorrect) labels and, as a result, generalize poorly. This has been one of the key challenges in applying DNNs to noisy real-world scenarios, where even high-quality datasets tend to contain noisy labels. Another open question in machine learning is whether actionable knowledge (or "feedback") can be generated from prediction models to support decision making towards long-term learning goals (for example, mastering a certain type of skill in a simulation-based learning (SBL) environment). We view the feedback generation problem from the new perspective of adversarial perturbation, and explore the possibility of using adversarial techniques to generate feedback. In this thesis, we investigate machine learning models, including DNNs and RFs, and their learning behavior through the lens of adversarial perturbations and noisy labels, with the aim of achieving more secure and robust machine learning.
    We also explore the possibility of using adversarial techniques in a real-world application: supporting skill acquisition in SBL environments through the provision of performance feedback. The first part of our work investigates DNNs and their vulnerability to adversarial perturbations in the context of image classification. In contrast to existing work, we develop new understandings of adversarial perturbations by exploring the DNN representation space with the Local Intrinsic Dimensionality (LID) measure. In particular, we characterize adversarial subspaces in the vicinity of adversarial examples using LID, and find that adversarial subspaces are of higher intrinsic dimensionality than normal data subspaces. We not only provide a theoretical explanation of the high dimensionality of adversarial subspaces, but also empirically demonstrate that this property can be used to effectively discriminate adversarial examples generated using state-of-the-art attack methods. The second part of our work explores the possibility of using adversarial techniques beneficially, to generate interactive feedback for intelligent tutoring in SBL environments. Feedback consists of actions (in the form of feature changes) generated from a pre-trained prediction model that can be delivered to a learner in an SBL environment to correct mistakes or improve skills. We demonstrate that such feedback can be generated accurately and efficiently using properly constrained adversarial techniques with DNNs. In addition to DNNs, we also explore, in the third part of our work, adversarial feedback generation from RF models. Adversarial perturbations can be easily generated from DNNs using gradient descent and backpropagation; however, it is still an open question whether such perturbations can be generated from models such as RFs that do not work with gradients.
    This part of our work confirms that adversarial perturbations can also be crafted from RFs for the provision of feedback in SBL. In particular, we propose a perturbation method that can find the optimal space transition from an undesired class (e.g. 'novice') to the desired class (e.g. 'expert'), based on a geometric view of the RF decision space as overlapping high-dimensional rectangles. We demonstrate empirically that our proposed method is both more effective and more efficient than existing methods, making it suitable for real-time feedback generation in SBL. The fourth part of our work focuses on DNNs and noisy label learning: training accurate DNNs on data with noisy labels. Here, we investigate the learning behaviours of DNNs and show that they exhibit two distinct learning styles when trained on clean versus noisy labels. An LID-based characterization of the intrinsic dimensionality of the DNN representation space (inspired by the first part of our work) allows us to identify two stages of learning on datasets with noisy labels: dimensionality compression followed by dimensionality expansion. Based on the observation that dimensionality expansion is associated with overfitting to noisy labels, we further propose a heuristic learning strategy that avoids the later stage of dimensionality expansion, so as to robustly train DNNs in the presence of noisy labels. In summary, this work contributes to existing knowledge through: a novel dimensional characterization of DNNs, effective discrimination of adversarial attacks, robust deep learning strategies against noisy labels, and novel approaches to feedback generation. All work is supported by theoretical analysis, empirical results and publications.
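The LID measure that runs through this work is commonly computed with the maximum-likelihood estimator over a point's k-nearest-neighbour distances, LID ≈ -(1/k · Σᵢ ln(rᵢ / rₖ))⁻¹, where rₖ is the distance to the k-th (farthest) neighbour. A minimal sketch of that standard estimator (not the thesis's full detection pipeline):

```python
import math

def lid_mle(distances):
    """Maximum-likelihood LID estimate from the distances between a point
    and its k nearest neighbours: -k / sum(log(r_i / r_max)).
    Terms with r_i == r_max contribute log(1) = 0 and are skipped."""
    r_max = max(distances)
    logs = [math.log(r / r_max) for r in distances if r < r_max]
    if not logs:
        return float("inf")
    return -len(distances) / sum(logs)
```

Intuitively, neighbour distances that grow like (i/k)^(1/d) yield an estimate near d, so a higher LID in the vicinity of a sample signals a locally higher-dimensional (e.g. adversarial) subspace.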
  • Item
    Towards highly accurate publication information extraction from academic homepages
    Zhang, Yiqing ( 2018)
    More and more researchers list their research profiles on academic homepages. Publications on a researcher's academic homepage contain rich information, such as the researcher's fields of expertise, research interests, and collaboration network. Extracting publication information from academic homepages is an essential step in automatic profile analysis, which enables applications such as academic search, bibliometrics and citation analysis. The publications extracted from academic homepages can also be a supplementary source for bibliographic databases. We investigate two publication extraction problems in this thesis: (i) Given an academic homepage, how can we precisely extract all the individual publication strings from the homepage? Here, a publication string is a text string that describes a publication record. We call this problem publication string extraction. (ii) Given a publication string, how can we extract its different fields, such as the authors, title, and venue? We call this problem publication field extraction. There are two types of traditional approaches to these problems: rule-based approaches and machine learning based approaches. Rule-based approaches cannot accommodate the large variety of styles found on homepages, and they require significant effort in rule design. Machine learning based approaches rely on a large amount of high-quality training data as well as suitable model structures. To tackle these challenges, we first collect two datasets and annotate them manually. We propose a training data enhancement method to generate large sets of semi-real data for training our models. For the publication string extraction problem, we propose a PubSE model that models the structure of a publication list at both the line level and the webpage level.
    For the publication field extraction problem, we propose an Adaptive Bi-LSTM-CRF model that makes full use of both the generated and the manually labeled training data. Extensive experimental results show that the proposed methods outperform the state-of-the-art methods on the publication extraction problems studied.
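For contrast with the learned model, a rule-based field extractor of the kind the thesis argues cannot cope with stylistic variety might look like the following sketch. The citation pattern is hypothetical and deliberately brittle: it only matches one "Authors. Title. Venue, Year." style.

```python
import re

# One fixed style: "Authors. Title. Venue, Year." Anything else fails,
# which is exactly the weakness of rule-based extraction.
PATTERN = re.compile(
    r"^(?P<authors>[^.]+)\.\s+"
    r"(?P<title>[^.]+)\.\s+"
    r"(?P<venue>[^,]+),\s+"
    r"(?P<year>\d{4})\.?$"
)

def extract_fields(pub_string):
    """Return a dict of fields, or None if the string doesn't fit the rule."""
    m = PATTERN.match(pub_string.strip())
    return m.groupdict() if m else None

fields = extract_fields("Smith J, Jones A. A Study of Things. VLDB, 2017.")
```

A differently punctuated string (e.g. year in parentheses, or a title containing a period) returns None, motivating the sequence-labelling Bi-LSTM-CRF approach.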
  • Item
    Indoor localization supported by landmark graph and locomotion activity recognition
    Gu, Fuqiang ( 2018)
    Indoor localization is important for a variety of applications such as location-based services, mobile social networks, and emergency response. Although a number of indoor localization systems have been proposed in recent years, they have various limitations in terms of accuracy, cost, coverage, complexity, and applicability. To achieve higher accuracy at relatively low cost, hybrid methods combining multiple positioning techniques have been used. However, hybrid methods usually require an infrastructure of beacons or transmitters, which may be unavailable in many environments or available only at high cost. Spatial knowledge, by contrast, is available in many scenarios and can be used to assist localization at no additional cost. Landmarks are one type of spatial constraint useful for indoor localization. Indoor localization systems that use landmarks have been proposed in the literature, but they are usually applied to tracking robots using laser scanners and/or cameras. Systems using these devices are economically and/or computationally expensive, and hence are not suitable for indoor pedestrian localization. Although landmarks detected with built-in smartphone sensors are also used in some indoor localization systems, the performance of these systems relies heavily on the completeness of the landmarks. A mismatch of landmarks may cause a large localization error and even lead to localization failure. The advent of sensor-equipped smart devices has enabled a variety of activity recognition and inference tasks, including locomotion recognition (e.g., walking, running, standing). The sensors built into smart devices can capture the intensity and duration of an activity, and are even able to sense the activity context. Such information can be used to enhance localization accuracy, or to reduce energy consumption and deployment cost while maintaining accuracy.
    For example, knowledge of locomotion activities can be used to optimize the estimation of a person's step length, which in turn improves localization accuracy. However, it is challenging to precisely recognize activities related to indoor localization with smartphones. The hypothesis of this research is that accurate and reliable indoor localization can be achieved by fusing smartphone sensor data with locomotion activities and a landmark graph. This hypothesis is tested using the novel algorithms proposed and developed in this research. The proposed framework consists of four main phases: recognizing locomotion activities related to indoor localization from sensor data, improving the accuracy of step counting and step length estimation for the pedestrian dead reckoning method, developing a landmark graph-based indoor localization method, and implementing quick WiFi fingerprint collection. The main contributions of this research are as follows. First, a novel method is proposed for locomotion activity recognition that automatically learns useful features from sensor data using a deep learning model. Second, robust and accurate algorithms are proposed for step counting and step length estimation to improve the performance of pedestrian dead reckoning, which is then fused with spatial information. Third, the concepts of sensory landmarks and the landmark graph are proposed, and a landmark graph-based method is developed for indoor localization. Fourth, a practical, fast, and reliable fingerprint collection method is designed, which uses the landmark graph-based localization method to automatically estimate the locations of the reference points used to associate the collected fingerprints.
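Step length estimation in pedestrian dead reckoning is often built on the Weinberg model, SL ≈ K · (a_max − a_min)^(1/4), where a_max and a_min are the extreme vertical accelerations over one step and K is a calibration constant. A sketch of how activity recognition can feed into this (the constants are illustrative assumptions; the thesis's actual algorithms may differ):

```python
def weinberg_step_length(acc_window, k=0.46):
    """Weinberg step-length model: K * (a_max - a_min) ** 0.25, where
    acc_window holds vertical acceleration samples over one detected step
    and K is a per-user calibration constant."""
    return k * (max(acc_window) - min(acc_window)) ** 0.25

# Recognized locomotion activity selects a different calibration constant,
# which is one way activity knowledge improves dead-reckoning accuracy.
K_BY_ACTIVITY = {"walking": 0.46, "running": 0.60}  # illustrative values

def step_length(acc_window, activity):
    return weinberg_step_length(acc_window, K_BY_ACTIVITY[activity])
```

Each estimated step length, combined with a heading estimate, advances the dead-reckoned position; landmark-graph matching then corrects the accumulated drift.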
  • Item
    Analysing the interplay of location, language and links utilising geotagged Twitter content
    Rahimi, Afshin ( 2018)
    Language use and interactions on social media are geographically biased. In this work, we exploit this bias in predictive models of user geolocation and lexical dialectology. User geolocation is an important component of applications such as personalised search and recommendation systems. We propose text-based and network-based geolocation models and compare them over benchmark datasets, achieving state-of-the-art performance. We also propose hybrid and joint text-and-network geolocation models that improve upon text-only or network-only models, and show that the joint models achieve reasonable performance in minimal supervision scenarios, as often occur in real-world datasets. Finally, we propose the use of continuous representations of location, which enables regression modelling of geolocation and lexical dialectology. We show that our proposed data-driven lexical dialectology model provides qualitative insights into geographical lexical variation.
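The network-based idea can be illustrated with a toy label-propagation sketch: an unlabelled user's location is iteratively estimated as the mean of their neighbours' current estimates over the social graph. This is a deliberately simplified stand-in for the thesis's models, with invented coordinates:

```python
def propagate_locations(edges, known, iterations=20):
    """Estimate unknown users' (lat, lon) by repeatedly averaging their
    neighbours' current estimates; users with known locations stay fixed."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)
    est = dict(known)
    for _ in range(iterations):
        for u in neighbours:
            if u in known:
                continue
            located = [est[v] for v in neighbours[u] if v in est]
            if located:
                est[u] = (sum(p[0] for p in located) / len(located),
                          sum(p[1] for p in located) / len(located))
    return est

# "b" has no location of its own, but both neighbours do, so it settles midway.
est = propagate_locations(
    edges=[("a", "b"), ("b", "c")],
    known={"a": (0.0, 0.0), "c": (10.0, 10.0)},
)
```

Joint models of the kind proposed in the thesis combine such network signals with text-based predictions, which is what keeps performance reasonable when few users have known locations.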