Computing and Information Systems - Theses

Permanent URI for this collection

Search Results

Now showing 1 - 3 of 3
  • Item
    Thumbnail Image
    Learning Spatial Indices Efficiently
    Liu, Guanli ( 2023-06)
    Machine learning and database management systems have been extensively researched for many years. A database management system, while supporting data persistence and stable queries, can often encounter efficiency issues when building indices for querying big data. In recent years, machine learning is gradually used in databases to solve efficiency issues such as knob tuning, index selection, cost estimation, and index building. Most problems are solved by using regression models and reinforcement learning. Here, using machine learning techniques for index building leads to a new type of index structures named the learned index, which has shown better query performance than traditional indices, e.g., B-trees and R-trees. However, the efficiency of building learned indices is still a key challenge, especially when facing large-scale datasets. The reason is that index building is based on the whole dataset, which requires a full dataset scan in each epoch, and there is at least one epoch for building a learned index. Existing learned indices suffer from this issue, which hinders the application of indices in the database management system. In this thesis, we address this efficiency issue for index learning over a special type of data, the spatial data, i.e., data associated with geographical location information, the volume of which is rapidly growing due to the prevalence of smart mobile devices, the Internet of Things, and 5G networks. We study four research problems on effective and efficient spatial data indexing using machine learning techniques. The first problem is an empirical study of two learned spatial indices RSMI and ZM, which support point, range, and kNN queries. These indices have reported better query performance than a traditional spatial index, the R-trees. However, there is no open-source code or extensive basis that supports testing and evaluating these learned indices against other traditional spatial indices for large real-world datasets. We address such an issue by offering an implementation of these learned indices. Based on the implementation, we present a thorough empirical analysis of the advantages and disadvantages of learned spatial indices, highlighting their significant build times and motivating our studies to optimize the build time efficiency of learned spatial indices. In the second research problem, we address the efficiency issue in learning spatial indices by proposing an index learning framework called ELSI. The key idea of ELSI is that learning from a smaller dataset can derive similar query performance to that using the full input dataset. Experiments on real datasets with over 100 million points show that ELSI can reduce the build times of four different learned spatial indices consistently and by up to two orders of magnitude without jeopardizing query efficiency. In the third research problem, we propose to pre-train index models offline and only fine-tune them online for index learning to accelerate the building of learned (spatial) indices. The results show that we improve the build time of learned one-dimensional indices by 30.4\% and improve lookup efficiency by up to 24.4\% on real datasets and 22.5\% on skewed synthetic datasets. When this technique is applied to spatial data, it speeds up learned spatial index building by two orders of magnitude, while the lookup efficiency can also be increased by up to 13\%. While learned spatial indices have shown strong query performance, their structure and query algorithms differ drastically from the traditional indices, which are well supported by off-the-shelf database systems. To reduce the additional overhead of replacing traditional indices with learned spatial indices, in the fourth research problem, we study applying learning-based techniques to optimize the structure of traditional spatial indices. We focus on indices based on space-filling curves (SFC). SFCs are a classic technique to transform multidimensional (e.g., spatial) data into one-dimensional values for data indexing. The choice of SFC for indexing over a particular dataset and query workload has a significant impact on the query performance of the resultant index. We propose algorithms that enable estimating the query performance of an SFC-based index without actually building the index, thus enabling efficient computing of a query-optimized SFC-based index. Experiments show that our cost estimation algorithms are over an order of magnitude faster than naive methods. The computed SFC indices outperform competing indices based on two classic types of SFCs, i.e., Z-curves and Hilbert curves, in nearly all settings considered.
  • Item
    Thumbnail Image
    Towards Robust Medical Machine Learning
    He, Jiabo ( 2022)
    Machine learning systems have been developed to address a number of problems in numerous domains, among which medical solutions have been facilitated by machine learning approaches for decades. These approaches play important roles in automated disease diagnosis, medical image processing, and auxiliary surgical operation, etc. Despite the highly efficient diagnosis benefited from machine learning approaches, these methods may not be robust to common challenges in practical scenarios, such as special while crucial characteristics of medical data, annotation variations from multiple experts, noisy annotations, and multi-source datasets. Such problems impede machine learning methods from being applied accurately and safely to medical tasks. In the thesis, we introduce special while important medical problems that were not brought into the spotlight before. We then provide corresponding robust machine learning solutions for each problem when existing machine learning methods degrade significantly in these tasks. Specifically, the first problem is the similarity analysis for time series with large discontinuities, which is common in surgical time series. We thus propose a robust distance measurement for time series with large discontinuities when they disable the accurate measurement of local characteristics using existing algorithms. Second, surgical policies provided by different surgeons for the same patient/surgery may not be exactly the same. We then propose the reward-penalty Dice loss (RPDL) to learn non-unique surgical segmentation regions for deep vision networks. RPDL is robust to varying annotations for the same input, which enables the comprehensive learning of models from multiple experts. Third, medical datasets might be composed of limited examples and noisy annotations, making it challenging to train deep learning models. To address this challenge, we propose alpha-IoU, a family of power Intersection over Union (IoU) losses for bounding box (bbox) regression. We show that alpha-IoU losses are more robust to small datasets and noisy bboxes in lesion detection. Fourth, large-scale medical datasets are often collected from different institutions cooperated by a number of experts. In this case, we build a one-stage framework SpineOne for detecting degenerative discs and vertebrae from spinal MRIs, which implements both the keypoint localization and classification tasks simultaneously. SpineOne is a robust detector to multi-source MRI slices with various scales, numbers and quality. All four proposed machine learning approaches outperform existing baselines by a noticeable margin in specific medical tasks. In summary, four medical issues are thoroughly investigated in the thesis, i.e., the distance measurement for surgical time series with large discontinuities, the surgical region segmentation with a variety of clinician annotations, the lesion detection with limited examples and noisy bboxes, and the anatomical keypoint detection with multi-source medical data. Towards more robust medical machine learning, we then propose one robust machine learning approach for each corresponding problem.
  • Item
    Thumbnail Image
    Learning to generalise through features
    Grebenyuk, Dmitry ( 2020)
    A Markov decision process (MDP) cannot be used for learning end-to-end control policies in Reinforcement Learning when the dimension of the feature vectors changes from one trial to the next. For example, this difference is present in an environment where the number of blocks to manipulate can vary. Because we cannot learn a different policy for each number of blocks, we suggest framing the problem as a POMDP instead of the MDP. It allows us to construct a constant observation space for a dynamic state space. There are two ways we can achieve such construction. First, we can design a hand-crafted set of observations for a particular problem. However, that set cannot be readily transferred to another problem, and it often requires domain-dependent knowledge. On the other hand, a set of observations can be deduced from visual observations. This approach is universal, and it allows us to easily incorporate the geometry of the problem into the observations, which can be challenging to hard-code in the former method. In this Thesis, we examine both of these methods. Our goal is to learn policies that can be generalised to new tasks. First, we show that a more general observation space can improve the performance of policies tested in untrained tasks. Second, we show that meaningful feature vectors can be obtained from visual observations. If properly regularised, these vectors can reflect the spacial structure of the state space and used for planning. Using these vectors, we construct an auto-generated reward function, able to learn working policies.