Computing and Information Systems - Theses

Permanent URI for this collection

Search Results

Now showing 1 - 2 of 2
  • Item
    Thumbnail Image
    Learning Spatial Indices Efficiently
    Liu, Guanli ( 2023-06)
    Machine learning and database management systems have been extensively researched for many years. A database management system, while supporting data persistence and stable queries, can often encounter efficiency issues when building indices for querying big data. In recent years, machine learning is gradually used in databases to solve efficiency issues such as knob tuning, index selection, cost estimation, and index building. Most problems are solved by using regression models and reinforcement learning. Here, using machine learning techniques for index building leads to a new type of index structures named the learned index, which has shown better query performance than traditional indices, e.g., B-trees and R-trees. However, the efficiency of building learned indices is still a key challenge, especially when facing large-scale datasets. The reason is that index building is based on the whole dataset, which requires a full dataset scan in each epoch, and there is at least one epoch for building a learned index. Existing learned indices suffer from this issue, which hinders the application of indices in the database management system. In this thesis, we address this efficiency issue for index learning over a special type of data, the spatial data, i.e., data associated with geographical location information, the volume of which is rapidly growing due to the prevalence of smart mobile devices, the Internet of Things, and 5G networks. We study four research problems on effective and efficient spatial data indexing using machine learning techniques. The first problem is an empirical study of two learned spatial indices RSMI and ZM, which support point, range, and kNN queries. These indices have reported better query performance than a traditional spatial index, the R-trees. However, there is no open-source code or extensive basis that supports testing and evaluating these learned indices against other traditional spatial indices for large real-world datasets. We address such an issue by offering an implementation of these learned indices. Based on the implementation, we present a thorough empirical analysis of the advantages and disadvantages of learned spatial indices, highlighting their significant build times and motivating our studies to optimize the build time efficiency of learned spatial indices. In the second research problem, we address the efficiency issue in learning spatial indices by proposing an index learning framework called ELSI. The key idea of ELSI is that learning from a smaller dataset can derive similar query performance to that using the full input dataset. Experiments on real datasets with over 100 million points show that ELSI can reduce the build times of four different learned spatial indices consistently and by up to two orders of magnitude without jeopardizing query efficiency. In the third research problem, we propose to pre-train index models offline and only fine-tune them online for index learning to accelerate the building of learned (spatial) indices. The results show that we improve the build time of learned one-dimensional indices by 30.4\% and improve lookup efficiency by up to 24.4\% on real datasets and 22.5\% on skewed synthetic datasets. When this technique is applied to spatial data, it speeds up learned spatial index building by two orders of magnitude, while the lookup efficiency can also be increased by up to 13\%. While learned spatial indices have shown strong query performance, their structure and query algorithms differ drastically from the traditional indices, which are well supported by off-the-shelf database systems. To reduce the additional overhead of replacing traditional indices with learned spatial indices, in the fourth research problem, we study applying learning-based techniques to optimize the structure of traditional spatial indices. We focus on indices based on space-filling curves (SFC). SFCs are a classic technique to transform multidimensional (e.g., spatial) data into one-dimensional values for data indexing. The choice of SFC for indexing over a particular dataset and query workload has a significant impact on the query performance of the resultant index. We propose algorithms that enable estimating the query performance of an SFC-based index without actually building the index, thus enabling efficient computing of a query-optimized SFC-based index. Experiments show that our cost estimation algorithms are over an order of magnitude faster than naive methods. The computed SFC indices outperform competing indices based on two classic types of SFCs, i.e., Z-curves and Hilbert curves, in nearly all settings considered.
  • Item
    Thumbnail Image
    Learned Hashmap for Spatial Queries
    Haozhan, Shi ( 2020)
    Spatial indexes such as R-Tree are widely used for managing spatial objects data efficiently, which is influenced by the popular one-dimensional range index B-Tree. Research has suggested that applying machine learning techniques such as linear regression or a neural network can improve the performance of traditional data structures. However, most studies are focused on tuning recursive learned models or learning a different ordering of data items. Many of them cannot guarantee query accuracy as in the traditional methods. This study investigates a different approach to the learned spatial index, namely Learned Spatial Hashmap (LSPH), which combines the learned model and hashmap. It only requires values from one of the data dimensions to build. Results from experiments on both synthetic and real-world datasets show that our approach significantly reduces the query processing time and maintains 100% accuracy, which is more efficient than traditional spatial indexes and more robust than recently proposed learned spatial indexes.