Authors
Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, Tim Kraska
Massachusetts Institute of Technology; Microsoft Research; Arizona State University; Georgia Institute of Technology
Portals
Abstract
Recent work on "learned indexes" has changed the way we look at the decades-old field of DBMS indexing. The key idea is that indexes can be thought of as "models" that predict the position of a key in a dataset. Indexes can, thus, be learned. The original work by Kraska et al. shows that a learned index beats a B+Tree by a factor of up to three in search time and by an order of magnitude in memory footprint. However, it is limited to static, read-only workloads. In this paper, we present a new learned index called ALEX which addresses practical issues that arise when implementing learned indexes for workloads that contain a mix of point lookups, short range queries, inserts, updates, and deletes. ALEX effectively combines the core insights from learned indexes with proven storage and indexing techniques to achieve high performance and low memory footprint. On read-only workloads, ALEX beats the learned index from Kraska et al. by up to 2.2X on performance with up to 15X smaller index size. Across the spectrum of read-write workloads, ALEX beats B+Trees by up to 4.1X while never performing worse, with up to 2000X smaller index size. We believe ALEX presents a key step towards making learned indexes practical for a broader class of database workloads with dynamic updates.
Contribution
- Storage layout optimized for models: Similar to a B+Tree, ALEX builds a tree, but allows different nodes to grow and shrink at different rates. To store records in a data node, ALEX uses an array with gaps, a Gapped Array, which (1) amortizes the cost of shi?ing the keys for each insertion because gaps can absorb inserts, and (2) allows more accurate placement of data using model-based inserts to ensure that records are located closely to the predicted position when possible. For efficient search, gaps are actually ?lled with adjacent keys
- Search strategy optimized for models: ALEX exploits model-based inserts combined with exponential search starting from the predicted position. This always beats binary search when models are accurate
- Keeping models accurate with dynamic data distributions and workloads: ALEX provides robust performance even when the data distribution is skewed or dynamically changes after index initialization. ALEX achieves this by exploiting adaptive expansion, and node splitting mechanisms, paired with selective model retraining, which is triggered by intelligent policies based on simple cost models. Our cost models take the actual workload into account and thus can effectively respond to dynamic changes in the workload. ALEX achieves all the above benefits without needing to hand-tune parameters for each dataset or workload
- Detailed evaluation: We present the results of an extensive experimental analysis with real-life datasets and varying read-write workloads and compare against state-of-the-art indexes that support range queries.