HBase is an open-source distributed NoSQL database. It is widely used in industry - Spotify, Adobe, Bloomberg, Airbnb, etc. - to store large amounts of data in a fault-tolerant way while providing high read/write throughput.
HBase, like most other databases, needs an ‘index’ to make searching for an entry within the database fast. The index structure greatly influences the database’s performance.
Traditionally, a B-tree is used to store this index. But this paper proposed a smart idea: use a ‘learned index’ instead of a B-tree.
Consider this small key-value dataset you want to store in your database. (Note: it is sorted by key.)
1. 23:'value1-veryLong-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
2. 34:'value2-veryLong-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
3. 56:'value3-veryLong-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
4. 78:'value4-veryLong-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
5. 90:'value5-veryLong-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
Now, if you have to retrieve the value for a key (if it exists), say 78, you need an index to tell you at which position your key sits in this dataset (position 4 here). A B-tree would perform a binary-search-style traversal, taking O(log n) time.
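To make that O(log n) baseline concrete, here is a minimal Python sketch (my illustration, not HBase’s actual Java internals) of a binary-search-style lookup over the sorted keys, with the values shortened for brevity:

```python
# Minimal sketch: a B-tree-style lookup boils down to a binary search
# over the sorted keys (O(log n) comparisons).
from bisect import bisect_left

keys = [23, 34, 56, 78, 90]                                   # sorted keys from the dataset above
values = ['value1', 'value2', 'value3', 'value4', 'value5']   # values shortened for brevity

def btree_style_lookup(key):
    """Return the value for `key`, or None if the key is absent."""
    i = bisect_left(keys, key)              # binary search: O(log n)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

print(btree_style_lookup(78))  # -> 'value4' (the 4th entry, index 3 in 0-based terms)
```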
But what if I learned an index function f(), such that f(key) = position, i.e. f(78) = 4, f(90) = 5, and so on? Then you can find the key in O(1)!
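Here is a minimal sketch of that idea on the same toy dataset. The simple linear model and the error-bounded local search below are my illustration of the general technique, not the specific model the paper uses:

```python
# Minimal sketch of a learned index: fit f(key) ~= position over the sorted
# keys, then correct the prediction with a small local search bounded by the
# model's maximum error observed at build time.
keys = [23, 34, 56, 78, 90]   # sorted keys; position i holds keys[i]

# Least-squares fit of position = a*key + b (plain Python, no libraries).
n = len(keys)
mean_k = sum(keys) / n
mean_p = (n - 1) / 2
a = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys)) / \
    sum((k - mean_k) ** 2 for k in keys)
b = mean_p - a * mean_k

# The worst prediction error bounds how far the local search may need to go.
max_err = max(abs(round(a * k + b) - p) for p, k in enumerate(keys))

def learned_lookup(key):
    """Predict the position, then check at most (2*max_err + 1) slots."""
    guess = round(a * key + b)
    lo = max(0, guess - max_err)
    hi = min(n - 1, guess + max_err)
    for i in range(lo, hi + 1):
        if keys[i] == key:
            return i          # 0-based position of the key
    return None

print(learned_lookup(78))     # -> 3 (i.e., the 4th entry)
```

On this tiny dataset the linear fit happens to predict every position exactly (max_err is 0), so each lookup touches a single slot; on real data the model carries a small error bound and the final local search stays within it.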
That’s the idea of a ‘learned index’. Yes, yes, a lot more details are needed to make this work!
So a learned index makes index lookups fast. This is great if your entire DB is in memory. For disk-based DBs - which is practically any large-scale DB, like HBase - index traversal is only a small part of the overall latency, so the speed improvements from a learned index alone are minimal. But this paper from Google showed that there are interesting secondary effects of using a learned index that ultimately give performance improvements! What are they? Well, that’s the bait 😛. Read the paper and watch this talk.