Learned about database by implementing B+Tree in Kotlin
tl; dr — I just implemented a very simple B+Tree in Kotlin to learn about inside database. It is only ~960 lines of code.
Motivation
I’m always curious about various areas of software. One last example was “language processing system” and I implemented it to learn. I’m so fool that I can only understand software when I implement it by myself.
My next target is database since it is required to handle state on file system. and this is a new area that I’d like to learn. About 10 years ago, I was a DBA for MySQL database, so I’m familiar with the basic idea of database, especially InnoDB’s clustered index. But it was just knowledge from books and articles and my practical experience. I have no background around database science, so I didn’t know what B+Tree is.
6 years ago, I just tried to implement B+Tree inspired by Jeremy Cole’s “B+Tree index structures in InnoDB”. But I was not matured to develop such complex computer science-ish software at that time, so I gave up.
Now, since I just became a software developer last year, I felt comfortable to try it again. Also, I wanted to try Kotlin just for curious. So, I decided to run my project — “B+Tree implementation in Kotlin” just one month ago.
What B+Tree is?
First, I had to understand the algorithm and data structure of B+Tree. Connor Stack’s “Introduction to the B-Tree” is a great article to know it, especially the difference between B-Tree and B+Tree.
In B+Tree, Internal nodes only stores key to point the corresponding Leaf node. All table data are stored inside Leaf node. In the article, the key is chosen from the maximum key in the left-hand Leaf node, but I used minimum key instead since InnoDB does so.
B+Tree’s one of main algorithms is “split”. When Leaf node is full, B+Tree splits the node, then links the new node with the parent. It could cause page full on the parent as well, then B+Tree also split the parent, and so on.
However, I didn’t get the idea of actual “split” mechanism right away. Marco Tusa’s “InnoDB Page Merging and Page Splitting” is another key article for my project then. According to the post, InnoDB tries to split a page at around the middle of size. Aha, that seems very clear and it also implies that the reason of InnoDB’s record size limitation (~1/2 of page size).
File format — Apache Avro
I did want to implement B+Tree with file interaction. Because B+Tree is intended to be used as a huge database which can’t fit in memory, file synchronization is one of key aspects of B+Tree, I think.
Since InnoDB’s file format is too complex for my simple project, I’d have liked to introduce a file format for this project. However, SerDe implementation of a custom format was not my target of this project and it could be buggy. So, I looked around a few serialization formats, such as Protocol Buffer and Apache Avro.
I picked Avro since it has unique features I wanted:
- Fixed size type
- Sortable without deserialization
- Easy to use in Kotlin
Here is the file format of Page:
{
"name": "PageData",
"type": "record",
"fields": [
{"name": "id", "type": "PageId"},
{"name": "nodeType", "type": "NodeType"},
{"name": "previousId", "type": "PageId"},
{"name": "nextId", "type": "PageId"},
{"name": "records", "type": {"type": "array", "items": "KeyValue"} }
]
}
Thanks to Avro, I didn’t have to spend my time to implement file format related stuff and was able to focus on B+Tree implementation. During this project, I’ve also learned a lot about Avro. I’d use Avro for my future project, too.
Implementation
It was back and forth / scrap and build process, but essentially I did the order below. Also, description of each class is available in README.
Page
I implemented Page class first. It uses Avro as a backend (in-memory) and supports mutational operations (insert/update/delete) only. It exposes records as a List, so Nodes can search them as they want. Page doesn’t care about table schema; it just treat key and value as a pair of ByteBuffer.
(Leaf)Node
This class has a Page as a backend and supports basic database operations (get/put/delete). Those operations rely on that Page’s records are ordered by key and always search from the minimum. It costs O(N), but N is just the number of records in a Page, not the total records of a table. InnoDB does the same, but uses linked list.
InternalNode
InternalNode is a child class of LeafNode and stores each child Node’s minimum key and Page id. To make algorithm always correct, I use logical minimum key (0 bytes) to store the most left-hand side Nodes. (It happens only when splitting RootNode)
RootNode
RootNode is a child class of InternalNode. Like InnoDB, RootNode is special since its Page id is fixed. To split it, I create a new Page and copy all records there. Then, split the Page and connects those child Pages to the RootNode.
PageManager
This class manages lifecycle of Pages. When get is called, it tries to fetch from cache (so far just Map
) and reads from file if cache missed. It also allocates a new Page and handle actual Page split logic.
FileManager
This class is for managing file read and write. The file has fixed size (128 bytes) metadata at the beginning, then each Page (256 bytes) is stored ordered by Page id.
Tree
Using all classes above, this class can operate a B+Tree.
Table, AvroGenericRecord, Main
These classes are for use case of my implementation. Table class represents a table with key and value schema (Avro), AvroGenericRecord is a useful class for GenericRecord and Main is for sample CLI.
Conclusion
I worked on this project for a month and achieved its goal; I understood basic internal detail of database software — B+Tree. Of course, my implementation is missing lots of features which are required as a database, e.g. merge, WAL, MVCC, but now I have many ideas how to implement them. So, that was a good timing to complete the project.
I hope this project could help future developers who also want to understand B+Tree. This code is a working example and pretty small (960 LOC).