Building an LSM-Tree Storage Engine from Scratch
This article will guide you through understanding the Log-Structured Merge-Tree (LSM-Tree), including its core concepts and structure. By the end, you'll be able to build your own storage engine based on LSM-Tree from scratch.
The Log-Structured Merge-Tree (LSM-Tree) is a data structure optimized for high-throughput write operations. It is widely used in databases and storage systems, such as Cassandra, RocksDB, and LevelDB.
The key idea of LSM-Tree is to first write operations into an in-memory data structure (typically an ordered structure like a skip list or AVL tree). Later, these writes are batched and sequentially written to disk as SSTables, thereby minimizing random I/O.
The LSM-Tree is divided into two main components: the in-memory Memtable, which absorbs writes, and the on-disk SSTables (Sorted String Tables), which store immutable, sorted data.
Typically, the structure of an SSTable includes more than just a series of ordered key-value pairs (data blocks). It also contains an index block, metadata block, and other components. These details will be discussed in-depth during the implementation section.
Writing data involves adding a new key-value pair or updating an existing one. Updates overwrite old key-value pairs, which are later removed during the compaction process.
When data is written, it first goes to the Memtable, where the key-value pair is added to the in-memory ordered data structure. Simultaneously, the write operation is logged in the WAL and persisted to disk to prevent data loss in the event of a database crash.
The Memtable has a defined threshold (usually based on size). When the Memtable exceeds this threshold, it is switched to read-only mode and converted into a new SSTable, which is then persisted to Level 0 on disk.
Once the Memtable is flushed as an SSTable, the corresponding WAL file can be safely deleted. Subsequent write operations will proceed on a new Memtable (and a new WAL).
In LSM-Tree, data is not immediately removed. Instead, deletions are handled using a mechanism called tombstones (similar to soft deletes). When a key-value pair is deleted, a new entry marked with a "tombstone" is written, indicating the deletion of the corresponding key-value pair. The actual removal occurs during the compaction process.
This tombstone-based deletion ensures the append-only property of LSM-Tree, avoiding random I/O and maintaining sequential writes to disk.
The process of querying data starts with a search in the Memtable. If the key-value pair is found, it is returned to the client. If a tombstone-marked key-value pair is found, it indicates that the requested data has been deleted, and this information is also returned. If the key is not found in the Memtable, the query proceeds to search the SSTables from Level 0 to Level N.
Since querying data may involve searching multiple SSTable files and can lead to random disk I/O, LSM-Tree is generally better suited for write-heavy workloads rather than read-intensive ones.
One common optimization for query performance is the use of a Bloom filter. A Bloom filter can quickly determine whether a key-value pair exists in a specific SSTable, reducing unnecessary disk I/O. Additionally, the sorted nature of SSTables enables efficient search algorithms, such as binary search, to be employed for faster lookups.
Here, we introduce the Leveled Compaction Strategy, which is used by LevelDB and RocksDB.
Another common strategy is the Size-Tiered Compaction Strategy, where newer and smaller SSTables are successively merged into older and larger SSTables.
As previously mentioned, an SSTable stores a series of entries sorted by key. In the Leveled Compaction Strategy, SSTables are organized into multiple levels (Level 0 to Level N).
In Level 0, SSTables can have overlapping key ranges, as they are directly flushed from the Memtable. However, in Levels 1 through N, SSTables within the same level do not have overlapping key ranges, although key range overlaps are allowed between SSTables in different levels.
An illustrative (though not entirely accurate) example is shown below. In Level 0, the key ranges of the first and second SSTables overlap, while in Level 1 and Level 2, the SSTables within each level have disjoint key ranges. However, SSTables between different levels (e.g., Level 0 and Level 1, or Level 1 and Level 2) may have overlapping key ranges.
Let’s now explore how the Leveled Compaction Strategy maintains this organizational structure.
Since Level 0 is a special case, the compaction strategy discussion is divided into two parts: compaction from Level 0 to Level 1, and compaction from Level N to Level N+1 (N > 0).
The primary difference between Level 0 to Level 1 compaction and Level N to Level N+1 (N > 0) compaction lies in the selection of SSTables at the lower level (Level 0 or Level N).
The multi-SSTable compaction and merging process is illustrated below. During the merge, only the latest value for each key is retained. If the latest value has a "tombstone" marker, the key is deleted. In the implementation, we use the k-way merge algorithm to perform this process.
It is important to note that the above description of the compaction process provides only a high-level overview. Many details need to be addressed during actual implementation.
For example, in LevelDB, when constructing new SSTables for Level N+1 during compaction, if the new SSTable overlaps with more than 10 SSTables in Level N+2, the process switches to constructing another SSTable. This limits the data size involved in a single compaction.
Based on the overview of LSM-Tree above, I believe you now have a basic understanding of LSM-Tree and some ideas about its implementation. Next, we will build a storage engine based on LSM-Tree from scratch. Below, we will introduce only the core code; for the complete code, please refer to https://github.com/B1NARY-GR0UP/originium.
We will break down the implementation of LSM-Tree into the following core components and implement them one by one: the Skip List, the WAL, the Memtable, the SSTable, K-Way Merge, the Bloom Filter, and Leveled Compaction.
In the process of introducing data writing, we mentioned that the LSM-Tree first writes data into an in-memory ordered data structure. Some common ordered data structures and the time complexity of their operations are as follows:
| Data Structure | Insert | Delete | Search | Traverse |
|---|---|---|---|---|
| Skip List | O(log n) | O(log n) | O(log n) | O(n) |
| AVL Tree | O(log n) | O(log n) | O(log n) | O(n) |
| Red-Black Tree | O(log n) | O(log n) | O(log n) | O(n) |
We chose the Skip List for two main reasons: it is simpler to implement and maintain (KISS principle), and the underlying linked list structure facilitates sequential traversal, making it easier to persist in-memory data to disk.
The complete implementation of the Skip List is available at https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/skiplist/skiplist.go.
A Skip List consists of a base linked list and multiple levels of indices built on top of it. For large datasets, the index layers significantly shorten the search path.
In our implementation, the core structure of the Skip List is defined as follows:
```go
type SkipList struct {
	maxLevel int
	p        float64
	level    int
	rand     *rand.Rand
	size     int
	head     *Element
}
```
The structure of the elements stored in the Skip List is defined as follows:
```go
type Element struct {
	types.Entry
	next []*Element
}

// https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/types/entry.go
type Entry struct {
	Key       string
	Value     []byte
	Tombstone bool
}
```
types.Entry represents a key-value pair in the storage engine, including the key, value, and a tombstone flag for deletion.
next: Contains pointers to the next element at each level.
This structure may seem abstract, so let's illustrate it with an example:
```
Level 3: 3 ----------- 9 ----------- 21 --------- 26
Level 2: 3 ----- 6 ---- 9 ------ 19 -- 21 ---- 25 -- 26
Level 1: 3 -- 6 -- 7 -- 9 -- 12 -- 19 -- 21 -- 25 -- 26

next of head      [ ->3, ->3, ->3 ]
next of Element 3 [ ->6, ->6, ->9 ]
next of Element 6 [ ->7, ->9 ]
```
In this three-level Skip List, the next pointers of the head node reference the first node at each level. Elements 3 and 6 store the next element for each of their levels.
For example, if we want to find the next node of element 19 at Level 2, we use e19.next[2-1].
```go
func (s *SkipList) Set(entry types.Entry)
```
The LSM-Tree uses tombstones to perform deletions, so we don’t need a Delete method in the skip list implementation. To delete an element, simply set the Tombstone of the entry to true. Thus, the Set method handles inserting new key-value pairs, updating existing ones, and deleting elements.
Let's explore the implementation of the Set method. By traversing the nodes in each level from the highest, the last element smaller than the key to be set is saved in the update slice.

```go
curr := s.head
update := make([]*Element, s.maxLevel)

for i := s.maxLevel - 1; i >= 0; i-- {
	for curr.next[i] != nil && curr.next[i].Key < entry.Key {
		curr = curr.next[i]
	}
	update[i] = curr
}
```

At the end of this traversal, curr points to the last element smaller than the key to be set in the bottom-level linked list. So we only need to check whether the next element of curr equals the key we want to set. If it matches, the element has already been inserted; we update the existing element and return.

```go
// update entry
if curr.next[0] != nil && curr.next[0].Key == entry.Key {
	s.size += len(entry.Value) - len(curr.next[0].Value)

	// update value and tombstone
	curr.next[0].Value = entry.Value
	curr.next[0].Tombstone = entry.Tombstone
	return
}
```

If the element is not found, it is inserted as a new element. Using randomLevel, we calculate the index level of this element. If it exceeds the current number of levels in the skip list, we add the head node to the update slice and update s.level to the new level count.

```go
// add entry
level := s.randomLevel()

if level > s.level {
	for i := s.level; i < level; i++ {
		update[i] = s.head
	}
	s.level = level
}
```

Next, we construct the element to be inserted, and the next pointers of each level are updated to complete the insertion.

```go
e := &Element{
	Entry: types.Entry{
		Key:       entry.Key,
		Value:     entry.Value,
		Tombstone: entry.Tombstone,
	},
	next: make([]*Element, level),
}

for i := range level {
	e.next[i] = update[i].next[i]
	update[i].next[i] = e
}

s.size += len(entry.Key) + len(entry.Value) + int(unsafe.Sizeof(entry.Tombstone)) + len(e.next)*int(unsafe.Sizeof((*Element)(nil)))
```

The skip list performs fast search operations by relying on its multiple layers of indexes. The nested for loops in the implementation represent the index-based search; if the corresponding element is found in the bottom-level linked list, it is returned.

```go
func (s *SkipList) Get(key types.Key) (types.Entry, bool) {
	curr := s.head

	for i := s.maxLevel - 1; i >= 0; i-- {
		for curr.next[i] != nil && curr.next[i].Key < key {
			curr = curr.next[i]
		}
	}

	curr = curr.next[0]
	if curr != nil && curr.Key == key {
		return types.Entry{
			Key:       curr.Key,
			Value:     curr.Value,
			Tombstone: curr.Tombstone,
		}, true
	}
	return types.Entry{}, false
}
```

One reason we chose the skip list is its convenient sequential traversal, which only requires walking the bottom-level linked list.

```go
func (s *SkipList) All() []types.Entry {
	var all []types.Entry

	for curr := s.head.next[0]; curr != nil; curr = curr.next[0] {
		all = append(all, types.Entry{
			Key:       curr.Key,
			Value:     curr.Value,
			Tombstone: curr.Tombstone,
		})
	}
	return all
}
```
The complete implementation of WAL can be found at https://github.com/B1NARY-GR0UP/originium/blob/main/wal/wal.go.
As mentioned earlier, the purpose of WAL (Write-Ahead Logging) is to prevent data loss in the Memtable caused by database crashes. Therefore, WAL needs to record operations on the Memtable and recover the Memtable from the WAL file when the database restarts.
The core structure of WAL is as follows, where fd stores the file descriptor of the WAL file:
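The struct itself isn't reproduced in this excerpt; a minimal sketch consistent with the description (only fd is named in the text, the other fields are assumptions) might look like:

```go
// Sketch only: field names other than fd are assumptions.
type WAL struct {
	mu   sync.Mutex // assumed: serializes access to the file
	fd   *os.File   // file descriptor of the WAL file
	path string     // assumed: path of the WAL file on disk
}
```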
Since we need to record operations on the Memtable, this essentially involves writing each operation (Set, Delete) as an Entry into the WAL. The definition of the Write method is as follows:
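The exact signature isn't shown in this excerpt; a plausible form, assuming variadic entries and an error return, is:

```go
// Sketch only: the repository's actual signature may differ.
func (w *WAL) Write(entries ...types.Entry) error
```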
When writing these entries to the file, we need to standardize the WAL file format. The format we use here is `length + data`. First, we serialize the Entry, then calculate the length of the serialized data, and finally write the length and serialized data sequentially into the WAL file.
The core code is as follows:
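The original core code isn't reproduced here; the following sketch implements the described `length + data` format (encodeEntry is a hypothetical serialization helper, and the int64 length matches the 8-byte prefix mentioned below):

```go
// Sketch only. Assumed imports: encoding/binary.
func (w *WAL) Write(entries ...types.Entry) error {
	w.mu.Lock()
	defer w.mu.Unlock()

	for _, entry := range entries {
		data, err := encodeEntry(entry) // assumed: serialize the entry
		if err != nil {
			return err
		}
		// write the length of the serialized data as an int64 ...
		if err := binary.Write(w.fd, binary.LittleEndian, int64(len(data))); err != nil {
			return err
		}
		// ... followed by the serialized data itself
		if _, err := w.fd.Write(data); err != nil {
			return err
		}
	}
	return w.fd.Sync()
}
```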
Since we use the `length + data` WAL file format, during reading we first read 8 bytes (int64) to obtain the length of the data, then read that many bytes and deserialize them to retrieve the Entry.
The core code is as follows:
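A sketch of the read path, mirroring the write path under the same assumptions (decodeEntry is hypothetical):

```go
// Sketch only. Assumed imports: encoding/binary, errors, io.
func (w *WAL) Read() ([]types.Entry, error) {
	w.mu.Lock()
	defer w.mu.Unlock()

	// read from the beginning of the WAL file
	if _, err := w.fd.Seek(0, io.SeekStart); err != nil {
		return nil, err
	}

	var entries []types.Entry
	for {
		// read the 8-byte (int64) length prefix
		var length int64
		if err := binary.Read(w.fd, binary.LittleEndian, &length); err != nil {
			if errors.Is(err, io.EOF) {
				break
			}
			return nil, err
		}

		// read exactly `length` bytes of serialized entry data
		data := make([]byte, length)
		if _, err := io.ReadFull(w.fd, data); err != nil {
			return nil, err
		}

		entry, err := decodeEntry(data) // assumed: deserialize the entry
		if err != nil {
			return nil, err
		}
		entries = append(entries, entry)
	}
	return entries, nil
}
```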
The complete implementation of Memtable can be found at https://github.com/B1NARY-GR0UP/originium/blob/main/memtable.go.
The Memtable is responsible for writing client operations into the skip list and recording them in the WAL. It can also recover data from the WAL when the database starts.
The core structure of the Memtable includes two main components, the skiplist and the wal:
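A minimal sketch under the assumption that the field names follow the description above:

```go
// Sketch only: field names are assumptions based on the description.
type memtable struct {
	mu       sync.RWMutex
	skiplist *skiplist.SkipList
	wal      *wal.WAL
}
```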
When performing a Set operation, both the skip list and the WAL need to be updated simultaneously.
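A sketch of such a set, building on the WAL and skip-list sketches above:

```go
// Sketch only: log to the WAL first, then apply to the skip list.
func (mt *memtable) set(entry types.Entry) error {
	mt.mu.Lock()
	defer mt.mu.Unlock()

	if err := mt.wal.Write(entry); err != nil {
		return err
	}
	mt.skiplist.Set(entry)
	return nil
}
```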
To retrieve a value, simply return the result of the skip list's Get operation.
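```go
// Sketch only: reads are served entirely by the skip list.
func (mt *memtable) get(key types.Key) (types.Entry, bool) {
	mt.mu.RLock()
	defer mt.mu.RUnlock()
	return mt.skiplist.Get(key)
}
```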
Recovering the Memtable from the WAL file involves first reading the WAL file, then sequentially applying the Entry records from the WAL file to the Memtable, and finally deleting the recovered WAL file.
Retrieve the list of WAL files:
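A sketch of this step; the data directory and the ".wal" suffix are assumptions:

```go
// Sketch only. Assumed imports: os, path/filepath.
files, err := os.ReadDir(dir)
if err != nil {
	return err
}

var walFiles []string
for _, file := range files {
	if !file.IsDir() && filepath.Ext(file.Name()) == ".wal" {
		walFiles = append(walFiles, filepath.Join(dir, file.Name()))
	}
}
```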
Read the WAL and recover the Memtable:
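A sketch of the recovery loop; wal.Open and w.Delete are assumed helpers:

```go
// Sketch only: replay every logged entry, then remove the recovered WAL file.
for _, path := range walFiles {
	w, err := wal.Open(path) // assumed: opens an existing WAL file
	if err != nil {
		return err
	}

	entries, err := w.Read()
	if err != nil {
		return err
	}

	// apply every logged operation to the skip list
	for _, entry := range entries {
		mt.skiplist.Set(entry)
	}

	// the recovered WAL file is no longer needed
	if err := w.Delete(); err != nil { // assumed: closes and removes the file
		return err
	}
}
```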
In the previous introduction, we only mentioned that "SSTable (Sorted String Table) is a data storage format that maintains a series of ordered key-value pairs." Here, we will provide a more detailed explanation of the structure of SSTable.
In LevelDB, an SSTable consists of multiple blocks with different purposes. A schematic diagram is shown below:
The index information is essentially a pointer structure called a BlockHandle, which includes two attributes: offset and size, used to locate the corresponding Block.
In our implementation of SSTable, we have simplified the LevelDB SSTable structure. A schematic diagram is shown below:
The complete implementation of SSTable can be found at https://github.com/B1NARY-GR0UP/originium/tree/main/sstable.
The structure of the Data Block is defined as follows, storing an ordered sequence of entries.
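A minimal sketch consistent with that description (see the repository for the real definition):

```go
// Sketch only: a data block holds an ordered sequence of entries.
type DataBlock struct {
	Entries []types.Entry
}
```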
We implemented three primary methods for the Data Block: Encode, Decode, and Search.
We use prefix compression to encode the key-value sequence. In the buffer, we sequentially write the length of the common prefix, the length of the suffix, the suffix itself, the length of the value, the value, and the "tombstone" marker.
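A sketch of this encoding under the DataBlock sketch above (longestCommonPrefix is a small helper defined here for illustration):

```go
// Sketch only: prefix-compress keys against the previous entry, then s2-compress.
// Assumed imports: bytes, encoding/binary, github.com/klauspost/compress/s2.
func (d *DataBlock) Encode() ([]byte, error) {
	var buf bytes.Buffer
	var prevKey string

	for _, entry := range d.Entries {
		lcp := longestCommonPrefix(prevKey, entry.Key)
		suffix := entry.Key[lcp:]

		// common prefix length, suffix length, suffix
		_ = binary.Write(&buf, binary.LittleEndian, uint16(lcp))
		_ = binary.Write(&buf, binary.LittleEndian, uint16(len(suffix)))
		buf.WriteString(suffix)

		// value length, value, tombstone marker
		_ = binary.Write(&buf, binary.LittleEndian, uint32(len(entry.Value)))
		buf.Write(entry.Value)
		_ = binary.Write(&buf, binary.LittleEndian, entry.Tombstone)

		prevKey = entry.Key
	}

	// compress the encoded block with s2
	return s2.Encode(nil, buf.Bytes()), nil
}

func longestCommonPrefix(a, b string) int {
	n := min(len(a), len(b))
	i := 0
	for i < n && a[i] == b[i] {
		i++
	}
	return i
}
```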
Finally, we compress the data using s2.
S2 is a high-performance extension of the Snappy compression algorithm.
During decoding, the process is simply reversed. The full key-value pair is reconstructed using the prefix and suffix.
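A decode sketch that exactly reverses the Encode sketch above:

```go
// Sketch only. Assumed imports: bytes, encoding/binary, io, s2.
func (d *DataBlock) Decode(data []byte) error {
	raw, err := s2.Decode(nil, data)
	if err != nil {
		return err
	}

	r := bytes.NewReader(raw)
	var prevKey string
	for r.Len() > 0 {
		var lcp, suffixLen uint16
		if err := binary.Read(r, binary.LittleEndian, &lcp); err != nil {
			return err
		}
		if err := binary.Read(r, binary.LittleEndian, &suffixLen); err != nil {
			return err
		}

		suffix := make([]byte, suffixLen)
		if _, err := io.ReadFull(r, suffix); err != nil {
			return err
		}
		// reconstruct the full key from the shared prefix and the suffix
		key := prevKey[:int(lcp)] + string(suffix)

		var valueLen uint32
		if err := binary.Read(r, binary.LittleEndian, &valueLen); err != nil {
			return err
		}
		value := make([]byte, valueLen)
		if _, err := io.ReadFull(r, value); err != nil {
			return err
		}

		var tombstone bool
		if err := binary.Read(r, binary.LittleEndian, &tombstone); err != nil {
			return err
		}

		d.Entries = append(d.Entries, types.Entry{Key: key, Value: value, Tombstone: tombstone})
		prevKey = key
	}
	return nil
}
```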
The structure of the Index Block is defined as follows. It stores the first and last key of each Data Block, along with the BlockHandle of the corresponding Data Block.
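A sketch consistent with that description (field names are assumptions):

```go
// Sketch only: one index entry per data block.
type IndexBlock struct {
	Entries []IndexEntry
}

type IndexEntry struct {
	StartKey   string      // first key of the data block
	EndKey     string      // last key of the data block
	DataHandle BlockHandle // locates the data block within the SSTable
}

// BlockHandle is the pointer structure described earlier.
type BlockHandle struct {
	Offset uint64
	Size   uint64
}
```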
Similarly, the Index Block implements three primary methods: Encode, Decode, and Search. The implementation ideas for the Encode and Decode methods are essentially the same, so we will focus on the Search method.
The Search method of the Data Block is designed to locate a specific key-value pair within the ordered key-value sequence stored in a single Data Block. In contrast, the Search method of the Index Block is used to locate the Data Block containing the given key within the entire SSTable.
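A sketch of the Index Block's Search, using binary search over the ordered index entries:

```go
// Sketch only. Assumed import: sort.
func (i *IndexBlock) Search(key string) (BlockHandle, bool) {
	n := len(i.Entries)
	// find the first data block whose last key is >= key
	idx := sort.Search(n, func(j int) bool {
		return i.Entries[j].EndKey >= key
	})
	if idx < n && i.Entries[idx].StartKey <= key {
		return i.Entries[idx].DataHandle, true
	}
	return BlockHandle{}, false
}
```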
The implementations of the two remaining blocks, the metadata block and the footer, are quite straightforward, with both requiring only the Encode and Decode methods.
After introducing all the Blocks in our SSTable, constructing an SSTable simply involves building each Block step by step based on the key-value pairs. Finally, the in-memory index and the encoded SSTable are returned.
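A simplified build sketch under the assumptions above (blockSize and the helper types are illustrative, not the repository's API):

```go
// Sketch only: split entries into fixed-size data blocks and index them.
func Build(entries []types.Entry, blockSize int) (IndexBlock, []byte) {
	var (
		index IndexBlock
		buf   bytes.Buffer
	)

	for start := 0; start < len(entries); start += blockSize {
		end := min(start+blockSize, len(entries))
		block := DataBlock{Entries: entries[start:end]}

		data, _ := block.Encode()
		offset := uint64(buf.Len())
		buf.Write(data)

		index.Entries = append(index.Entries, IndexEntry{
			StartKey:   entries[start].Key,
			EndKey:     entries[end-1].Key,
			DataHandle: BlockHandle{Offset: offset, Size: uint64(len(data))},
		})
	}

	// the index block and remaining blocks would be encoded and appended here
	return index, buf.Bytes()
}
```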
The complete implementation of K-Way Merge is available at https://github.com/B1NARY-GR0UP/originium/tree/main/pkg/kway.
In the conceptual section, we illustrated the process of compacting and merging multiple SSTables. This process is accomplished using the k-way merge algorithm.
The k-way merge algorithm merges k sorted sequences into a single sorted sequence, with a time complexity of O(kn log k), where n is the length of each sequence.
One implementation of this algorithm uses a min-heap as an auxiliary structure: push the head element of each sequence into the heap; pop the minimum and append it to the result; then push the next element from the sequence the popped element came from, repeating until the heap is empty.
The standard library provides a heap implementation in container/heap. By implementing the heap.Interface, we can build a min-heap.
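A sketch of such a heap; the item type and its fields are assumptions for this illustration:

```go
// Sketch only: a min-heap over merge items, via container/heap.
type item struct {
	entry   types.Entry
	listIdx int // which input list the entry came from
	elemIdx int // position of the entry within that list
}

type itemHeap []item

func (h itemHeap) Len() int { return len(h) }
func (h itemHeap) Less(i, j int) bool {
	if h[i].entry.Key != h[j].entry.Key {
		return h[i].entry.Key < h[j].entry.Key
	}
	return h[i].listIdx < h[j].listIdx // on equal keys, pop the older list first
}
func (h itemHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *itemHeap) Push(x any)   { *h = append(*h, x.(item)) }
func (h *itemHeap) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}
```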
The function definition of the Merge method:
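```go
// Sketch only: the repository's actual signature may differ.
func Merge(lists ...[]types.Entry) []types.Entry
```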
The implementation follows the k-way merge process described above.
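Building on the heap sketch above, a possible implementation, assuming the input lists are ordered oldest to newest:

```go
// Sketch only. Assumed imports: container/heap, sort.
func Merge(lists ...[]types.Entry) []types.Entry {
	h := &itemHeap{}
	// seed the heap with the head of each sorted list
	for i, l := range lists {
		if len(l) > 0 {
			heap.Push(h, item{entry: l[0], listIdx: i})
		}
	}

	latest := make(map[string]types.Entry)
	for h.Len() > 0 {
		it := heap.Pop(h).(item)
		// equal keys pop oldest-list first, so the newest value wins here
		latest[it.entry.Key] = it.entry

		// advance within the list the popped entry came from
		if next := it.elemIdx + 1; next < len(lists[it.listIdx]) {
			heap.Push(h, item{entry: lists[it.listIdx][next], listIdx: it.listIdx, elemIdx: next})
		}
	}

	// drop tombstones and restore key order, since map iteration is unordered
	result := make([]types.Entry, 0, len(latest))
	for _, entry := range latest {
		if !entry.Tombstone {
			result = append(result, entry)
		}
	}
	sort.Slice(result, func(i, j int) bool { return result[i].Key < result[j].Key })
	return result
}
```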
Finally, traverse the map to add elements to the result queue, removing any key-value pairs marked as "tombstones." Since the map is unordered, the result queue needs to be sorted before being returned.
The complete implementation of the Bloom Filter can be found at https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/filter/filter.go.
A Bloom Filter is a data structure that efficiently checks whether an element is a member of a set.
The time complexity for both insertion and query operations is O(k), where k is the number of hash functions. False positives may occur (the filter reports an element as present when it is not), but false negatives cannot (the filter will never report an element as absent when it is actually present).
True Positive (TP): The system predicts an event as "positive," and it is indeed positive.
False Positive (FP): The system predicts an event as "positive," but it is actually negative.
True Negative (TN): The system predicts an event as "negative," and it is indeed negative.
False Negative (FN): The system predicts an event as "negative," but it is actually positive.
The core structure of a Bloom Filter includes the bit array (which can be optimized to use uint8) and multiple hash functions.
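A minimal sketch consistent with that description:

```go
// Sketch only: a []bool bit array (the text notes []uint8 as an optimization).
type Filter struct {
	bits   []bool
	hashes []func(data []byte) uint64
}
```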
The method to create a Bloom Filter instance accepts two parameters: n (the expected number of elements) and p (the desired false positive rate).
First, the parameters are validated. Then, the size of the bit array (m) and the number of hash functions (k) are calculated using specific formulas. Finally, the bit array and hash functions are initialized based on m and k.
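The formulas are the standard ones, m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2. A sketch using FNV-based hash functions (the repository may use different hashes):

```go
// Sketch only. Assumed imports: errors, hash/fnv, math.
func New(n int, p float64) (*Filter, error) {
	if n <= 0 || p <= 0 || p >= 1 {
		return nil, errors.New("invalid parameters")
	}

	// m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2
	m := int(math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)))
	k := max(1, int(math.Round(float64(m)/float64(n)*math.Ln2)))

	f := &Filter{bits: make([]bool, m)}
	for i := 0; i < k; i++ {
		seed := byte(i) // give each hash function a distinct seed
		f.hashes = append(f.hashes, func(data []byte) uint64 {
			h := fnv.New64a()
			h.Write([]byte{seed})
			h.Write(data)
			return h.Sum64()
		})
	}
	return f, nil
}
```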
When adding an element, all hash functions are used to compute hash values for the key. These values are then mapped to indices in the bit array, and the corresponding positions are set to true.
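```go
// Sketch only: set the bit at every hashed index.
func (f *Filter) Add(key string) {
	for _, hash := range f.hashes {
		idx := hash([]byte(key)) % uint64(len(f.bits))
		f.bits[idx] = true
	}
}
```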
To check if a key is in the set, the hash functions compute hash values and map them to indices in the bit array. If any of these positions is not true, the element is not in the set, and false is returned.
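```go
// Sketch only: may return a false positive, never a false negative.
func (f *Filter) Contains(key string) bool {
	for _, hash := range f.hashes {
		idx := hash([]byte(key)) % uint64(len(f.bits))
		if !f.bits[idx] {
			return false
		}
	}
	return true
}
```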
The complete implementation of Leveled Compaction can be found at https://github.com/B1NARY-GR0UP/originium/blob/main/level.go.
After implementing components like K-Way Merge and Bloom Filter, we can complete the final and most crucial part of our implementation: the SSTable compaction process of the LSM-Tree storage engine. This implementation follows the Leveled Compaction Strategy described in the "Data Compaction" section.
In the Leveled Compaction Strategy, SSTables are organized into multiple levels (Level 0 - Level N). We need a structure to store this information and manage SSTables across different levels.
Thus, we implemented a structure called levelManager. We use a []*list.List to store SSTable information for each level, where the index of the slice corresponds to the level. Each element in the slice is a list.List, a doubly-linked list that holds information about all SSTables in a specific level.
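A minimal sketch consistent with that description:

```go
// Sketch only. Assumed imports: container/list, sync.
type levelManager struct {
	mu     sync.Mutex
	levels []*list.List // index = level; each list element describes one SSTable
}
```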
The compactLN method is responsible for Level N to Level N+1 (N > 0) compaction. It selects the oldest SSTable from LN and all SSTables from LN+1 that have overlapping key ranges with this SSTable.
The selected SSTables are processed in order from oldest to newest. Key-value pairs from their data blocks are added to a two-dimensional slice and then merged using the K-Way Merge algorithm.
With the merged key-value pairs, we construct a new Bloom Filter and SSTable. The relevant information about the new SSTable is appended to the end of Level N+1.
Finally, the old SSTables are deleted, and the newly constructed SSTable is written to disk.
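Putting these steps together, compactLN might be sketched as follows; overlapping, loadEntries, buildSSTable, removeSSTable, and persist are hypothetical helpers standing in for the repository's internals:

```go
// Sketch only: compact the oldest SSTable of level n into level n+1.
func (lm *levelManager) compactLN(n int) error {
	lm.mu.Lock()
	defer lm.mu.Unlock()

	// 1. the oldest SSTable in LN plus all overlapping SSTables in LN+1
	oldest := lm.levels[n].Front()
	selected := append([]*list.Element{oldest}, lm.overlapping(n+1, oldest)...)

	// 2. gather entries oldest to newest and k-way merge them
	var lists [][]types.Entry
	for _, e := range selected {
		lists = append(lists, loadEntries(e))
	}
	merged := kway.Merge(lists...)

	// 3. build a new Bloom filter and SSTable, append it to the end of LN+1
	newTable := buildSSTable(merged)
	lm.levels[n+1].PushBack(newTable)

	// 4. delete the old SSTables and persist the new one to disk
	for _, e := range selected {
		removeSSTable(e)
	}
	return persist(newTable)
}
```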
The compactL0 method handles Level 0 to Level 1 compaction. Unlike compactLN, it selects not only one SSTable from L0 but also all overlapping SSTables in L0. The rest of the process is identical to compactLN.
The search method locates the corresponding key-value pairs across all SSTables. It starts from L0 and iterates through each level up to LN. By leveraging the Bloom Filter and the ordered structure of SSTables, SSTables that do not contain the desired key-value pairs can be skipped efficiently.
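A sketch of that search; tableHandle and its fields (startKey, endKey, filter, get) are assumed in-memory metadata kept for each SSTable:

```go
// Sketch only: probe levels top-down, skipping tables that cannot hold the key.
func (lm *levelManager) search(key types.Key) (types.Entry, bool) {
	for _, level := range lm.levels {
		// within a level, newer tables sit at the back, so scan backwards
		for e := level.Back(); e != nil; e = e.Prev() {
			t := e.Value.(*tableHandle)
			if key < t.startKey || key > t.endKey {
				continue // key range check skips non-overlapping tables
			}
			if !t.filter.Contains(string(key)) {
				continue // Bloom filter says the key is definitely absent
			}
			if entry, ok := t.get(key); ok { // index block + data block lookup
				return entry, true
			}
		}
	}
	return types.Entry{}, false
}
```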
With this, we have implemented all core components of the LSM-Tree-based storage engine. By assembling these components as described in the LSM-Tree introduction, we can finalize the database interface.
Complete code: https://github.com/B1NARY-GR0UP/originium/blob/main/db.go
Documentation: https://github.com/B1NARY-GR0UP/originium?tab=readme-ov-file#usage
We began by understanding LSM-Tree, familiarizing ourselves with its core components and the process of handling client requests. Ultimately, we built our own LSM-Tree storage engine from scratch.
Of course, this implementation is just a prototype. A production-grade storage engine requires consideration of many more details. ORIGINIUM will continue to receive optimizations and improvements in the future. Hope this article and ORIGINIUM help deepen your understanding of LSM-Tree.
That concludes everything covered in this article. If there are any errors or questions, feel free to reach out via private message or leave a comment. Thank you!