Home >Backend Development >Golang >Building an LSM-Tree Storage Engine from Scratch

Building an LSM-Tree Storage Engine from Scratch

Barbara Streisand
Barbara StreisandOriginal
2025-01-03 07:37:39578browse

Preface

This article will guide you through understanding the Log-Structured Merge-Tree (LSM-Tree), including its core concepts and structure. By the end, you'll be able to build your own storage engine based on LSM-Tree from scratch.

What is LSM-Tree?

Basic Concepts

The Log-Structured Merge-Tree (LSM-Tree) is a data structure optimized for high-throughput write operations. It is widely used in databases and storage systems, such as Cassandra, RocksDB, and LevelDB.

The key idea of LSM-Tree is to first write operations into an in-memory data structure (typically an ordered structure like a skip list or AVL tree). Later, these writes are batched and sequentially written to disk as SSTables, thereby minimizing random I/O.

Basic Structure

Building an LSM-Tree Storage Engine from Scratch

The LSM-Tree is divided into two main components:

  • In-Memory Storage
    • The core structure in memory is the Memtable.
    • All write operations (e.g., set, delete) first go to the Memtable, which inserts these operations into an ordered data structure (e.g., an ordered tree in the diagram).
    • Once the Memtable reaches a certain size threshold, it is flushed to disk as an SSTable (written sequentially).
    • New write operations continue on a fresh Memtable.
  • Disk Storage
    • Disk storage involves WAL and SSTable files.
    • The WAL (Write-Ahead Log) ensures that recent writes (stored in the Memtable but not yet persisted to disk) are not lost in case of a database crash. Every write to the Memtable is appended to the WAL. Upon restarting the database, entries from the WAL can be replayed in order to restore the Memtable to its pre-crash state.
    • The SSTable (Sorted String Table) is a data storage format that holds a series of ordered key-value pairs.
    • When the Memtable reaches its size threshold, it generates a new SSTable and persists it to disk. Since the Memtable relies on an in-memory ordered data structure, no additional sorting is required when constructing the SSTable.
    • SSTables on disk are organized into multiple levels. Newly flushed SSTables are stored in Level 0. During subsequent compaction phases, SSTables in L0 are merged into Level 1 and higher levels.
    • When the size of a level exceeds a threshold, an SSTable compaction process is triggered. During compaction, SSTables in the current level are merged into higher levels, producing larger, more ordered files. This reduces fragmentation and improves query efficiency.

Typically, the structure of an SSTable includes more than just a series of ordered key-value pairs (data blocks). It also contains an index block, metadata block, and other components. These details will be discussed in-depth during the implementation section.

Writing Data

Writing data involves adding a new key-value pair or updating an existing one. Updates overwrite old key-value pairs, which are later removed during the compaction process.

When data is written, it first goes to the Memtable, where the key-value pair is added to the in-memory ordered data structure. Simultaneously, the write operation is logged in the WAL and persisted to disk to prevent data loss in the event of a database crash.

The Memtable has a defined threshold (usually based on size). When the Memtable exceeds this threshold, it is switched to read-only mode and converted into a new SSTable, which is then persisted to Level 0 on disk.

Once the Memtable is flushed as an SSTable, the corresponding WAL file can be safely deleted. Subsequent write operations will proceed on a new Memtable (and a new WAL).

Deleting Data

In LSM-Tree, data is not immediately removed. Instead, deletions are handled using a mechanism called tombstones (similar to soft deletes). When a key-value pair is deleted, a new entry marked with a "tombstone" is written, indicating the deletion of the corresponding key-value pair. The actual removal occurs during the compaction process.

This tombstone-based deletion ensures the append-only property of LSM-Tree, avoiding random I/O and maintaining sequential writes to disk.

Querying Data

The process of querying data starts with a search in the Memtable. If the key-value pair is found, it is returned to the client. If a tombstone-marked key-value pair is found, it indicates that the requested data has been deleted, and this information is also returned. If the key is not found in the Memtable, the query proceeds to search the SSTables from Level 0 to Level N.

Since querying data may involve searching multiple SSTable files and can lead to random disk I/O, LSM-Tree is generally better suited for write-heavy workloads rather than read-intensive ones.

One common optimization for query performance is the use of a Bloom filter. A Bloom filter can quickly determine whether a key-value pair exists in a specific SSTable, reducing unnecessary disk I/O. Additionally, the sorted nature of SSTables enables efficient search algorithms, such as binary search, to be employed for faster lookups.

Data Compaction

Here, we introduce the Leveled Compaction Strategy, which is used by LevelDB and RocksDB.

Another common strategy is the Size-Tiered Compaction Strategy, where newer and smaller SSTables are successively merged into older and larger SSTables.

As previously mentioned, an SSTable stores a series of entries sorted by key. In the Leveled Compaction Strategy, SSTables are organized into multiple levels (Level 0 to Level N).

In Level 0, SSTables can have overlapping key ranges, as they are directly flushed from the Memtable. However, in Levels 1 through N, SSTables within the same level do not have overlapping key ranges, although key range overlaps are allowed between SSTables in different levels.

An illustrative (though not entirely accurate) example is shown below. In Level 0, the key ranges of the first and second SSTables overlap, while in Level 1 and Level 2, the SSTables within each level have disjoint key ranges. However, SSTables between different levels (e.g., Level 0 and Level 1, or Level 1 and Level 2) may have overlapping key ranges.

Building an LSM-Tree Storage Engine from Scratch

Let’s now explore how the Leveled Compaction Strategy maintains this organizational structure.

Since Level 0 is a special case, the compaction strategy discussion is divided into two parts:

  • Level 0 to Level 1 Since Level 0 allows overlapping keys among SSTables, compaction begins by selecting an SSTable from Level 0, along with all other SSTables in Level 0 that have overlapping key ranges with it. Next, all SSTables in Level 1 with overlapping key ranges are selected. These selected SSTables are merged and compacted into a single new SSTable, which is then inserted into Level 1. All old SSTables involved in the compaction process are deleted.
  • Level N to Level N 1 (N > 0) From Level 1 onwards, SSTables within the same level do not have overlapping key ranges. During compaction, an SSTable is selected from Level N, and all SSTables in Level N 1 with overlapping key ranges are also selected. These SSTables are merged and compacted into one or more new SSTables, which are inserted into Level N 1, while the old SSTables are deleted.

The primary difference between Level 0 to Level 1 compaction and Level N to Level N 1 (N > 0) compaction lies in the selection of SSTables at lower levels (Level 0 or Level N).

The multi-SSTable compaction and merging process is illustrated below. During the merge, only the latest value for each key is retained. If the latest value has a "tombstone" marker, the key is deleted. In the implementation, we use the k-way merge algorithm to perform this process.

Building an LSM-Tree Storage Engine from Scratch

It is important to note that the above description of the compaction process provides only a high-level overview. Many details need to be addressed during actual implementation.

For example, in LevelDB, when constructing new SSTables for Level N 1 during compaction, if the new SSTable overlaps with more than 10 SSTables in Level N 2, the process switches to constructing another SSTable. This limits the data size involved in a single compaction.

Implementation

Based on the overview of LSM-Tree above, I believe you now have a basic understanding of LSM-Tree and some ideas about its implementation. Next, we will build a storage engine based on LSM-Tree from scratch. Below, we will introduce only the core code; for the complete code, please refer to https://github.com/B1NARY-GR0UP/originium.

We will break down the implementation of LSM-Tree into the following core components and implement them one by one:

  • Skip List
  • WAL
  • Memtable
  • SSTable
  • K-Way Merge
  • Bloom Filter
  • Leveled Compaction

Skip List

In the process of introducing data writing, we mentioned that the LSM-Tree first writes data into an in-memory ordered data structure. Some common ordered data structures and the time complexity of their operations are as follows:

Data Structure Insert Delete Search Traverse
Skip List O(log⁡n) O(log⁡n) O(log⁡n) O(n)
AVL Tree O(log⁡n) O(log⁡n) O(log⁡n) O(n)
Red-Black Tree O(log⁡n) O(log⁡n) O(log⁡n) O(n)

We chose the Skip List for two main reasons: it is simpler to implement and maintain (KISS principle), and the underlying linked list structure facilitates sequential traversal, making it easier to persist in-memory data to disk.

Core Struct

The complete implementation of the Skip List is available at https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/skiplist/skiplist.go .

A Skip List consists of a base linked list and multiple levels of indices built on top of it. For large datasets, the index layers significantly shorten the search path.

In our implementation, the core structure of the Skip List is defined as follows:

type SkipList struct {
    maxLevel int
    p        float64
    level    int
    rand     *rand.Rand
    size     int
    head     *Element
}
  • maxLevel: The maximum number of levels in the Skip List (the base linked list has one level).
  • level: The current number of levels in the Skip List.
  • p: The probability of a node being promoted to a higher level. For example, if p = 0.5, a linked list with 10 nodes at the base level will have approximately 5 nodes in the next level of indices.
  • rand: A random number generator used to compare against p.
  • size: The number of key-value pairs stored in the Skip List, used to determine if the Memtable exceeds its size threshold.
  • head: The head node, which holds references to the first node in each level.

The structure of the elements stored in the Skip List is defined as follows:

type Element struct {
    types.Entry
    next []*Element
}

// https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/types/entry.go
type Entry struct {
    Key       string
    Value     []byte
    Tombstone bool  
}
  • types.Entry represents a key-value pair in the storage engine, including the key, value, and a tombstone flag for deletion.

  • next: Contains pointers to the next element at each level.

This structure may seem abstract, so let's illustrate it with an example:

Level 3:       3 ----------- 9 ----------- 21 --------- 26
Level 2:       3 ----- 6 ---- 9 ------ 19 -- 21 ---- 25 -- 26
Level 1:       3 -- 6 -- 7 -- 9 -- 12 -- 19 -- 21 -- 25 -- 26

next of head [ ->3, ->3, ->3 ]
next of Element 3 [ ->6, ->6, ->9 ]
next of Element 6 [ ->7, ->9 ]

In this three-level Skip List, the next pointers of the head node reference the first node at each level. Elements 3 and 6 store the next element for each of their levels.

For example, if we want to find the next node of element 19 at Level 2, we use e19.next[2-1].

Set

func (s *SkipList) Set(entry types.Entry)

The LSM-Tree uses tombstones to perform deletions, so we don’t need a Delete method in the skip list implementation. To delete an element, simply set the Tombstone of the entry to true. Thus, the Set method handles inserting new key-value pairs, updating existing ones, and deleting elements.

Let’s explore the implementation of the Set method. By traversing the nodes in each level from the highest, the last element smaller than the key to be set is saved in the update slice.

type SkipList struct {
    maxLevel int
    p        float64
    level    int
    rand     *rand.Rand
    size     int
    head     *Element
}

At the end of this traversal, curr points to the last element smaller than the key to be set in the bottom-level linked list. So, we only need to check if the next element of curr equals the key we want to set. If it matches, the element has already been inserted; we update the existing element and return.

type Element struct {
    types.Entry
    next []*Element
}

// https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/types/entry.go
type Entry struct {
    Key       string
    Value     []byte
    Tombstone bool  
}

If the element is not found, it is inserted as a new element. Using randomLevel, we calculate the index level of this element. If it exceeds the current number of levels in the skip list, we add the head node to the update slice and update s.level to the new level count.

Level 3:       3 ----------- 9 ----------- 21 --------- 26
Level 2:       3 ----- 6 ---- 9 ------ 19 -- 21 ---- 25 -- 26
Level 1:       3 -- 6 -- 7 -- 9 -- 12 -- 19 -- 21 -- 25 -- 26

next of head [ ->3, ->3, ->3 ]
next of Element 3 [ ->6, ->6, ->9 ]
next of Element 6 [ ->7, ->9 ]

Next, construct the element to be inserted, and the next pointers of each level are updated to complete the insertion.

func (s *SkipList) Set(entry types.Entry)

Get

The skip list can perform fast search operations by relying on multiple layers of indexes. The nested for loops in the implementation represent the index-based search operation. If the corresponding element is eventually found in the bottom-level linked list, it will be returned.

curr := s.head
update := make([]*Element, s.maxLevel)

for i := s.maxLevel - 1; i >= 0; i-- {
    for curr.next[i] != nil && curr.next[i].Key < entry.Key {
        curr = curr.next[i]
    }
    update[i] = curr
}

All

One reason we chose the skip list is its convenient sequential traversal, which is made possible by simply traversing the bottom-level linked list.

// update entry
if curr.next[0] != nil && curr.next[0].Key == entry.Key {
    s.size += len(entry.Value) - len(curr.next[0].Value)

    // update value and tombstone
    curr.next[0].Value = entry.Value
    curr.next[0].Tombstone = entry.Tombstone
    return
}

WAL

The complete implementation of WAL can be found at https://github.com/B1NARY-GR0UP/originium/blob/main/wal/wal.go.

As mentioned earlier, the purpose of WAL (Write-Ahead Logging) is to prevent data loss in the Memtable caused by database crashes. Therefore, WAL needs to record operations on the Memtable and recover the Memtable from the WAL file when the database restarts.

Core Struct

The core structure of WAL is as follows, where fd stores the file descriptor of the WAL file:

// add entry
level := s.randomLevel()

if level > s.level {
    for i := s.level; i < level; i++ {
        update[i] = s.head
    }
    s.level = level
}

Write

Since we need to record operations on the Memtable, this essentially involves writing each operation (Set, Delete) as an Entry into the WAL. The definition of the Write method is as follows:

e := &Element{
    Entry: types.Entry{
        Key:       entry.Key,
        Value:     entry.Value,
        Tombstone: entry.Tombstone,
    },
    next: make([]*Element, level),
}

for i := range level {
    e.next[i] = update[i].next[i]
    update[i].next[i] = e
}
s.size += len(entry.Key) + len(entry.Value) + int(unsafe.Sizeof(entry.Tombstone)) + len(e.next)*int(unsafe.Sizeof((*Element)(nil)))

When writing these entries to the file, we need to standardize the WAL file format. The format we use here is length data. First, we serialize the Entry, then calculate the length of the serialized data, and finally write the length and serialized data sequentially into the WAL file.

The core code is as follows:

func (s *SkipList) Get(key types.Key) (types.Entry, bool) {
    curr := s.head

    for i := s.maxLevel - 1; i >= 0; i-- {
        for curr.next[i] != nil && curr.next[i].Key < key {
            curr = curr.next[i]
        }
    }

    curr = curr.next[0]

    if curr != nil && curr.Key == key {
        return types.Entry{
            Key:       curr.Key,
            Value:     curr.Value,
            Tombstone: curr.Tombstone,
        }, true
    }
    return types.Entry{}, false
}

Read

Since we use the WAL file format length data, during reading, we first read 8 bytes (int64) to obtain the length of the data, and then read the data based on this length and deserialize it to retrieve the Entry.

The core code is as follows:

type SkipList struct {
    maxLevel int
    p        float64
    level    int
    rand     *rand.Rand
    size     int
    head     *Element
}

Memtable

The complete implementation of Memtable can be found at https://github.com/B1NARY-GR0UP/originium/blob/main/memtable.go.

The Memtable is responsible for writing client operations into the skip list and recording them in the WAL. It can also recover data from the WAL when the database starts.

Core Struct

The core structure of the Memtable is as follows, which includes two main components skiplist and wal:

type Element struct {
    types.Entry
    next []*Element
}

// https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/types/entry.go
type Entry struct {
    Key       string
    Value     []byte
    Tombstone bool  
}

Set

When performing a Set operation, both the skip list and the WAL need to be updated simultaneously.

Level 3:       3 ----------- 9 ----------- 21 --------- 26
Level 2:       3 ----- 6 ---- 9 ------ 19 -- 21 ---- 25 -- 26
Level 1:       3 -- 6 -- 7 -- 9 -- 12 -- 19 -- 21 -- 25 -- 26

next of head [ ->3, ->3, ->3 ]
next of Element 3 [ ->6, ->6, ->9 ]
next of Element 6 [ ->7, ->9 ]

Get

To retrieve a value, simply return the result of the skip list's Get operation.

func (s *SkipList) Set(entry types.Entry)

Recover

Recovering the Memtable from the WAL file involves first reading the WAL file, then sequentially applying the Entry records from the WAL file to the Memtable, and finally deleting the recovered WAL file.

Retrieve the list of WAL files:

curr := s.head
update := make([]*Element, s.maxLevel)

for i := s.maxLevel - 1; i >= 0; i-- {
    for curr.next[i] != nil && curr.next[i].Key < entry.Key {
        curr = curr.next[i]
    }
    update[i] = curr
}

Read the WAL and recover the Memtable:

// update entry
if curr.next[0] != nil && curr.next[0].Key == entry.Key {
    s.size += len(entry.Value) - len(curr.next[0].Value)

    // update value and tombstone
    curr.next[0].Value = entry.Value
    curr.next[0].Tombstone = entry.Tombstone
    return
}

SSTable

LevelDB SSTable

In the previous introduction, we only mentioned that "SSTable (Sorted String Table) is a data storage format that maintains a series of ordered key-value pairs." Here, we will provide a more detailed explanation of the structure of SSTable.

In LevelDB, an SSTable consists of multiple blocks with different purposes. A schematic diagram is shown below:

Building an LSM-Tree Storage Engine from Scratch

  • Data Block: Stores a sequence of ordered key-value pairs.
  • Meta Block: Includes two types: filter and stats. The filter type stores data for Bloom filters, while the stats type stores statistical information about the Data Blocks.
  • MetaIndex Block: Stores index information for the Meta Blocks.
  • Index Block: Stores index information for the Data Blocks.
  • Footer: Fixed in length, it stores the index information of the MetaIndex Block and Index Block, as well as a magic number.

The index information is essentially a pointer structure called a BlockHandle, which includes two attributes: offset and size, used to locate the corresponding Block.

Our SSTable

In our implementation of SSTable, we have simplified the LevelDB SSTable structure. A schematic diagram is shown below:

Building an LSM-Tree Storage Engine from Scratch

  • Data Block: Stores a sequence of ordered key-value pairs.
  • Meta Block: Stores some metadata for the SSTable.
  • Index Block: Stores index information for the Data Blocks.
  • Footer: Fixed in length, it stores the index information of the Meta Block and Index Block.

The complete implementation of SSTable can be found at https://github.com/B1NARY-GR0UP/originium/tree/main/sstable.

Data Block

The structure of the Data Block is defined as follows, storing an ordered sequence of entries.

type SkipList struct {
    maxLevel int
    p        float64
    level    int
    rand     *rand.Rand
    size     int
    head     *Element
}

We implemented three primary methods for the Data Block:

  • Encode: Encodes the Data Block into binary data.
type Element struct {
    types.Entry
    next []*Element
}

// https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/types/entry.go
type Entry struct {
    Key       string
    Value     []byte
    Tombstone bool  
}

We use prefix compression to encode the key-value sequence. In the buffer, we sequentially write the length of the common prefix, the length of the suffix, the suffix itself, the length of the value, the value, and the "tombstone" marker.

Level 3:       3 ----------- 9 ----------- 21 --------- 26
Level 2:       3 ----- 6 ---- 9 ------ 19 -- 21 ---- 25 -- 26
Level 1:       3 -- 6 -- 7 -- 9 -- 12 -- 19 -- 21 -- 25 -- 26

next of head [ ->3, ->3, ->3 ]
next of Element 3 [ ->6, ->6, ->9 ]
next of Element 6 [ ->7, ->9 ]

Finally, we compress the data using s2.

S2 is a high-performance extension of the Snappy compression algorithm.

func (s *SkipList) Set(entry types.Entry)
  • Decode: Decodes binary data into a Data Block.
curr := s.head
update := make([]*Element, s.maxLevel)

for i := s.maxLevel - 1; i >= 0; i-- {
    for curr.next[i] != nil && curr.next[i].Key < entry.Key {
        curr = curr.next[i]
    }
    update[i] = curr
}

During decoding, the process is simply reversed. The full key-value pair is reconstructed using the prefix and suffix.

// update entry
if curr.next[0] != nil && curr.next[0].Key == entry.Key {
    s.size += len(entry.Value) - len(curr.next[0].Value)

    // update value and tombstone
    curr.next[0].Value = entry.Value
    curr.next[0].Tombstone = entry.Tombstone
    return
}
  • Search: Locates key-value pairs using binary search.
// add entry
level := s.randomLevel()

if level > s.level {
    for i := s.level; i < level; i++ {
        update[i] = s.head
    }
    s.level = level
}

Index Block

The structure of the Index Block is defined as follows. It stores the first and last key of each Data Block, along with the BlockHandle of the corresponding Data Block.

e := &Element{
    Entry: types.Entry{
        Key:       entry.Key,
        Value:     entry.Value,
        Tombstone: entry.Tombstone,
    },
    next: make([]*Element, level),
}

for i := range level {
    e.next[i] = update[i].next[i]
    update[i].next[i] = e
}
s.size += len(entry.Key) + len(entry.Value) + int(unsafe.Sizeof(entry.Tombstone)) + len(e.next)*int(unsafe.Sizeof((*Element)(nil)))

Similarly, the Index Block implements three primary methods: Encode, Decode, and Search. The implementation ideas for the Encode and Decode methods are essentially the same, so we will focus on the Search method.

The Search method of the Data Block is designed to locate a specific key-value pair within the ordered key-value sequence stored in a single Data Block. In contrast, the Search method of the Index Block is used to locate the Data Block containing the given key within the entire SSTable.

func (s *SkipList) Get(key types.Key) (types.Entry, bool) {
    curr := s.head

    for i := s.maxLevel - 1; i >= 0; i-- {
        for curr.next[i] != nil && curr.next[i].Key < key {
            curr = curr.next[i]
        }
    }

    curr = curr.next[0]

    if curr != nil && curr.Key == key {
        return types.Entry{
            Key:       curr.Key,
            Value:     curr.Value,
            Tombstone: curr.Tombstone,
        }, true
    }
    return types.Entry{}, false
}

Meta Block And Footer

func (s *SkipList) All() []types.Entry {
    var all []types.Entry

    for curr := s.head.next[0]; curr != nil; curr = curr.next[0] {
        all = append(all, types.Entry{
            Key:       curr.Key,
            Value:     curr.Value,
            Tombstone: curr.Tombstone,
        })
    }

    return all
}

The implementations of these two Blocks are quite straightforward, with both requiring only the Encode and Decode methods.

Build

After introducing all the Blocks in our SSTable, constructing an SSTable simply involves building each Block step by step based on the key-value pairs. Finally, the in-memory index and the encoded SSTable are returned.

type SkipList struct {
    maxLevel int
    p        float64
    level    int
    rand     *rand.Rand
    size     int
    head     *Element
}

K-Way Merge

The complete implementation of K-Way Merge is available at https://github.com/B1NARY-GR0UP/originium/tree/main/pkg/kway.

In the conceptual section, we illustrate the process of compressing and merging multiple SSTables through diagrams. This process is accomplished using the k-way merge algorithm.

The k-way merge algorithm is a method to merge k sorted sequences into a single sorted sequence, with a time complexity of O(knlogk).

One implementation of this algorithm uses a min-heap as an auxiliary structure:

  • Insert the first element of each sequence into the heap.
  • Pop the smallest value from the heap and add it to the result set. If the popped element's sequence still has more elements, insert the next element from that sequence into the heap.
  • Repeat this process until all elements from all sequences are merged.

Heap

The standard library provides a heap implementation in container/heap. By implementing the heap.Interface, we can build a min-heap.

  • First, define the basic structure of the min-heap. A slice is used to store the elements. Each element includes not only an Entry but also an LI to indicate which sorted sequence this element originates from.
type Element struct {
    types.Entry
    next []*Element
}

// https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/types/entry.go
type Entry struct {
    Key       string
    Value     []byte
    Tombstone bool  
}
  • Implement the sort.Interface methods to sort elements in the heap. Special attention is needed for the Less method: by comparing the LI of the elements, we ensure that when elements have the same key, those from earlier sequences are ordered first. This facilitates deduplication when merging elements into the result set. This requirement also means that the sorted sequences should be arranged in order from oldest to newest when using the k-way merge algorithm.
Level 3:       3 ----------- 9 ----------- 21 --------- 26
Level 2:       3 ----- 6 ---- 9 ------ 19 -- 21 ---- 25 -- 26
Level 1:       3 -- 6 -- 7 -- 9 -- 12 -- 19 -- 21 -- 25 -- 26

next of head [ ->3, ->3, ->3 ]
next of Element 3 [ ->6, ->6, ->9 ]
next of Element 6 [ ->7, ->9 ]
  • Finally, implement the Push and Pop methods. Push appends an element to the end of the slice, while Pop removes the last element from the slice.
func (s *SkipList) Set(entry types.Entry)

Merge

The function definition of the Merge method:

curr := s.head
update := make([]*Element, s.maxLevel)

for i := s.maxLevel - 1; i >= 0; i-- {
    for curr.next[i] != nil && curr.next[i].Key < entry.Key {
        curr = curr.next[i]
    }
    update[i] = curr
}

Follows the k-way merge algorithm process.

  • First, insert the first element of each sorted sequence into the min-heap.
type SkipList struct {
    maxLevel int
    p        float64
    level    int
    rand     *rand.Rand
    size     int
    head     *Element
}
  • Iteratively pop an element from the min-heap and add it to the result queue. If the popped element’s sequence still has more elements, insert the next element from that sequence into the heap. Here, a map is used instead of a result sequence. The map automatically handles deduplication, with newer keys always overwriting older ones.
type Element struct {
    types.Entry
    next []*Element
}

// https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/types/entry.go
type Entry struct {
    Key       string
    Value     []byte
    Tombstone bool  
}

Finally, traverse the map to add elements to the result queue, removing any key-value pairs marked as "tombstones." Since the map is unordered, the result queue needs to be sorted before being returned.

Level 3:       3 ----------- 9 ----------- 21 --------- 26
Level 2:       3 ----- 6 ---- 9 ------ 19 -- 21 ---- 25 -- 26
Level 1:       3 -- 6 -- 7 -- 9 -- 12 -- 19 -- 21 -- 25 -- 26

next of head [ ->3, ->3, ->3 ]
next of Element 3 [ ->6, ->6, ->9 ]
next of Element 6 [ ->7, ->9 ]

Bloom Filter

The complete implementation of the Bloom Filter can be found at https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/filter/filter.go.

A Bloom Filter is a data structure that efficiently checks whether an element is a member of a set.

  • It uses a bit array and multiple hash functions.
  • When adding an element, the element is hashed using multiple hash functions, which map it to different positions in the bit array, setting those positions to 1.
  • During a query, if all positions mapped by the hash functions are 1, the element might exist. If any position is 0, the element definitely does not exist.

The time complexity for both insertion and query operations is O(k), where k is the number of hash functions. False positives may occur (the Bloom Filter predicts that an element is in the set, but it is not), but false negatives cannot occur (the Bloom Filter predicts that an element is not in the set, but it actually is).

True Positive (TP): The system predicts an event as "positive," and it is indeed positive.
False Positive (FP): The system predicts an event as "positive," but it is actually negative.
True Negative (TN): The system predicts an event as "negative," and it is indeed negative.
False Negative (FN): The system predicts an event as "negative," but it is actually positive.

Core Struct

The core structure of a Bloom Filter includes the bit array (which can be optimized to use uint8) and multiple hash functions.

func (s *SkipList) Set(entry types.Entry)

New

The method to create a Bloom Filter instance accepts two parameters: n (the expected number of elements) and p (the desired false positive rate).

First, the parameters are validated. Then, the size of the bit array (m) and the number of hash functions (k) are calculated using specific formulas. Finally, the bit array and hash functions are initialized based on m and k.

type SkipList struct {
    maxLevel int
    p        float64
    level    int
    rand     *rand.Rand
    size     int
    head     *Element
}

Add

When adding an element, all hash functions are used to compute hash values for the key. These values are then mapped to indices in the bit array, and the corresponding positions are set to true.

type Element struct {
    types.Entry
    next []*Element
}

// https://github.com/B1NARY-GR0UP/originium/blob/main/pkg/types/entry.go
type Entry struct {
    Key       string
    Value     []byte
    Tombstone bool  
}

Contains

To check if a key is in the set, the hash functions compute hash values and map them to indices in the bit array. If any of these positions is not true, the element is not in the set, and false is returned.

Level 3:       3 ----------- 9 ----------- 21 --------- 26
Level 2:       3 ----- 6 ---- 9 ------ 19 -- 21 ---- 25 -- 26
Level 1:       3 -- 6 -- 7 -- 9 -- 12 -- 19 -- 21 -- 25 -- 26

next of head [ ->3, ->3, ->3 ]
next of Element 3 [ ->6, ->6, ->9 ]
next of Element 6 [ ->7, ->9 ]

Leveled Compaction

The complete implementation of Leveled Compaction can be found at https://github.com/B1NARY-GR0UP/originium/blob/main/level.go.

After implementing components like K-Way Merge and Bloom Filter, we can complete the final part of our implementation, the most crucial SSTable compaction process in the LSM-Tree storage engine. This implementation follows the Leveled Compaction Strategy described in the "Data Compaction" section.

In the Leveled Compaction Strategy, SSTables are organized into multiple levels (Level 0 - Level N). We need a structure to store this information and manage SSTables across different levels.

Thus, we implemented a structure called levelManager. We use a []*list.List to store SSTable information for each level, where the index of the slice corresponds to the level. Each element in the slice is a list.List, a doubly-linked list that holds information about all SSTables in a specific level.

func (s *SkipList) Set(entry types.Entry)

compactLN

The compactLN method is responsible for Level N to Level N 1 (N > 0) compaction. It selects the oldest SSTable from LN and all SSTables from LN 1 that have overlapping key ranges with this SSTable.

curr := s.head
update := make([]*Element, s.maxLevel)

for i := s.maxLevel - 1; i >= 0; i-- {
    for curr.next[i] != nil && curr.next[i].Key < entry.Key {
        curr = curr.next[i]
    }
    update[i] = curr
}

The selected SSTables are processed in order from oldest to newest. Key-value pairs from their data blocks are added to a two-dimensional slice and then merged using the K-Way Merge algorithm.

// update entry
if curr.next[0] != nil && curr.next[0].Key == entry.Key {
    s.size += len(entry.Value) - len(curr.next[0].Value)

    // update value and tombstone
    curr.next[0].Value = entry.Value
    curr.next[0].Tombstone = entry.Tombstone
    return
}

With the merged key-value pairs, we construct a new Bloom Filter and SSTable. The relevant information about the new SSTable is appended to the end of Level N 1.

// add entry
level := s.randomLevel()

if level > s.level {
    for i := s.level; i < level; i++ {
        update[i] = s.head
    }
    s.level = level
}

Finally, the old SSTables are deleted, and the newly constructed SSTable is written to disk.

e := &Element{
    Entry: types.Entry{
        Key:       entry.Key,
        Value:     entry.Value,
        Tombstone: entry.Tombstone,
    },
    next: make([]*Element, level),
}

for i := range level {
    e.next[i] = update[i].next[i]
    update[i].next[i] = e
}
s.size += len(entry.Key) + len(entry.Value) + int(unsafe.Sizeof(entry.Tombstone)) + len(e.next)*int(unsafe.Sizeof((*Element)(nil)))

The compactL0 method handles Level 0 to Level 1 compaction. Unlike compactLN, it selects not only one SSTable from L0 but also all overlapping SSTables in L0. The rest of the process is identical to compactLN.

search

The search method locates the corresponding key-value pairs across all SSTables. It starts from L0 and iterates through each level up to LN. By leveraging the Bloom Filter and the ordered structure of SSTables, SSTables that do not contain the desired key-value pairs can be skipped efficiently.

type SkipList struct {
    maxLevel int
    p        float64
    level    int
    rand     *rand.Rand
    size     int
    head     *Element
}

DB

With this, we have implemented all core components of the LSM-Tree-based storage engine. By assembling these components as described in the LSM-Tree introduction, we can finalize the database interface.

  • Complete code: https://github.com/B1NARY-GR0UP/originium/blob/main/db.go

  • Documentation: https://github.com/B1NARY-GR0UP/originium?tab=readme-ov-file#usage

Summary

We began by understanding LSM-Tree, familiarizing ourselves with its core components and the process of handling client requests. Ultimately, we built our own LSM-Tree storage engine from scratch.

Of course, this implementation is just a prototype. A production-grade storage engine requires consideration of many more details. ORIGINIUM will continue to receive optimizations and improvements in the future. Hope this article and ORIGINIUM help deepen your understanding of LSM-Tree.

That concludes everything covered in this article. If there are any errors or questions, feel free to reach out via private message or leave a comment. Thank you!

Reference

  • https://github.com/B1NARY-GR0UP/originium
  • https://www.cnblogs.com/whuanle/p/16297025.html
  • https://vonng.gitbook.io/vonng/part-i/ch3#sstables-he-lsm-shu
  • https://github.com/google/leveldb/blob/main/doc/table_format.md

The above is the detailed content of Building an LSM-Tree Storage Engine from Scratch. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn