Mind the Buffer

LSM-trees power the storage layer of virtually every major NoSQL key-value store: RocksDB at Meta, LevelDB at Google, WiredTiger at MongoDB, and Cassandra and HBase at Apache. Every operation, read or write, passes through the in-memory buffer before touching disk. Although the buffer is this critical, the LSM-buffer design space is vast and largely unexplored. Production systems offer users a handful of buffer choices with no systematic guidance on when each one is appropriate.

This work is an extension of our earlier DBTest 2024 workshop paper that first benchmarked four buffer implementations across three operation types. The extended study dissects the entire design space — nine buffer implementations, 1,200+ experiments, and a practitioner handbook.

The Hidden Bottleneck

The in-memory buffer is the entry point for every operation in an LSM-engine. Its choice of data structure, access method, and tuning directly determines:

Ingestion throughput: how fast data can be written before a flush to storage
Query performance: the cost of point and range lookups against buffered, uncompacted data
Flush frequency: how often data moves to storage, which drives compaction overhead
Tail latency: write stalls caused by compaction debt and buffer resizing

Even with the same buffer size, switching from one implementation to another can change performance by several orders of magnitude under the same workload. No single buffer wins across all conditions.

Nine Buffer Implementations

We evaluate four existing implementations from commercial LSM-engines and introduce five new ones:

Vector variants (lowest memory overhead, best for writes):

V-Qsort: appends on insert, sorts a snapshot on each query (RocksDB's vector buffer)
V-Qscan (new): appends on insert, linear backward scan on point queries
V-Sorted (new): maintains sorted order on insert; binary search for queries; no sort at query time
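
To make the difference between the three vector variants concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the implementation evaluated in the study or the one in RocksDB: the class names VQsort, VQscan, and VSorted and their put/get methods are hypothetical, keys are plain comparable values, and deletes, sequence numbers, and concurrency are ignored.

```python
from bisect import bisect_left, bisect_right
from operator import itemgetter


class VQsort:
    """Append on insert; sort a snapshot of the vector on every query."""

    def __init__(self):
        self.entries = []  # (key, value) pairs in arrival order

    def put(self, key, value):
        self.entries.append((key, value))  # amortized O(1)

    def get(self, key):
        # Stable sort on the key only, so the newest duplicate stays last.
        snapshot = sorted(self.entries, key=itemgetter(0))  # O(n log n) per query
        keys = [k for k, _ in snapshot]
        i = bisect_right(keys, key) - 1
        return snapshot[i][1] if i >= 0 and keys[i] == key else None


class VQscan:
    """Append on insert; scan backward so the newest match is found first."""

    def __init__(self):
        self.entries = []

    def put(self, key, value):
        self.entries.append((key, value))  # amortized O(1)

    def get(self, key):
        for k, v in reversed(self.entries):  # O(n), but no sort and no copy
            if k == key:
                return v
        return None


class VSorted:
    """Keep the vector sorted on insert; binary search on query."""

    def __init__(self):
        self.keys = []
        self.values = []

    def put(self, key, value):
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value  # overwrite in place
        else:
            self.keys.insert(i, key)  # O(n) shift of later elements
            self.values.insert(i, value)

    def get(self, key):
        i = bisect_left(self.keys, key)  # O(log n)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None
```

All three return the most recent value for a key. The sketch only shows where each variant pays its cost: V-Qsort on every query, V-Qscan on every point lookup, and V-Sorted on every insert.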

Node-based:

Link-L: doubly linked sorted list; high random access cost, rarely used standalone
Skip-L: classical probabilistic skip-list; logarithmic reads and writes
InSkip-L: RocksDB’s default; inline key storage with splice caching for temporal locality
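
For readers unfamiliar with the structure behind Skip-L, the following is a minimal sketch of a classical probabilistic skip-list in Python. It is a textbook-style illustration, not the code used in the study: the names SkipList, MAX_LEVEL, and P are hypothetical, and the inline key storage and splice caching that distinguish InSkip-L are not modeled.

```python
import random


class _Node:
    __slots__ = ("key", "value", "forward")

    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level  # one next-pointer per level


class SkipList:
    MAX_LEVEL = 12  # assumed cap on tower height
    P = 0.5         # assumed promotion probability

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._head = _Node(None, None, self.MAX_LEVEL)
        self._level = 1  # number of levels currently in use

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < self.P:
            level += 1
        return level

    def _find_predecessors(self, key):
        # update[i] is the last node at level i whose key is < key.
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for i in reversed(range(self._level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def put(self, key, value):
        update = self._find_predecessors(key)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value  # overwrite existing key
            return
        level = self._random_level()
        self._level = max(self._level, level)
        node = _Node(key, value, level)
        for i in range(level):  # splice the new tower into each level
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def get(self, key):
        candidate = self._find_predecessors(key)[0].forward[0]
        if candidate is not None and candidate.key == key:
            return candidate.value
        return None

    def items(self):
        # Level 0 links every node in sorted key order.
        node = self._head.forward[0]
        while node is not None:
            yield node.key, node.value
            node = node.forward[0]
```

Both put and get descend from the top level, which gives the expected logarithmic cost. The bottom level doubles as a sorted linked list, so range scans and flushes can iterate in key order without sorting.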

Prefix-hash hybrids (fastest for point queries, high memory cost):

Hash-SL: hash buckets backed by skip-lists
Hash-LL: hash buckets backed by linked-lists (lower memory, linear search)
Hash-V (new): hash buckets backed by sorted vectors; lowest metadata overhead among hybrids
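
The hybrid idea can be sketched in a few lines of Python. The snippet below is an assumption-laden illustration of a Hash-V-style buffer, not the evaluated implementation: the class name HashV, the fixed prefix_len parameter, string keys, and the choice to keep each bucket sorted on insert are all hypothetical simplifications.

```python
from bisect import bisect_left


class HashV:
    """Prefix-hash buffer: each key prefix maps to a small sorted vector."""

    def __init__(self, prefix_len=4):
        self.prefix_len = prefix_len  # assumed fixed-length prefix
        self.buckets = {}             # prefix -> (sorted keys, values)

    def put(self, key, value):
        prefix = key[: self.prefix_len]
        keys, values = self.buckets.setdefault(prefix, ([], []))
        i = bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            values[i] = value  # overwrite existing key
        else:
            keys.insert(i, key)
            values.insert(i, value)

    def get(self, key):
        # One hash probe, then a binary search inside a single bucket.
        bucket = self.buckets.get(key[: self.prefix_len])
        if bucket is None:
            return None
        keys, values = bucket
        i = bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return values[i]
        return None

    def scan(self, lo, hi):
        # Hashing destroys global order, so in this sketch a range query
        # visits every bucket and merges the hits (lo <= key < hi).
        hits = []
        for keys, values in self.buckets.values():
            start, end = bisect_left(keys, lo), bisect_left(keys, hi)
            hits.extend(zip(keys[start:end], values[start:end]))
        hits.sort(key=lambda kv: kv[0])
        return hits
```

A point query touches exactly one bucket, which is why the hybrids do well on point lookups. The scan method shows the other side of the tradeoff: ordered access has to reassemble order across buckets.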

The Adaptive Buffer

No static buffer choice handles workload shifts well. We design an Adaptive buffer that monitors operation composition over a sliding window and switches to the best-fit implementation at each flush boundary. Across a seven-phase workload spanning insert-heavy, read-heavy, update-intensive, and scan-heavy phases, the Adaptive buffer matches the best specialist in each phase — delivering 1.14 MOpS in the write phase, switching to Hash-LL for read-heavy phases, and switching back to V-Qscan when ingestion resumes.

Figure: Adaptive buffer throughput across seven workload phases.
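
The switching logic described above can be sketched as follows. This is a simplified illustration, not the policy used in the study: the class name AdaptivePolicy, the operation labels, the window size, and the 60% dominance threshold are hypothetical, and the fallback to InSkip-L for mixed workloads is an assumption borrowed from the handbook guideline for unpredictable workloads.

```python
from collections import Counter, deque


class AdaptivePolicy:
    """Track recent operations; pick a buffer type at each flush boundary."""

    def __init__(self, window=1000, dominance=0.6):
        self.window = deque(maxlen=window)  # sliding window of operation labels
        self.dominance = dominance          # assumed share needed to specialize
        self.current = "InSkip-L"           # assumed balanced starting point

    def record(self, op):
        # op is one of: "insert", "update", "point_query", "range_query"
        self.window.append(op)

    def on_flush(self):
        # Switching only here means the active buffer is never migrated
        # mid-life; the next buffer is simply created with the new type.
        if not self.window:
            return self.current
        counts = Counter(self.window)
        total = len(self.window)
        if counts["insert"] / total >= self.dominance:
            self.current = "V-Qscan"   # ingestion-dominated window
        elif counts["point_query"] / total >= self.dominance:
            self.current = "Hash-LL"   # point-read-dominated window
        else:
            self.current = "InSkip-L"  # mixed window: stay balanced
        return self.current
```

A caller would invoke record for every operation and on_flush whenever the buffer fills. The choice between flushes stays fixed, which mirrors the flush-boundary switching described above.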

Practitioner Handbook

From 1,200+ experiments, we distill 10 concrete guidelines:

Write-heavy workloads → V-Qscan (constant-time inserts; no sort on point queries, PQ)
Memory-constrained → vector variants (15% overhead vs. 50%+ for hash-hybrids)
Fewer flushes and compactions → vector variants
Unpredictable workloads → InSkip-L (balanced across all types)
PQ-heavy with sufficient memory → hash-hybrids, prefer Hash-V
Avoid V-Qsort if any point queries exist → sorting overhead is prohibitive
Static allocation → prevents latency spikes from dynamic resizing
Set low_pri to reduce compaction debt → leaving it unset lets data accumulate in the shallower levels
Larger buffer sizes → reduce stalls for node-based and hash-hybrid structures
Shifting workloads → use the Adaptive buffer
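
The link between memory overhead and flush count in the guidelines above can be checked with a little arithmetic. The sketch below is a back-of-the-envelope model, not a result from the experiments: the function name flushes_needed, the 64 MB buffer, and the 1 GB of ingested payload are hypothetical, and we assume "overhead" means metadata as a fraction of payload size.

```python
import math


def flushes_needed(payload_mb, buffer_mb, overhead):
    """Flushes required to ingest payload_mb through a buffer of buffer_mb.

    overhead is metadata relative to payload (an assumption), so every MB
    of payload occupies (1 + overhead) MB of buffer space.
    """
    footprint_mb = payload_mb * (1 + overhead)
    return math.ceil(footprint_mb / buffer_mb)


vector_flushes = flushes_needed(1024, 64, 0.15)  # vector variants
hybrid_flushes = flushes_needed(1024, 64, 0.50)  # hash-hybrids, lower bound
print(vector_flushes, hybrid_flushes)
```

Under these assumptions the vector variant needs 19 flushes and the hash-hybrid at least 24 for the same data. Each extra flush creates another sorted run for compaction to merge later.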

Resources

📄 View the Paper
🐙 GitHub Repository

Log-structured merge (LSM)-trees are widely used as the storage layer data structure in modern NoSQL key-value stores due to their superior ingestion throughput, competitive query performance, and efficient space utilization. To enable fast ingestion, LSM-based storage engines first batch the incoming data in an in-memory buffer and then opportunistically write the data to slower secondary storage as a collection of immutable sorted runs. We point out that while the data is largely storage-resident, the overall performance of an LSM-based storage engine is critically bottlenecked by the (i) implementation, (ii) tuning, and (iii) size of the in-memory buffer. In fact, even with the same buffer configuration, the performance of an LSM-engine may vary by several orders of magnitude if there is a shift in workload. Choosing the appropriate buffer design is thus crucial for performance but also hard, as the LSM-buffer design space is vast and largely unexplored. In this paper, we evaluate and analyze the performance of LSM-engines with nine different buffer implementations, including three hash-hybrid designs, while varying the tuning and size of each implementation to understand their performance implications and the associated tradeoffs. For each configuration, we further vary the workload composition and distribution, the LSM-tuning, and the underlying hardware to quantify their impact on performance. To the best of our knowledge, this is the first comprehensive analysis of the LSM-buffer design space. Finally, for practitioners, we present a handbook with 10 key guidelines for choosing the appropriate buffer design and tuning for a given workload and performance target.

The log-structured merge (LSM) tree is an ingestion-optimized data structure that is widely used in modern NoSQL key-value stores. To support high write throughput, LSM-trees maintain an in-memory buffer that absorbs incoming entries before writing them to slower secondary storage. We point out that the choice of data structure and implementation for the memory buffer has a significant impact on the overall performance of LSM-based storage engines. In fact, even with the same buffer implementation, the performance of a storage engine can vary by up to several orders of magnitude if there is a shift in the input workload.

