Publications
Building brick by brick
2026
- [In Review] Mind the Buffer: Dissecting the LSM-Buffer Design Space. Boao Chen, Shubham Kaushik, and Subhadeep Sarkar. 2026.
Log-structured merge (LSM) trees are widely used as the storage-layer data structure in modern NoSQL key-value stores due to their superior ingestion throughput, competitive query performance, and efficient space utilization. To enable fast ingestion, LSM-based storage engines first batch the incoming data in an in-memory buffer and then opportunistically write the data to slower secondary storage as a collection of immutable sorted runs. We point out that while the data is largely storage-resident, the overall performance of an LSM-based storage engine is critically bottlenecked by the (i) implementation, (ii) tuning, and (iii) size of the in-memory buffer. In fact, even with the same buffer configuration, the performance of an LSM-engine may vary by several orders of magnitude if there is a shift in workload. Choosing the appropriate buffer design is thus crucial for performance but also hard, as the LSM-buffer design space is vast and largely unexplored. In this paper, we evaluate and analyze the performance of LSM-engines with nine different buffer implementations (including three hash-hybrid designs) while varying the tuning and size of each implementation to understand their implications on performance and the associated tradeoffs. For each configuration, we further vary the workload composition and distribution, the LSM-tuning, and the underlying hardware to quantify their impact on performance. To the best of our knowledge, this is the first comprehensive analysis of the LSM-buffer design space. Finally, for practitioners, we present a handbook with 10 key guidelines for choosing the appropriate buffer design and tuning for a given workload and performance target.
- [VLDB] TexBench: A Unified Benchmarking Suite for Shifting Workloads. Abhishek Chanda*, Shubham Kaushik*, Artem Lavrov, and 1 more author. 52nd International Conference on Very Large Data Bases, Sep 2026. (* Equal contribution)
Key-value stores are widely adopted as the storage engine for modern applications as they offer high throughput for writes, support for heterogeneous workloads, and easy tunability. Given the large number of key-value stores available and how widely their performance varies with workload characteristics, finding a suitable data store and tuning for a specific workload and performance target often entails extensive benchmarking and analysis. State-of-the-art key-value benchmarks, such as YCSB, db_bench, and KVBench, however, are unable to capture several key characteristics of modern application workloads, such as dynamically shifting workload characteristics, data with varied degrees of sortedness, or application-specific data formats. Further, existing benchmarking tools do not provide a unified interface to benchmark and compare multiple databases against the same workload. We present TexBench, a unified key-value benchmarking suite that enables benchmarking key-value stores against dynamically shifting and production-like workloads and comparing their performance side by side. TexBench is built on top of Tectonic, a highly configurable, Rust-based key-value workload generator that generates multi-phased shifting workloads and supports a rich set of operations and operation-specific distributions, variable data sortedness, and custom data formats. TexBench’s unified framework also enables its users to perform an apples-to-apples comparison of multiple databases against the same workload and to compare the benchmarking results readily within a single interface. Lastly, we augment TexBench with an LLM core that allows users to describe a workload in natural language, have that translated into a custom key-value workload, and benchmark and compare multiple databases against it in parallel.
- [ICDE] RangeReduce: Query-Driven LSM Compactions. Shubham Kaushik, Manos Athanassoulis, and Subhadeep Sarkar. 42nd IEEE International Conference on Data Engineering, May 2026.
Log-structured merge (LSM) trees are widely used in the storage layer of modern ingestion-optimized data stores. The high ingestion throughput, however, comes at the cost of sub-optimal range query (RQ) performance. This is because LSM-trees arrange the data as a hierarchical collection of sorted runs, which implies that every RQ must (i) probe all sorted runs to locate the qualifying entries, (ii) scan and merge the entries from all qualifying runs, (iii) filter out, on the fly, the entries logically invalidated by updates and deletes, and (iv) return the most recent version of each qualifying key. This leads to high read amplification and significant redundant work in terms of superfluous I/Os to storage and wasted CPU cycles, which is exacerbated in the presence of updates and deletes. Additionally, during compactions, the same data is read and written multiple times, further increasing read and write amplification. In this paper, we introduce RangeReduce, an RQ-optimized LSM-engine that uses RQs as a hint to compact data that is already read into memory as part of the query, and thereby improves the overall performance of the storage engine.
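The four range-query steps listed in the RangeReduce abstract can be illustrated with a minimal Python sketch. It is not the RangeReduce implementation: the `range_query` function, the list-based runs, the newest-first run order, and the use of `None` as a delete marker are all assumptions made for illustration, and each run is assumed to hold at most one entry per key.

```python
import heapq
from typing import Iterator, Optional

# Hypothetical in-memory stand-in for sorted runs: each run is a list of
# (key, value) pairs sorted by key; a value of None marks a delete.
Run = list[tuple[str, Optional[str]]]


def range_query(runs: list[Run], lo: str, hi: str) -> Iterator[tuple[str, str]]:
    """Yield live (key, value) pairs with lo <= key <= hi.

    Assumption: runs[0] is the newest run; higher indices are older.
    """
    # (i) probe every run and keep only the qualifying entries, tagging
    # each entry with the age of its run so newer versions sort first.
    streams = [
        [(key, age, value) for key, value in run if lo <= key <= hi]
        for age, run in enumerate(runs)
    ]
    last_key = None
    # (ii) merge the qualifying entries from all runs in (key, age) order.
    for key, _age, value in heapq.merge(*streams, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # (iv) an older version of a key that was already handled
        last_key = key
        if value is not None:  # (iii) skip keys whose newest version is a delete
            yield key, value
```

Every run is touched and every stale version is read before being discarded, which is the redundant work the abstract refers to.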
2025
- [TPCTC] Tectonic: Bridging Synthetic and Real-World Workloads for Key-Value Benchmarking. Alexander H. Ott*, Shubham Kaushik*, Boao Chen, and 1 more author. 17th TPC Technology Conference on Performance Evaluation & Benchmarking, Sep 2025. (* Equal contribution)
Key-value stores are the backbone of many modern SQL- and NoSQL-based data systems, serving a variety of real-world applications. Despite their widespread adoption, existing key-value benchmarks fall short across multiple dimensions in accurately replicating complex and dynamic real-world workloads. In this paper, we introduce Tectonic, a Rust-based, highly configurable, and resource-efficient key-value workload generator designed to model the temporal, structural, and dynamic properties of real-world workloads. Tectonic offers fine-grained control over data access patterns, configurable composite key generation, dynamic workload generation, and generation of workloads with user-specified data sortedness, at 2x higher throughput and up to 84% lower memory footprint than the state of the art.
2024
- [DBTest] Anatomy of LSM Memory Buffer: Insights & Implications. Shubham Kaushik and Subhadeep Sarkar. Proceedings of the Tenth International Workshop on Testing Database Systems, Jun 2024.
The log-structured merge (LSM) tree is an ingestion-optimized data structure that is widely used in modern NoSQL key-value stores. To support high throughput for writes, LSM-trees maintain an in-memory buffer that absorbs the incoming entries before writing them to slower secondary storage. We point out that the choice of the data structure and implementation of the memory buffer has a significant impact on the overall performance of LSM-based storage engines. In fact, even with the same implementation of the buffer, the performance of a storage engine can vary by several orders of magnitude if there is a shift in the input workload.
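The write path described in this abstract, where a buffer absorbs entries and is later written out in sorted form, can be sketched as a toy model. This is only an illustration under stated assumptions: `BufferedStore`, its `capacity` parameter, and the in-memory `runs` list standing in for secondary storage are hypothetical and do not come from the paper or from any real engine.

```python
from typing import Optional


class BufferedStore:
    """Toy LSM write path: a hash-map buffer flushed as sorted runs."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity  # hypothetical knob: entries held before a flush
        self.buffer: dict[str, str] = {}  # one possible buffer implementation
        # Newest run first; this list stands in for slower secondary storage.
        self.runs: list[list[tuple[str, str]]] = []

    def put(self, key: str, value: str) -> None:
        self.buffer[key] = value  # updates are absorbed in place in the buffer
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            # A hash-map buffer has to sort at flush time; a structure that
            # keeps entries ordered, such as a skip list, pays on insert instead.
            self.runs.insert(0, sorted(self.buffer.items()))
            self.buffer = {}

    def get(self, key: str) -> Optional[str]:
        if key in self.buffer:
            return self.buffer[key]
        for run in self.runs:  # search from the newest run to the oldest
            for k, v in run:
                if k == key:
                    return v
        return None
```

Swapping the dictionary for a different structure moves cost between inserts, lookups, and flushes, which is the kind of tradeoff the abstract attributes to the buffer implementation.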
2019
- [JCSE] Fault Modelling of an Object-Oriented System using CPN. Shubham Kaushik and Ratneshwer. International Journal of Computer Sciences and Engineering, May 2019.
Object-oriented development is an approach in which objects provide services to other objects through mechanisms such as inheritance and polymorphism. Faults in object-oriented software may occur at two levels: the object level and the interaction level. In this paper, we attempt to model several faults in an object-oriented system using Colored Petri Nets (CPN).