FAST '26 - Rearchitecting Buffered I/O in the Era of High-Bandwidth SSDs

USENIX Association, Apr 7, 2026

Why It Matters

By eliminating page‑cache bottlenecks, WSBuffer enables enterprise storage to fully exploit high‑bandwidth SSDs, delivering faster, more cost‑effective data services.

Key Takeaways

  • SSD bandwidth grew 56×, challenging traditional buffered I/O.
  • Page cache management overhead limits buffered I/O throughput.
  • Partial-page writes incur 1.15–64× higher latency than full pages.
  • The WSBuffer redesign cuts memory use and eliminates read‑before‑write overhead.
  • WSBuffer delivers up to 6.3× latency improvement and 4.5× throughput gains.

Summary

The presentation, delivered by Qiang Cao of Huazhong University of Science and Technology, tackles the growing mismatch between buffered I/O architectures and today’s ultra‑high‑bandwidth SSDs. Over the past 15 years, SSD throughput has leapt from roughly 500 MB/s to 28 GB/s—a 56‑fold increase—rendering the legacy page‑cache‑centric buffered I/O model increasingly inefficient.

The authors identify three core bottlenecks: costly page‑cache allocation and state maintenance, excessive memory consumption to sustain write throughput, and the read‑before‑write penalty that inflates latency for partial‑page writes by up to 64×. Experiments comparing buffered I/O, direct I/O, and hybrid schemes on eight PCIe 4.0 Samsung drives show that conventional buffered I/O lags direct I/O by 1.1–4.5×, especially under write‑intensive workloads.
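The read‑before‑write penalty follows directly from page‑alignment arithmetic: any page only partially covered by a write must first be read back from the SSD so the new bytes can be merged into it. The sketch below (our own illustration, not code from the talk) counts how many pages of a given write incur that extra read, assuming the usual 4 KiB Linux page size:

```python
PAGE_SIZE = 4096  # typical Linux page size

def partial_pages(offset: int, length: int) -> int:
    """Count pages touched by the write [offset, offset+length)
    that are only partially covered, i.e. pages the page cache
    must read from the SSD before merging the new data."""
    if length == 0:
        return 0
    end = offset + length
    first = offset // PAGE_SIZE
    last = (end - 1) // PAGE_SIZE
    partials = set()
    if offset % PAGE_SIZE != 0:   # unaligned head
        partials.add(first)
    if end % PAGE_SIZE != 0:      # unaligned tail
        partials.add(last)
    return len(partials)

# A page-aligned 4 KiB write needs no read-before-write,
# while the same write shifted by 100 bytes forces two page reads.
print(partial_pages(0, 4096))    # aligned full page
print(partial_pages(100, 4096))  # straddles two pages, both partial
```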

To address these issues, the team proposes WSBuffer, a rearchitected buffered I/O layer that introduces a lightweight scrap buffer, an opportunistic two‑stage dirty‑data flushing mechanism, and concurrent page management. Benchmarks on XFS/Linux 6.8 reveal up to 6.3× lower write latency for full‑page writes and up to 2.18× improvement for partial writes, with memory usage cut by as much as 99.6% and CPU utilization reduced by 28.4%.

The findings suggest that storage stacks can retain the usability of buffered I/O while unlocking the full potential of modern SSDs. Data‑center operators and file‑system developers are urged to consider WSBuffer‑style designs to achieve higher throughput, lower latency, and better resource efficiency as SSD bandwidth continues to close the gap with DRAM.

Original Description

Rearchitecting Buffered I/O in the Era of High-Bandwidth SSDs
Yekang Zhan, Tianze Wang, Zheng Peng, Haichuan Hu, Jiahao Wu, Xiangrui Yang, and Qiang Cao, Huazhong University of Science and Technology; Hong Jiang, University of Texas at Arlington; Jie Yao, Huazhong University of Science and Technology
Buffered I/O via page cache has been prevalently used by applications for decades due to its user-friendliness and high performance. However, the existing buffered I/O architecture fails to effectively utilize high-bandwidth Solid-State Drives (SSDs) caused by 1) costly page caching overused for buffering all incoming writes in the critical path, 2) the limited concurrency of page management, and 3) the high read-before-write penalty for partial-page writes.
This paper rearchitects buffered I/O and proposes a write-scrap buffering approach (WSBuffer) to remove the aforementioned shackles of buffered I/O on writes to proactively exploit fast SSDs while retaining all the advantages of buffered I/O on reads. WSBuffer first presents a novel memory-page buffering structure, scrap buffer, to efficiently buffer SSD-I/O unfriendly writes and expensive partial-page writes. WSBuffer further proposes a buffer-minimized data access mechanism to partially buffer small and unaligned parts of user writes via the scrap buffer while directly sending large and aligned parts to underlying SSDs. Finally, WSBuffer devises an opportunistic two-stage dirty-data flushing mechanism and a concurrent page management mechanism to achieve fluent and fast dirty-data flushing. The experimental results show that WSBuffer outperforms Linux file systems of EXT4, F2FS, BTRFS and XFS, as well as the state-of-the-art buffered I/O optimization of ScaleCache by up to 3.91X and 82.80X in throughput and latency respectively.
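The buffer‑minimized data access mechanism described above splits each user write into small unaligned parts, which go through the scrap buffer, and large aligned parts, which bypass the page cache and go straight to the SSD. The sketch below illustrates that decomposition under our own naming; it is not the authors' code, only an aligned/unaligned split under the assumption of a 4 KiB page size:

```python
PAGE_SIZE = 4096  # assumed page size

def split_write(offset: int, length: int):
    """Decompose the write [offset, offset+length) into
    (head, middle, tail) as (offset, length) pairs:
    - head: unaligned leading bytes -> scrap buffer
    - middle: page-aligned bulk     -> sent directly to the SSD
    - tail: unaligned trailing bytes -> scrap buffer
    Empty segments have length 0."""
    end = offset + length
    # first page boundary at or after offset (clamped to the write)
    mid_start = min(-(-offset // PAGE_SIZE) * PAGE_SIZE, end)
    # last page boundary at or before end (never before mid_start)
    mid_end = max((end // PAGE_SIZE) * PAGE_SIZE, mid_start)
    head = (offset, mid_start - offset)
    middle = (mid_start, mid_end - mid_start)
    tail = (mid_end, end - mid_end)
    return head, middle, tail

# An 8 KiB write shifted by 100 bytes: only ~4 KB stays buffered,
# the aligned 4 KiB middle goes directly to the SSD.
print(split_write(100, 8192))
```

With a fully aligned write, head and tail are empty and everything bypasses the cache, which is how this design avoids buffering all incoming writes on the critical path.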
