Understanding & Designing Modern Storage Systems - M5: Processing Inside NAND Flash Memory
Why It Matters
FlashCosmos cuts data‑movement costs and improves energy efficiency, allowing data‑center operators to run large‑scale analytics directly in storage without sacrificing reliability.
Key Takeaways
- Multi‑wordline sensing enables bulk bitwise operations with a single read.
- FlashCosmos improves performance and energy efficiency over prior in‑flash processing (IFP) designs.
- Enhanced SLC programming increases the voltage margin for reliable computation.
- Validated on 160 real 3D NAND chips and evaluated in simulation on three real‑world workloads.
- Reduces the data‑movement bottleneck between storage and compute units.
Summary
The video introduces FlashCosmos, an in‑flash processing technique that performs bulk bitwise operations directly inside NAND flash memory. Presented in a MICRO 2022 paper, the work targets the growing data‑movement bottleneck that hampers databases, graph analytics, cryptography, and other data‑intensive workloads.
Conventional systems move data from storage to CPUs or GPUs and are limited by the SSD's external bandwidth (~8 GB/s over PCIe Gen4), while near‑data and in‑storage processing are still constrained by internal flash‑channel bandwidth (~9.6 GB/s) and by serial sensing of operands. FlashCosmos replaces these serial reads with multi‑wordline sensing (MWS), which activates several word lines simultaneously so that a single sensing operation computes a bitwise AND or OR across the operands. An enhanced SLC programming mode widens the voltage margin between the erased and programmed states, improving computational reliability.
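To make the single‑read idea concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a simplified model in which an erased cell stores 1 and conducts, cells on one NAND string sit in series (so sensing several word lines of one block behaves like AND), and strings from different blocks share a bit line in parallel (so sensing across blocks behaves like OR). The names `mws_and`, `mws_or`, and `serial_and` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# One page = one bit per bit line. Assumed encoding: 1 = erased (conducts), 0 = programmed.
Page = List[int]


def mws_and(pages: List[Page]) -> Page:
    """Model one sensing of several word lines in the same block.

    Series-connected cells: a bit line conducts only if every selected cell conducts.
    """
    return [int(all(bits)) for bits in zip(*pages)]


def mws_or(pages: List[Page]) -> Page:
    """Model one sensing of word lines in different blocks sharing the bit lines.

    Parallel-connected strings: a bit line conducts if any selected cell conducts.
    """
    return [int(any(bits)) for bits in zip(*pages)]


def serial_and(pages: List[Page]) -> Tuple[Optional[Page], int]:
    """Baseline for comparison: sense one operand at a time and combine after each read."""
    reads = 0
    acc: Optional[Page] = None
    for page in pages:
        reads += 1  # one sensing operation per operand
        acc = page if acc is None else [a & b for a, b in zip(acc, page)]
    return acc, reads


operands = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]

and_result = mws_and(operands)            # a single modeled sensing
or_result = mws_or(operands)              # a single modeled sensing
serial_result, serial_reads = serial_and(operands)  # three modeled sensings
```

In this toy model, both approaches produce the same AND result, but the serial baseline needs one sensing per operand while the multi‑wordline version needs only one in total. Real chips add constraints that the sketch ignores, such as sensing noise, which is where the wider voltage margin of enhanced SLC programming comes in.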
The authors demonstrate the concept on 160 real 3D‑NAND chips and run system‑level simulations with a state‑of‑the‑art SSD simulator on three real‑world workloads. Results show speedups of up to 2‑3× and energy reductions of similar magnitude over the best prior in‑flash processing designs, while maintaining low raw bit‑error rates thanks to the larger voltage margin.
By eliminating repeated reads and reducing data transfers, FlashCosmos promises to reshape storage‑centric architectures, enabling faster, greener processing for workloads that exceed DRAM capacity. Its reliability improvements also make in‑flash compute viable for production environments, potentially accelerating the adoption of computational storage solutions.