Streaming Datasets: 100x More Efficient

Hugging Face • October 27, 2025

Why It Matters

This cuts data‑pipeline latency and storage costs for large‑scale model training, accelerating AI development cycles.

Key Takeaways

  • 100x fewer startup requests
  • Data file resolution 10x faster
  • Streaming speed up to 2x faster
  • Prefetching for Parquet reduces GPU idle time
  • Persistent cache eliminates request storms across workers

Pulse Analysis

Large‑scale machine‑learning projects have long wrestled with the logistics of moving terabytes of raw data into training pipelines. Traditional approaches rely on bulk downloads to local disks or on external object stores such as S3, both of which introduce latency spikes, storage overhead, and operational complexity. Hugging Face’s recent revamp of the `datasets` library tackles these pain points by re‑architecting the streaming layer. By consolidating file‑list resolution into a single persistent cache and bundling API calls, the system slashes the initial request storm that previously overwhelmed the Hub, turning a minutes‑long handshake into a near‑instant operation.

Under the hood, two performance‑focused features drive the headline numbers. First, a persistent data‑files cache shares the resolved file list across all DataLoader workers, eliminating redundant look‑ups and cutting startup requests by 100×. Second, the library now prefetches Parquet fragments while the model consumes the current chunk, keeping the I/O pipeline saturated. Users can also tune buffer block sizes and prefetch volumes, scaling request sizes from the default 32 MiB to 128 MiB for higher‑throughput networks. Coupled with Hugging Face’s Xet deduplication and Parquet Content‑Defined Chunking, these enhancements make remote streaming rival—or even surpass—local SSD read speeds on large GPU clusters.

The business implications are immediate. Faster, more reliable streaming removes the need for expensive local storage, reduces cloud egress fees, and shortens experiment turnaround from hours to minutes. Teams can now prototype on multi‑TB datasets without pre‑staging data, democratizing access to massive corpora for smaller organizations. As the AI community adopts these tools, we can expect a shift toward more agile, data‑centric development cycles, where the bottleneck moves from storage logistics to model architecture and algorithmic innovation.

Streaming datasets: 100x More Efficient

Published October 27, 2025

Authors:

  • Andres Marafioti

  • Quentin Lhoest

  • ben burtenshaw

  • Pedro Cuenca

  • merve


TLDR

We boosted load_dataset('dataset', streaming=True)—streaming datasets without downloading them—with one line of code!

Start training on multi‑TB datasets immediately, with no complex setup, no downloads, no “disk out of space” errors, and no 429 “stop requesting!” errors. It’s super fast: streaming outruns our local SSDs when training on 64 x H100s with 256 workers downloading data.

We've improved streaming to deliver 100x fewer startup requests, 10x faster data-file resolution, up to 2x more samples per second, and 0 worker crashes at 256 concurrent workers.

Loading data, especially at the terabyte scale, is a major pain in any machine‑learning workflow. We suffered this while training SmolLM3; at one point we had to wait 3 hours before each run to download enough data.

Streaming has always been possible in the datasets library, but large‑scale training with massive datasets remained a challenge. That changes today 🔥. We spent a few months improving the backend, focusing on streaming datasets to make it faster and more efficient.

What did we do exactly? ⤵️


Streaming: The Same Easy API

First things first: our changes are backwards compatible. You can still stream any dataset from the Hub with the same simple streaming=True flag. It's as easy as ever. 🚀


from datasets import load_dataset

# Stream a dataset instead of downloading it
dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)

# Get the first example
print(next(iter(dataset)))

Thousands of AI developers around the world use datasets daily; they should just get improved performance with zero extra work.


The Challenge: Streaming at Scale

Streaming was a lifesaver for quickly understanding a dataset, but for training models, people usually downloaded the data locally or used a cloud storage service such as S3. That's what we were doing to train SmolVLM; we had all of our data on S3 and streamed directly from it.

We wanted to change that, so we decided to use streaming from the Hub when we were developing nanoVLM. Soon we found a big issue: our test run generated over 100,000 requests in under a minute, which got our IP blocked by the Hub! 😅 This happened because every DataLoader worker was initializing the dataset independently. As we dug deeper, we found that this creates a storm of redundant requests, many of which are unnecessary. Our changes ultimately reduced startup requests by a factor of 100. In total, our improvements delivered:

  • Data files resolution time: 10× faster

  • Startup requests: Up to 100× more efficient

  • Streaming speed: Up to 2× faster

  • In‑flight requests: Up to 2× more efficient
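
For reference, the multi‑worker DataLoader setup described above looks roughly like the sketch below. The dataset ID reuses the earlier example, while the worker count, batch handling, and loop body are illustrative rather than taken from the post.

from torch.utils.data import DataLoader
from datasets import load_dataset

# Stream the dataset; with a torch DataLoader, each worker iterates over its own
# subset of the dataset's shards.
dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)

# batch_size=None disables torch's automatic batching, so each item is yielded
# as a plain dict; num_workers=8 is illustrative.
loader = DataLoader(dataset, batch_size=None, num_workers=8)

for example in loader:
    break  # replace with your preprocessing / training step

Setting batch_size=None keeps the sketch format-agnostic; in practice you would plug in your own batching or collate function.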


Under the Hood: What We Improved

1. Startup⚡️

The initial resolution of data files was creating a ton of requests. We made two major changes:

  1. Persistent Data Files Cache – We are now caching the list of data files across all DataLoader workers. The first worker resolves the file list from the Hub. All others read directly from this local cache, virtually eliminating startup requests and slashing resolution time. No more request storms!

  2. Optimized Resolution Logic – We also minimized the number of API calls required for that initial worker to fetch the file list. We now bundle the necessary requests as efficiently as possible, reducing latency even further.

2. Streaming 🏎️

To improve throughput during streaming itself, we've introduced two new features:

  • Prefetching for Parquet – We enabled prefetching for Parquet datasets. While your model is processing the current chunk of data, the datasets library is already fetching the next chunk in the background. This keeps the data pipeline full and ensures your GPU is never left waiting for data.

  • Configurable Buffering – Advanced users can now fine‑tune streaming performance for their specific hardware and network setup. We've exposed options to configure the buffer's block size and the prefetch volume, giving you maximum control to optimize I/O.

This is how we can increase the minimum request size when streaming from 32 MiB (default) to 128 MiB and configure prefetching:


import pyarrow
import pyarrow.dataset
from datasets import load_dataset

# Request 128 MiB ranges instead of the 32 MiB default and prefetch the next block
fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
    cache_options=pyarrow.CacheOptions(
        prefetch_limit=1,
        range_size_limit=128 << 20,
    ),
)

# parquet_dataset_id is the repo id of your Parquet dataset on the Hub
ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)

Together, these improvements can double your data throughput, allowing you to train faster and more efficiently.


How We’re Faster Than Plain S3: Xet

Hugging Face uses Xet, a deduplication-based storage backend that enables fast, deduplicated uploads and downloads. Unlike traditional remote storage, transfers are faster on Xet because duplicated data is only transferred once. For example, uploading a large‑scale dataset to Hugging Face leverages Xet to accelerate the upload, and once the dataset is uploaded it can be streamed right away.

Deduplication for Parquet is enabled through Parquet Content‑Defined Chunking (CDC). Thanks to Parquet CDC and Xet deduplication, uploading datasets to Hugging Face is faster than to traditional remote storage.

This is supported by our pyspark_huggingface package, a Spark Data Source for reading and writing HF datasets. It includes Parquet CDC and Xet support, dramatically accelerating data transfers to and from the Hub.
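
As a rough sketch of what that looks like from Spark (the registration step and available options may differ by version; the dataset repo ids below are illustrative, not from the post):

from pyspark.sql import SparkSession
import pyspark_huggingface  # importing the package registers the "huggingface" data source

spark = SparkSession.builder.appName("hf-datasets-demo").getOrCreate()

# Read a Hub dataset as a Spark DataFrame
df = spark.read.format("huggingface").load("stanfordnlp/imdb")

# Write a DataFrame back to the Hub as Parquet (requires being logged in with a write token)
df.write.format("huggingface").save("my-username/my-dataset")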


Need a Custom Streaming Pipeline?

Some data file formats are not supported in datasets, and sometimes there is a need for more control, so we made it easy to build custom streaming pipelines. This has been battle‑tested in the LeRobot library to sample video frames, and in the WebDataset library to stream TAR archives.

We improved the HfFileSystem in the huggingface_hub library to efficiently read files from remote Hugging Face dataset repositories and stream data:


from huggingface_hub import HfFileSystem

# dataset_id and path_in_repo identify the file to stream from the Hub
path = f"hf://datasets/{dataset_id}/{path_in_repo}"

with HfFileSystem().open(path) as f:
    # loop with .read() or .readline() to stream data,
    # or do random access with .seek()
    chunk = f.read(1024)

Passing an HfFileSystem to a torch DataLoader reuses the cached results from .ls() and .glob(), which eliminates the need for additional requests when listing data files.
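
As an illustration, a minimal custom pipeline built on HfFileSystem might look like the sketch below; the repo id, the JSONL file pattern, the worker sharding, and the line-level parsing are assumptions for the example rather than anything prescribed by the library:

from huggingface_hub import HfFileSystem
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class JsonlLineStream(IterableDataset):
    """Streams raw lines from JSONL files in a Hub dataset repo (illustrative)."""

    def __init__(self, dataset_id: str, pattern: str = "**/*.jsonl"):
        self.fs = HfFileSystem()
        # .glob() results are cached on the filesystem object, so DataLoader
        # workers reuse them instead of re-listing the repo
        self.paths = self.fs.glob(f"hf://datasets/{dataset_id}/{pattern}")

    def __iter__(self):
        info = get_worker_info()
        # Shard the file list across DataLoader workers so they don't overlap
        paths = self.paths if info is None else self.paths[info.id :: info.num_workers]
        for path in paths:
            with self.fs.open(path) as f:
                for line in f:
                    yield line  # parse JSON here in a real pipeline

loader = DataLoader(JsonlLineStream("my-username/my-jsonl-dataset"), batch_size=None, num_workers=2)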


Push Streaming to the Limit

We're now using these streaming enhancements in nanoVLM to train the next generation of SmolVLMs. With these tweaks, we get better performance from streaming than from our cluster's hierarchical hard‑disk setup. In fact, streaming is now as fast as reading the data from local SSDs! Previously, transferring data to those local SSDs delayed our training runs by three hours. For more details, check out our GitHub.


Get Started and See the Difference

These powerful new features landed in the datasets and huggingface_hub libraries. To take advantage of them, simply update your libraries and check out the documentation:


pip install --upgrade datasets huggingface_hub

To celebrate this, we pre‑concatenated and shuffled all the data sources in FineVision into FineVisionMax. You can use this single combined dataset to train your VLM – no need to handle multiple datasets manually!


from datasets import load_dataset

# Stream a dataset instead of downloading it
dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)

# Get the first example
print(next(iter(dataset)))

And you can see how we do it at scale in nanoVLM!

Happy streaming! 🤗
