How to Tune NVMe Storage for High Throughput AI Training

Posted by Dev S.
Sep 3, 2025

Idle GPUs are expensive and I refuse to let storage be the reason.

My goal is simple: keep the path from disk to device so smooth that storage becomes invisible and training feels predictably fast.

The playbook below is the order I use on fresh nodes and shared clusters: prove a baseline, fix topology, choose shard-friendly formats, nudge the block layer and keep the dataloader ahead of the GPUs.

1. Establish a clean baseline.


Before touching any settings, I let the system settle and run a simple read test to capture realistic throughput.

On healthy PCIe Gen4 NVMes, large sequential reads usually land in the multi-GB/s range.

Those figures become my baseline; every later tweak must beat this reference during real training.

# Non-destructive sequential read baseline
fio --name=read --filename=/mnt/nvme0n1/testfile --size=16G \
    --rw=read --bs=1M --iodepth=32 --ioengine=io_uring --direct=1 --numjobs=1

# Snapshot of small random-read capability
fio --name=randread --filename=/mnt/nvme0n1/testfile --size=16G \
    --rw=randread --bs=4k --iodepth=64 --ioengine=io_uring --direct=1 --numjobs=1

2. Align topology so storage and GPUs sit close.


NVMe devices and GPUs should share the same CPU socket.

This placement shortens the path and reduces cross-socket penalties, so input rates stabilize without any code changes.

I also ensure drives use full-speed PCIe lanes rather than chipset-attached slots.

# Inspect PCIe tree and placement 
lspci -tv 
 
# Confirm NUMA layout and memory nodes 
numactl --hardware 
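
Once the layout is clear, I pin the training process to the node that owns both the drives and the GPUs. A minimal sketch, assuming node 0 holds them and train.py is a stand-in for the real launch command:

# Bind CPUs and memory to NUMA node 0 (adjust to match your topology)
numactl --cpunodebind=0 --membind=0 python train.py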

3. Add striping when a single drive cannot keep up.


When several GPUs read in parallel, one NVMe often cannot supply enough bandwidth. 

Accordingly, I stripe identical drives for aggregate throughput and present one fast volume to the filesystem.

I do this only on empty devices because the operation wipes data.

# DESTRUCTIVE: creates RAID0 over four NVMes; wipes data 
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=1024 /dev/nvme[0-3]n1 
sudo mkfs.xfs -f /dev/md0 
sudo mount -t xfs -o noatime /dev/md0 /data 
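
Before trusting the new volume, I confirm the array assembled with the expected members and chunk size:

# Verify array state, members and chunk size
cat /proc/mdstat
sudo mdadm --detail /dev/md0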

4. Choose a solid filesystem and mount sensibly.


On Linux, I default to XFS, although ext4 also performs well. Mounting with noatime avoids needless metadata writes during heavy reads.

Meanwhile, local NVMe acts as my hot cache and scratch tier, while durability lives in object storage. Recovery stays simple if a drive fails, and I never confuse speed with safety: the local tier is for fast access, the remote tier is for long-term retention.

# Prefer scheduled fstrim over continuous discard during training 
sudo fstrim -av 
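
To make the mount survive reboots and keep trims periodic, an /etc/fstab entry plus the systemd timer covers both; the device path below is the striped volume from the previous step:

# Example /etc/fstab entry for the striped volume
/dev/md0  /data  xfs  noatime  0  0

# Run trims on a schedule instead of mounting with discard
sudo systemctl enable --now fstrim.timer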

5. Nudge the block layer to favour throughput.


NVMe devices benefit from a lean scheduler, deeper queues and generous read-ahead. 

Individually these changes seem modest; however, together they reduce stalls and keep large transfers smooth across epochs.

I also keep the CPU governor on performance and let irqbalance distribute interrupts.

# Scheduler and queue depth 
for q in /sys/block/nvme*n1/queue; do 
echo none | sudo tee $q/scheduler 
echo 1024 | sudo tee $q/nr_requests 
done 
 
# Larger read-ahead (example: 8 MiB) on the striped device 
sudo blockdev --setra 16384 /dev/md0 
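
The governor and interrupt settings mentioned above are one-liners; this assumes the cpupower utility from the kernel tools package is installed:

# Keep cores at full clocks and spread NVMe interrupts across them
sudo cpupower frequency-set -g performance
sudo systemctl enable --now irqbalance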

6. Match I/O sizes to the access pattern.


For streaming datasets, one-to-four-megabyte reads align well with controller queues and unlock peak bandwidth.

During heavy sequential scans, I prefer direct I/O to avoid page-cache churn.

Conversely, when dealing with many small random reads, I allow the page cache to help or I rethink the on-disk layout so access becomes more sequential.

Thus, block size and access pattern reinforce each other rather than work at cross-purposes.
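
When I am unsure where the knee sits for a particular drive, a quick block-size sweep with direct I/O (same fio style as the baseline; sizes are illustrative) settles it:

# Sweep read sizes with direct I/O to find where bandwidth plateaus
for bs in 128k 512k 1M 4M; do
  fio --name=bs-$bs --filename=/mnt/nvme0n1/testfile --size=8G \
      --rw=read --bs=$bs --iodepth=32 --ioengine=io_uring --direct=1 --numjobs=1
done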

7. Use shard-friendly formats rather than tiny files.


Millions of small files cause metadata storms and random I/O that overwhelm even fast drives.

Consequently, I convert datasets into shards such as WebDataset tar files, TFRecord, Parquet or MDS, then read several shards concurrently.

For vision, audio and multimodal corpora, this single shift often delivers the largest and most immediate improvement in throughput and stability.
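
As a rough sketch of the sharding step, assuming a flat train/ directory where files belonging to the same sample sort next to each other, packing about a thousand files per tar shard looks like this (paths and shard size are illustrative):

# Pack ~1000 files per WebDataset-style tar shard
mkdir -p shards
ls train/ | sort | split -l 1000 - /tmp/shardlist.
i=0
for list in /tmp/shardlist.*; do
  tar -C train -cf shards/shard-$(printf '%06d' $i).tar -T "$list"
  i=$((i+1))
done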

8. Keep the dataloader ahead of the GPUs.


Storage tuning pays off only if the input pipeline consumes data efficiently.

Therefore, I increase worker counts, keep them persistent across epochs, deepen prefetch and enable pinned memory for quicker host-to-device transfers.

If CPU decoding becomes a bottleneck, I adopt NVIDIA DALI or simplify transforms.

As a rule of thumb, I budget roughly 0.5 to 2.0 GB/s per GPU, then I scale readers to meet that target, so utilisation stays high.

# PyTorch: a DataLoader tuned to stay ahead
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=bs,
    shuffle=True,
    num_workers=8,            # raise as cores allow
    persistent_workers=True,  # avoid worker spin-up each epoch
    prefetch_factor=4,        # keep queues full
    pin_memory=True,          # faster host-to-device transfers
)

9. Write checkpoints locally, then sync asynchronously.


Large checkpoint writes can pause training at awkward moments.

To avoid this, I land checkpoints on local NVMe first and then sync to object storage in the background for durability.

When checkpointing frequently, I separate reads and writes across different NVMes or stripes to minimize contention.

Additionally, modest writeback tuning reduces bursty pauses and the loop stays responsive even during frequent saves.

# Async copy to object storage after a local checkpoint
# (rsync cannot target s3:// URLs; aws s3 sync or rclone can, and both can be rate-limited)
aws s3 sync /data/checkpoints s3://bucket/run-123/ &
 
# Gentler writeback to avoid big pauses 
sudo sysctl -w vm.dirty_background_ratio=5 
sudo sysctl -w vm.dirty_ratio=20 

10. Consider GPUDirect Storage once fundamentals are solid.


With compatible drivers and filesystems, NVIDIA GPUDirect Storage moves data from NVMe to GPU memory with less CPU copying.

I keep I/O 4k-aligned and near one-to-four-megabyte operations to maintain smooth throughput.

Nevertheless, I bring GDS in only after topology, shard formats and dataloader settings are correct. Otherwise, a faster path merely exposes a poorly organized dataset more quickly.
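
When I do enable it, I first check that the driver stack reports a supported path; assuming the default CUDA install location, the bundled gdscheck tool prints a capability summary:

# Confirm GPUDirect Storage support on this node (default install path assumed)
/usr/local/cuda/gds/tools/gdscheck -p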

11. Monitor device health to prevent mid-run slowdowns.


Performance fades when drives overheat or develop media issues. Consequently, I keep firmware current, review SMART logs and watch temperatures during long epochs.

Better airflow often restores speed when thermal throttling appears; otherwise, bandwidth can halve across a run, which looks like sporadic GPU idling despite stable code.

# Health and firmware checks 
sudo nvme smart-log /dev/nvme0n1 
sudo nvme list   # model and firmware revision per device 
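
During long epochs I also keep a lightweight eye on temperature, since the smart-log already exposes it:

# Poll drive temperature once a minute during long runs
watch -n 60 "sudo nvme smart-log /dev/nvme0n1 | grep -i temperature"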

12. Observe, adjust and iterate during training.


After launch, I watch GPU utilization, disk utilization and dataloader timings together.

If GPUs slip below the high eighties or nineties, I raise workers, deepen prefetch, convert more data into shards or lighten transforms.

If disks sit at sustained saturation, I add striping, increase read sizes or slightly reduce per-GPU demand to stabilize the pipeline.

Throughout, I change one variable at a time and compare results against the baseline so improvements are real.
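
For the watching itself, two terminals are usually enough; the device names below match the striped setup from earlier:

# Terminal 1: per-second GPU utilisation
nvidia-smi dmon -s u

# Terminal 2: device utilisation and throughput every 5 seconds
iostat -xm nvme0n1 md0 5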

Make Performance the Default!


Great storage tuning should be invisible and that is the point.

Once topology is aligned, shards flow cleanly, block-layer nudges are in place and the dataloader consistently runs ahead, training becomes predictably fast, epoch after epoch.

I size local NVMe at one to two times the dataset so there is room for shards, caches and checkpoints, and I keep durability elsewhere so recovery stays painless and risk stays low.

If I were you, I’d start small, measure honestly and iterate with intent.
