Datasets and Packing: Naive vs Streaming
This guide explains the two dataset implementations in LMMS Engine and helps you choose the right approach for your training needs.
Overview
LMMS Engine provides two distinct dataset implementations:
Dataset Type |
Class |
Description |
Best For |
|---|---|---|---|
Naive (Map-style) |
|
Precomputes packing groups before training |
Small to medium datasets, deterministic packing |
Streaming (Iterable) |
|
Packs sequences on-the-fly during iteration |
Large datasets, low memory usage, dynamic data |
Both implementations share the same DatasetConfig interface for seamless switching between approaches.
Quick Start
Basic Usage
from lmms_engine.datasets import DatasetConfig, MultiModalDataset, MultiModalIterableDataset
from lmms_engine.train import FSDP2SFTTrainer
# Configure your dataset
config = DatasetConfig(
# Core settings
dataset_type="vision", # Type: vision | vision_audio | fineweb_edu | rae | sit | qwen_omni
# Note: Use vision_iterable or bagel_iterable for streaming versions
dataset_format="hf_dataset", # Format: json | jsonl | csv | yaml | hf_dataset | arrow | parquet
dataset_path="your/dataset/path", # Path to dataset or HF Hub ID
# Processing
processor_config={"processor_type": "your_processor"},
shuffle=True,
# Packing configuration
packing=True, # Enable sequence packing
packing_length=32000, # Maximum tokens per packed sequence
filter_overlong=True, # Drop sequences > packing_length
packing_strategy="first_fit", # Naive only: first_fit | window_XX (ignored for Streaming)
)
# Choose your dataset implementation
# Option 1: Naive (precomputed packing)
dataset = MultiModalDataset(config)
# Option 2: Streaming (on-the-fly packing)
# Important: For Streaming dataset, prefer dataset_format="hf_dataset", "arrow", or "parquet"
# json/jsonl formats work better with Naive dataset
dataset = MultiModalIterableDataset(config)
# Build and use
dataset.build()
collator = dataset.get_collator()
# Train with FSDP2
trainer = FSDP2SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
data_collator=collator
)
trainer.train()
Dataset Implementation Details
Naive Dataset (Precomputed Packing)
The MultiModalDataset loads all data into memory and precomputes optimal packing arrangements before training begins.
How it works:
Load: Loads dataset (memory-mapped for arrow/parquet/hf_dataset, full load for json/jsonl)
Estimate: Calculates token length for each sample using map operations
Pack: Precomputes optimal packing arrangements using selected algorithm
Serve: Returns precomputed packs during training
Characteristics:
✅ Deterministic: Same packing arrangement every epoch
✅ Optimal packing: Can use sophisticated algorithms for better utilization
✅ Known length: Exact number of steps per epoch is known
⚠️ Memory usage: Full load for json/jsonl; memory-mapped for arrow/parquet/hf_dataset
❌ Slower startup: Preprocessing adds initialization time
When to use:
Dataset fits comfortably in memory (< 100GB)
You need reproducible training runs
Packing efficiency is critical
You’re debugging or experimenting
Streaming Dataset (On-the-fly Packing)
The MultiModalIterableDataset streams data and packs sequences dynamically during iteration.
How it works:
Stream: Loads data samples one at a time
Buffer: Accumulates samples in a buffer
Pack: When buffer + next sample >
packing_length, yields bufferFlush: Yields remaining buffer at epoch end
Characteristics:
✅ Memory efficient: Streams data samples without precomputing packs
✅ Fast startup: No preprocessing required
✅ Scales infinitely: Works with any dataset size
❌ Non-deterministic: Different packing each epoch
❌ Unknown length: Can’t calculate exact steps per epoch (use
max_stepsinstead ofnum_train_epochs)❌ Suboptimal packing: Uses greedy buffer-filling strategy - yields buffer when
buffer_length + next_sample > packing_length, may waste tokens compared to global optimization
When to use:
Large datasets (> 100GB)
Limited memory environments
Continuous/streaming data sources
Production training at scale
Distributed Training Behavior
Naive Dataset
Uses
DistributedSamplerorDistributedLengthGroupedSamplerEach rank gets deterministic subset of packs
Steps per epoch =
total_packs / world_sizeSupports
group_by_lengthfor improved training efficiency
Streaming Dataset
Performs rank sharding via
HFDataset.shard()Worker splitting via
torch.utils.data.get_worker_info()Dynamic step count per rank (depends on data distribution)
No sampler attachment (handled internally)
⚠️ Important: Ensure dataset length divides evenly across ranks to avoid imbalanced workloads.
Configuration Reference
Core Parameters
Parameter |
Type |
Description |
Default |
|---|---|---|---|
|
bool |
Enable sequence packing |
|
|
int |
Maximum tokens per packed sequence |
|
|
bool |
Drop sequences exceeding |
|
|
str |
Naive only: |
|
|
bool |
Shuffle dataset before packing |
|
Packing Strategies (Naive Only)
first_fit: Greedily pack sequences into first available spacewindow_XX: Group sequences within sliding windows of size XX (e.g.,window_100)First sorts sequences by length
Groups sequences within windows of XX consecutive samples
Provides better packing for sorted data while maintaining some randomness
Configuration Examples
YAML Configuration
dataset:
dataset_type: vision
dataset_format: hf_dataset
dataset_path: your/dataset/path
shuffle: true
packing: true
packing_length: 32000
filter_overlong: true
processor_config:
processor_type: your_processor
Python Configuration
config = DatasetConfig(
dataset_type="vision",
dataset_format="hf_dataset",
dataset_path="your/dataset/path",
packing=True,
packing_length=32000,
filter_overlong=True,
processor_config={"processor_type": "your_processor"}
)
Performance Tips
Optimizing Packing Efficiency
Choose appropriate
packing_length:Too small: Underutilized sequences
Too large: May exceed memory limits
Recommended: Start with model’s max sequence length
Monitor packing metrics:
# Trainer logs these automatically - perf/global_seq_len_avg # Average packed sequence length - perf/global_seq_len_min # Minimum across ranks - perf/global_seq_len_max # Maximum across ranks
Handle outliers:
Set
filter_overlong=Trueto drop anomalously long sequencesPrevents memory spikes and improves batch consistency
Memory Management
For Naive Dataset:
# Estimate memory usage
estimated_memory = num_samples * avg_sample_size * 1.2 # 20% overhead
For Streaming Dataset:
# Memory usage is constant
max_memory = batch_size * packing_length * token_size
Troubleshooting
Common Issues
1. Distributed Training Hangs
Problem: Collective operations hang during training.
Solution: Ensure all ranks have identical tensor shapes:
# Bad: Different shapes across ranks
loss = torch.tensor([loss1, loss2, ...])
# Good: Scalar aggregation
loss = torch.tensor(loss.item())
torch.distributed.all_reduce(loss, op=ReduceOp.AVG)
2. Imbalanced Workload
Problem: Some ranks finish before others.
Solution:
Ensure dataset size is divisible by world_size
Use streaming dataset for better load balancing
Enable
shuffle=Trueto randomize distribution
3. OOM with Naive Dataset
Problem: Out of memory during dataset loading.
Solutions:
Switch to streaming dataset
Reduce
packing_lengthEnable
filter_overlong=TrueUse data sharding:
dataset = load_dataset("path", split=f"train[{rank}:{rank+1}:{world_size}]")
Decision Matrix
Criterion |
Naive Dataset |
Streaming Dataset |
|---|---|---|
Dataset Size |
< 100GB |
Any size |
Memory Usage |
Medium (memory-mapped) to High (json/jsonl) |
Low |
Best Formats |
All supported |
hf_dataset, arrow, parquet |
Startup Time |
Slow |
Fast |
Packing Quality |
Optimal |
Good |
Reproducibility |
Yes |
No |
Step Count Known |
Yes |
No |
LR Schedulers |
All supported |
Limited (use max_steps) |
Best For |
Research, debugging |
Production, scale |
Migration Guide
From Naive to Streaming
# Before (Naive)
from lmms_engine.datasets import MultiModalDataset
dataset = MultiModalDataset(config)
# After (Streaming)
from lmms_engine.datasets import MultiModalIterableDataset
dataset = MultiModalIterableDataset(config)
# Note: Ensure dataset_format="hf_dataset"
From Streaming to Naive
# Before (Streaming)
from lmms_engine.datasets import MultiModalIterableDataset
dataset = MultiModalIterableDataset(config)
# After (Naive)
from lmms_engine.datasets import MultiModalDataset
dataset = MultiModalDataset(config)
# Note: May need to adjust memory allocation