Creating New Datasets

This guide walks you through creating a new dataset for the LMMS Engine. We support both map-style (naive) and iterable datasets to handle different data loading scenarios.

Architecture Overview

The LMMS Engine provides a flexible dataset framework with the following hierarchy:

BaseDataset / BaseIterableDataset
    ↓
MultiModalDataset / MultiModalIterableDataset
    ↓
(MultiModalDataLoadingMixin)
    ↓
YourCustomDataset

Key Components

  • Base Classes: Abstract base classes that define the dataset interface

  • MultiModal Classes: Implement common functionality for handling images, audio, and video

  • MultiModalDataLoadingMixin: Provides reusable methods for loading different media types

  • Your Custom Dataset: Inherits from one of the multimodal classes and implements data format-specific logic

Choosing Between Map-Style and Iterable Datasets

Iterable Datasets (For Streaming Data)

Use MultiModalIterableDataset when:

  • Your dataset is very large or streams data continuously

  • You’re using it with a real-time data source

  • You prefer yielding samples via iteration rather than indexing

  • Your data format naturally supports streaming

Advantages:

  • Handles large datasets without loading everything into memory

  • Native streaming support for distributed training

  • Better for continuous data pipelines

  • Can dynamically fetch and process data on-the-fly

Example:

dataset = VisionSFTIterableDataset(config)
for sample in dataset:
    process(sample)

Quick Start: Creating a Map-Style Dataset

Here’s the simplest approach - inherit from MultiModalDataset:

Step 1: Create Your Dataset Class

from typing import Dict
import torch
from PIL import Image

from lmms_engine.datasets.naive.multimodal_dataset import MultiModalDataset
from lmms_engine.datasets.collator import VisionCollator
from lmms_engine.mapping_func import register_dataset
from lmms_engine.utils.train_utils import TrainUtilities


@register_dataset("my_dataset")
class MyCustomDataset(MultiModalDataset):
    """Custom dataset for handling my specific data format."""
    
    def load_from_json(self, data, data_folder=None) -> Dict[str, torch.Tensor]:
        """
        Load and process data from JSON format.
        
        Args:
            data: Dictionary containing a 'messages' key with conversation data
            data_folder: Optional base folder for relative file paths
            
        Returns:
            Dictionary with 'input_ids', 'attention_mask', etc.
        """
        messages = data["messages"]
        images_list = []
        videos = []
        kwargs = {}
        
        # Extract media from messages
        for message in messages:
            for content in message["content"]:
                if content["type"] == "image_url":
                    images_list.append(content["image_url"]["url"])
                elif content["type"] == "video_url":
                    frames, sample_fps = self.load_videos(
                        content["video_url"]["url"],
                        data_folder=data_folder,
                        fps=self.config.fps,
                    )
                    videos.append(frames)
                    kwargs["fps"] = sample_fps
        
        # Convert to HuggingFace format
        hf_messages = TrainUtilities.convert_open_to_hf(messages)
        
        # Load images
        if data_folder is not None:
            images = [
                Image.open(os.path.join(data_folder, img)) 
                for img in images_list
            ]
        else:
            images = [Image.open(img) for img in images_list]
        
        if len(images) == 0:
            images = None
        if len(videos) == 0:
            videos = None
        
        # Process through the configured processor
        inputs = self.processor.process(
            images=images, 
            hf_messages=hf_messages, 
            videos=videos, 
            **kwargs
        )
        return inputs
    
    def load_from_hf(self, data) -> Dict[str, torch.Tensor]:
        """Load from HuggingFace dataset format."""
        messages = data["messages"]
        hf_messages = TrainUtilities.convert_open_to_hf(messages)
        
        # Handle single or multiple images
        if isinstance(data["image"], list):
            images = data["image"]
        else:
            images = [data["image"]]
        
        inputs = self.processor.process(
            images=images, 
            hf_messages=hf_messages
        )
        return inputs
    
    def get_collator(self):
        """Return the appropriate collator for batching."""
        return VisionCollator(self.processor)

Step 2: Register Your Dataset

The @register_dataset("my_dataset") decorator automatically registers your dataset. You can then use it in your config:

dataset_type: my_dataset

Creating an Iterable Dataset

For iterable datasets, inherit from MultiModalIterableDataset:

from lmms_engine.datasets.iterable.multimodal_iterable_dataset import (
    MultiModalIterableDataset,
)


@register_dataset("my_iterable_dataset")
class MyIterableDataset(MultiModalIterableDataset):
    """Streaming dataset for continuous data pipelines."""
    
    def load_from_json(self, data, data_folder=None) -> Dict[str, torch.Tensor]:
        # Same implementation as map-style
        # The base class handles streaming logic
        pass
    
    def load_from_hf(self, data) -> Dict[str, torch.Tensor]:
        # Same implementation as map-style
        pass
    
    def get_collator(self):
        return VisionCollator(self.processor)

The base MultiModalIterableDataset handles the __iter__ method for you, calling your load_from_json or load_from_hf methods as it iterates through the data.

Required Methods to Implement

Every custom dataset must implement these methods:

1. load_from_json(data, data_folder=None)

Transforms raw JSON data into processor-ready format:

  • Extract media paths/URLs

  • Load images using self.load_image()

  • Load videos using self.load_videos()

  • Load audio using self.load_audio()

  • Convert to HuggingFace message format

  • Return processed dictionary with tensor outputs

2. load_from_hf(data)

Handles HuggingFace dataset format:

  • Extract data fields

  • Process media similarly to JSON

  • Return processed dictionary

3. get_collator()

Returns a collator instance for batching:

  • Import appropriate collator (e.g., VisionCollator, AudioCollator)

  • Pass your processor instance

  • Example: return VisionCollator(self.processor)

4. _build_from_config() (Optional)

Override if you need custom initialization logic. The base class already handles:

  • Loading various data formats (JSON, JSONL, Arrow, Parquet, HuggingFace, YAML)

  • Shuffling

  • Token estimation

  • Packing (for map-style)

Available Media Loading Methods

The MultiModalDataLoadingMixin provides these methods:

Loading Images

image = self.load_image(image_path, data_folder=None)
# Returns: PIL.Image

Loading Audio

audio = self.load_audio(audio_path, sr=16000, data_folder=None)
# Returns: numpy.ndarray (1D)

Loading Videos

frames, sample_fps = self.load_videos(
    video_path, 
    data_folder=None, 
    fps=1
)
# Returns: (numpy.ndarray, float)

Supported Data Formats

Map-style and iterable datasets support loading from:

  • JSON: List of data dictionaries with ‘messages’

  • JSONL: Line-delimited JSON

  • Arrow: Hugging Face arrow format

  • Parquet: Parquet format

  • HuggingFace: Direct HF dataset loading

  • YAML: Inline datasets or external YAML files

Object Storage Support

Both dataset types support cloud storage backends:

  • GCS (Google Cloud Storage): Set object_storage: "gcs"

  • Azure Blob Storage: Set object_storage: "azure"

  • Local filesystem: Set object_storage: "none" (default)

Configuration

Your dataset configuration typically looks like:

dataset_type: my_dataset
dataset_format: json  # json, jsonl, arrow, parquet, hf_dataset, yaml
dataset_path: /path/to/data.json
shuffle: true
filter_overlong: true
max_length: 2048
packing: false
processor_config:
  processor_type: qwen_processor  # Or your processor type

Best Practices

  1. Start with MultiModalDataset: Unless you have a specific need for streaming, use map-style datasets. They’re simpler and more compatible with standard training loops.

  2. Reuse MultiModalDataset: Inheriting from MultiModalDataset gives you all the built-in data format handling and media loading methods.

  3. Implement format handlers: Focus on load_from_json() and load_from_hf(). These are the main customization points.

  4. Use TrainUtilities.convert_open_to_hf(): This standardizes your message format for the processor.

  5. Handle missing media gracefully: Set images/videos to None if not present.

  6. Leverage the processor: The processor handles tokenization, image resizing, etc. Pass it the standard format and let it work.

Example: Complete Vision Dataset

from typing import Dict
import os
import torch
from PIL import Image

from lmms_engine.datasets.naive.multimodal_dataset import MultiModalDataset
from lmms_engine.datasets.collator import VisionCollator
from lmms_engine.mapping_func import register_dataset
from lmms_engine.utils.train_utils import TrainUtilities


@register_dataset("custom_vision")
class CustomVisionDataset(MultiModalDataset):
    
    def load_from_json(self, data, data_folder=None) -> Dict[str, torch.Tensor]:
        messages = data["messages"]
        images_list = []
        
        for message in messages:
            for content in message["content"]:
                if content["type"] == "image_url":
                    images_list.append(content["image_url"]["url"])
        
        hf_messages = TrainUtilities.convert_open_to_hf(messages)
        
        if data_folder is not None:
            images = [
                Image.open(os.path.join(data_folder, img)) 
                for img in images_list
            ]
        else:
            images = [Image.open(img) for img in images_list] if images_list else None
        
        inputs = self.processor.process(
            images=images, 
            hf_messages=hf_messages
        )
        return inputs
    
    def load_from_hf(self, data) -> Dict[str, torch.Tensor]:
        messages = data["messages"]
        hf_messages = TrainUtilities.convert_open_to_hf(messages)
        images = data.get("image", None)
        
        inputs = self.processor.process(
            images=images, 
            hf_messages=hf_messages
        )
        return inputs
    
    def get_collator(self):
        return VisionCollator(self.processor)

Testing Your Dataset

After implementation, test your dataset:

from lmms_engine.datasets.config import DatasetConfig

config = DatasetConfig(
    dataset_type="my_dataset",
    dataset_format="json",
    dataset_path="path/to/data.json",
    processor_config={"processor_type": "qwen_processor"}
)

dataset = MyCustomDataset(config)
dataset.build()

# Test access
sample = dataset[0]
print(sample.keys())  # Should have input_ids, attention_mask, etc.

# Test collator
collator = dataset.get_collator()
batch = collator([dataset[i] for i in range(4)])

Common Issues

Issue: AttributeError for load_from_json

Cause: Method signature mismatch Solution: Ensure your method signature matches: load_from_json(self, data, data_folder=None)

Issue: Missing media files

Cause: Incorrect path construction Solution: Always use os.path.join(data_folder, path) when data_folder is provided

Issue: Processor returns empty tensors

Cause: Incorrect message format for processor Solution: Use TrainUtilities.convert_open_to_hf() to standardize format

See Also