# Creating New Datasets

This guide walks you through creating a new dataset for the LMMS Engine. We support both **map-style (naive)** and **iterable** datasets to handle different data loading scenarios.

## Architecture Overview

The LMMS Engine provides a flexible dataset framework with the following hierarchy:

```
BaseDataset / BaseIterableDataset
    ↓
MultiModalDataset / MultiModalIterableDataset
    ↓
(MultiModalDataLoadingMixin)
    ↓
YourCustomDataset
```

### Key Components

- **Base Classes**: Abstract base classes that define the dataset interface
- **MultiModal Classes**: Implement common functionality for handling images, audio, and video
- **MultiModalDataLoadingMixin**: Provides reusable methods for loading different media types
- **Your Custom Dataset**: Inherits from one of the multimodal classes and implements data format-specific logic

## Choosing Between Map-Style and Iterable Datasets

### Map-Style Datasets (Recommended for Most Use Cases)

**Use `MultiModalDataset` when:**
- Your dataset fits entirely in memory or has a fixed, known size
- You need random access to samples (important for training with shuffling)
- Your data is relatively static and doesn't stream continuously
- You want simpler implementation with fewer moving parts

**Advantages:**
- Simpler implementation with standard indexing (`__getitem__`)
- Better compatibility with distributed training
- Can easily apply packing strategies for efficient batching
- Supports data shuffling and filtering during dataset building

**Example:**
```python
dataset = VisionSFTDataset(config)
sample = dataset[42]  # Direct random access
```

### Iterable Datasets (For Streaming Data)

**Use `MultiModalIterableDataset` when:**
- Your dataset is very large or streams data continuously
- You're using it with a real-time data source
- You prefer yielding samples via iteration rather than indexing
- Your data format naturally supports streaming

**Advantages:**
- Handles large datasets without loading everything into memory
- Native streaming support for distributed training
- Better for continuous data pipelines
- Can dynamically fetch and process data on-the-fly

**Example:**
```python
dataset = VisionSFTIterableDataset(config)
for sample in dataset:
    process(sample)
```

## Quick Start: Creating a Map-Style Dataset

Here's the simplest approach - inherit from `MultiModalDataset`:

### Step 1: Create Your Dataset Class

```python
from typing import Dict
import torch
from PIL import Image

from lmms_engine.datasets.naive.multimodal_dataset import MultiModalDataset
from lmms_engine.datasets.collator import VisionCollator
from lmms_engine.mapping_func import register_dataset
from lmms_engine.utils.train_utils import TrainUtilities


@register_dataset("my_dataset")
class MyCustomDataset(MultiModalDataset):
    """Custom dataset for handling my specific data format."""
    
    def load_from_json(self, data, data_folder=None) -> Dict[str, torch.Tensor]:
        """
        Load and process data from JSON format.
        
        Args:
            data: Dictionary containing a 'messages' key with conversation data
            data_folder: Optional base folder for relative file paths
            
        Returns:
            Dictionary with 'input_ids', 'attention_mask', etc.
        """
        messages = data["messages"]
        images_list = []
        videos = []
        kwargs = {}
        
        # Extract media from messages
        for message in messages:
            for content in message["content"]:
                if content["type"] == "image_url":
                    images_list.append(content["image_url"]["url"])
                elif content["type"] == "video_url":
                    frames, sample_fps = self.load_videos(
                        content["video_url"]["url"],
                        data_folder=data_folder,
                        fps=self.config.fps,
                    )
                    videos.append(frames)
                    kwargs["fps"] = sample_fps
        
        # Convert to HuggingFace format
        hf_messages = TrainUtilities.convert_open_to_hf(messages)
        
        # Load images
        if data_folder is not None:
            images = [
                Image.open(os.path.join(data_folder, img)) 
                for img in images_list
            ]
        else:
            images = [Image.open(img) for img in images_list]
        
        if len(images) == 0:
            images = None
        if len(videos) == 0:
            videos = None
        
        # Process through the configured processor
        inputs = self.processor.process(
            images=images, 
            hf_messages=hf_messages, 
            videos=videos, 
            **kwargs
        )
        return inputs
    
    def load_from_hf(self, data) -> Dict[str, torch.Tensor]:
        """Load from HuggingFace dataset format."""
        messages = data["messages"]
        hf_messages = TrainUtilities.convert_open_to_hf(messages)
        
        # Handle single or multiple images
        if isinstance(data["image"], list):
            images = data["image"]
        else:
            images = [data["image"]]
        
        inputs = self.processor.process(
            images=images, 
            hf_messages=hf_messages
        )
        return inputs
    
    def get_collator(self):
        """Return the appropriate collator for batching."""
        return VisionCollator(self.processor)
```

### Step 2: Register Your Dataset

The `@register_dataset("my_dataset")` decorator automatically registers your dataset. You can then use it in your config:

```yaml
dataset_type: my_dataset
```

## Creating an Iterable Dataset

For iterable datasets, inherit from `MultiModalIterableDataset`:

```python
from lmms_engine.datasets.iterable.multimodal_iterable_dataset import (
    MultiModalIterableDataset,
)


@register_dataset("my_iterable_dataset")
class MyIterableDataset(MultiModalIterableDataset):
    """Streaming dataset for continuous data pipelines."""
    
    def load_from_json(self, data, data_folder=None) -> Dict[str, torch.Tensor]:
        # Same implementation as map-style
        # The base class handles streaming logic
        pass
    
    def load_from_hf(self, data) -> Dict[str, torch.Tensor]:
        # Same implementation as map-style
        pass
    
    def get_collator(self):
        return VisionCollator(self.processor)
```

The base `MultiModalIterableDataset` handles the `__iter__` method for you, calling your `load_from_json` or `load_from_hf` methods as it iterates through the data.

## Required Methods to Implement

Every custom dataset must implement these methods:

### 1. `load_from_json(data, data_folder=None)`

Transforms raw JSON data into processor-ready format:
- Extract media paths/URLs
- Load images using `self.load_image()`
- Load videos using `self.load_videos()`
- Load audio using `self.load_audio()`
- Convert to HuggingFace message format
- Return processed dictionary with tensor outputs

### 2. `load_from_hf(data)`

Handles HuggingFace dataset format:
- Extract data fields
- Process media similarly to JSON
- Return processed dictionary

### 3. `get_collator()`

Returns a collator instance for batching:
- Import appropriate collator (e.g., `VisionCollator`, `AudioCollator`)
- Pass your processor instance
- Example: `return VisionCollator(self.processor)`

### 4. `_build_from_config()` (Optional)

Override if you need custom initialization logic. The base class already handles:
- Loading various data formats (JSON, JSONL, Arrow, Parquet, HuggingFace, YAML)
- Shuffling
- Token estimation
- Packing (for map-style)

## Available Media Loading Methods

The `MultiModalDataLoadingMixin` provides these methods:

### Loading Images

```python
image = self.load_image(image_path, data_folder=None)
# Returns: PIL.Image
```

### Loading Audio

```python
audio = self.load_audio(audio_path, sr=16000, data_folder=None)
# Returns: numpy.ndarray (1D)
```

### Loading Videos

```python
frames, sample_fps = self.load_videos(
    video_path, 
    data_folder=None, 
    fps=1
)
# Returns: (numpy.ndarray, float)
```

## Supported Data Formats

Map-style and iterable datasets support loading from:

- **JSON**: List of data dictionaries with 'messages'
- **JSONL**: Line-delimited JSON
- **Arrow**: Hugging Face arrow format
- **Parquet**: Parquet format
- **HuggingFace**: Direct HF dataset loading
- **YAML**: Inline datasets or external YAML files

## Object Storage Support

Both dataset types support cloud storage backends:

- **GCS (Google Cloud Storage)**: Set `object_storage: "gcs"`
- **Azure Blob Storage**: Set `object_storage: "azure"`
- **Local filesystem**: Set `object_storage: "none"` (default)

## Configuration

Your dataset configuration typically looks like:

```yaml
dataset_type: my_dataset
dataset_format: json  # json, jsonl, arrow, parquet, hf_dataset, yaml
dataset_path: /path/to/data.json
shuffle: true
filter_overlong: true
max_length: 2048
packing: false
processor_config:
  processor_type: qwen_processor  # Or your processor type
```

## Best Practices

1. **Start with `MultiModalDataset`**: Unless you have a specific need for streaming, use map-style datasets. They're simpler and more compatible with standard training loops.

2. **Reuse MultiModalDataset**: Inheriting from `MultiModalDataset` gives you all the built-in data format handling and media loading methods.

3. **Implement format handlers**: Focus on `load_from_json()` and `load_from_hf()`. These are the main customization points.

4. **Use TrainUtilities.convert_open_to_hf()**: This standardizes your message format for the processor.

5. **Handle missing media gracefully**: Set images/videos to `None` if not present.

6. **Leverage the processor**: The processor handles tokenization, image resizing, etc. Pass it the standard format and let it work.

## Example: Complete Vision Dataset

```python
from typing import Dict
import os
import torch
from PIL import Image

from lmms_engine.datasets.naive.multimodal_dataset import MultiModalDataset
from lmms_engine.datasets.collator import VisionCollator
from lmms_engine.mapping_func import register_dataset
from lmms_engine.utils.train_utils import TrainUtilities


@register_dataset("custom_vision")
class CustomVisionDataset(MultiModalDataset):
    
    def load_from_json(self, data, data_folder=None) -> Dict[str, torch.Tensor]:
        messages = data["messages"]
        images_list = []
        
        for message in messages:
            for content in message["content"]:
                if content["type"] == "image_url":
                    images_list.append(content["image_url"]["url"])
        
        hf_messages = TrainUtilities.convert_open_to_hf(messages)
        
        if data_folder is not None:
            images = [
                Image.open(os.path.join(data_folder, img)) 
                for img in images_list
            ]
        else:
            images = [Image.open(img) for img in images_list] if images_list else None
        
        inputs = self.processor.process(
            images=images, 
            hf_messages=hf_messages
        )
        return inputs
    
    def load_from_hf(self, data) -> Dict[str, torch.Tensor]:
        messages = data["messages"]
        hf_messages = TrainUtilities.convert_open_to_hf(messages)
        images = data.get("image", None)
        
        inputs = self.processor.process(
            images=images, 
            hf_messages=hf_messages
        )
        return inputs
    
    def get_collator(self):
        return VisionCollator(self.processor)
```

## Testing Your Dataset

After implementation, test your dataset:

```python
from lmms_engine.datasets.config import DatasetConfig

config = DatasetConfig(
    dataset_type="my_dataset",
    dataset_format="json",
    dataset_path="path/to/data.json",
    processor_config={"processor_type": "qwen_processor"}
)

dataset = MyCustomDataset(config)
dataset.build()

# Test access
sample = dataset[0]
print(sample.keys())  # Should have input_ids, attention_mask, etc.

# Test collator
collator = dataset.get_collator()
batch = collator([dataset[i] for i in range(4)])
```

## Common Issues

### Issue: AttributeError for `load_from_json`
**Cause**: Method signature mismatch
**Solution**: Ensure your method signature matches: `load_from_json(self, data, data_folder=None)`

### Issue: Missing media files
**Cause**: Incorrect path construction
**Solution**: Always use `os.path.join(data_folder, path)` when data_folder is provided

### Issue: Processor returns empty tensors
**Cause**: Incorrect message format for processor
**Solution**: Use `TrainUtilities.convert_open_to_hf()` to standardize format

## See Also

- [Dataset Configuration Reference](../reference/dataset_configuration.md)
- [Video Configuration](../reference/video_configuration.md)
- [API Reference](../reference/api.md)