## Train

To run training, prepare a YAML config. The example below is up to date and can be used as a template:

```yaml
trainer_type: fsdp2_trainer

# Dataset configuration, including the inline dataset definitions
dataset_config:
  dataset_type: vision
  dataset_format: yaml # Uses 'yaml' format for both external files and inline definitions

  # Inline dataset definitions (no dataset_path needed)
  datasets:
    - path: data/open_thoughts_debug
      data_folder: ""
      data_type: arrow

  # Processor configuration
  processor_config:
    processor_name: "Qwen/Qwen2.5-VL-7B-Instruct"
    processor_type: "qwen2_5_vl"

  # Packing configuration
  packing: true
  packing_strategy: first_fit
  packing_length: 16384

# Model configuration
model_config:
  load_from_pretrained_path: "Qwen/Qwen2.5-VL-7B-Instruct"
  attn_implementation: "flash_attention_2"

# Training arguments, mostly compatible with HuggingFace Trainer
trainer_args:
  per_device_train_batch_size: 1
  learning_rate: 1.0e-06 # write 1.0e-06, not 1e-06: some YAML parsers (e.g., PyYAML) read the latter as a string
  weight_decay: 0.0
  gradient_accumulation_steps: 1
  gradient_checkpointing: true
  num_train_epochs: 1
  save_steps: 100
  save_total_limit: 1
  report_to: "wandb"
  output_dir: "./output/debug"
  warmup_ratio: 0.0
  run_name: "qwen2_5_vl_config"
  eval_strategy: "no"
  logging_steps: 1
  group_by_length: true
  dataloader_num_workers: 8
  bf16: true
  lr_scheduler_type: "cosine"
  freeze_modules: ["visual"]
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen2_5_VLDecoderLayer"]
    reshard_after_forward: false
```

See the `config.py` file under each subfolder for the full list of configurable parameters.

### Key fields

- **trainer_type**: Use `hf_trainer` for standard HF Trainer or `fsdp2_trainer` for PyTorch FSDP2.
- **dataset_config.dataset_format**: `yaml`. You can either set `dataset_path` to an external YAML, or embed datasets inline via `datasets`.
- **datasets**: Each entry defines `path`, optional `data_folder`, and `data_type` (e.g., `arrow`, `parquet`).
- **processor_config**: Set `processor_name` (e.g., a Hugging Face model id) and `processor_type` (e.g., `qwen2_5_vl`).
- **packing**: Enable sequence packing with `packing: true`, and adjust `packing_strategy` and `packing_length`. Use `filter_overlong` to drop samples exceeding limits.
- **video options**: `video_backend`, `video_sampling_strategy`, `video_max_pixels`, `video_max_frames` control video preprocessing.
- **model_config**: Prefer `load_from_pretrained_path` and set `attn_implementation` (e.g., `flash_attention_2`).
- **freeze_modules**: List of submodules (e.g., `visual`) to freeze during training.
- **use_liger_kernel/use_rmpad**: Performance optimizations. Keep enabled if supported on your stack.
- **fsdp2/fsdp_config**: Enable FSDP2 sharding and wrap transformer layer classes via `transformer_layer_cls_to_wrap`. Tune `reshard_after_forward` for memory/perf trade-offs.
- **EMA (Exponential Moving Average)**: Enable EMA with `ema_enabled: true`. Configure `ema_decay` (default 0.9999), `ema_update_every`, `ema_start_step`, and optionally filter parameters via `ema_param_filter`. EMA checkpoints are saved alongside regular checkpoints and can be merged using `merge_fsdp.py` with `--state_dict_dirname pytorch_ema_model_fsdp_0`.
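
To make the `packing` options concrete, here is a minimal sketch of textbook first-fit packing over sequence lengths. It is an illustration only: `first_fit_pack` is a hypothetical helper, not the engine's implementation, and the way `filter_overlong` is modelled (dropping any sample longer than `packing_length`, or raising an error when the flag is off) is an assumption based on the description above.

```python
from typing import List


def first_fit_pack(
    lengths: List[int], packing_length: int, filter_overlong: bool = True
) -> List[List[int]]:
    """Group sample indices into packs whose total length fits packing_length.

    Hypothetical illustration of first-fit packing; not the lmms_engine code.
    """
    packs: List[List[int]] = []  # sample indices assigned to each pack
    remaining: List[int] = []    # free space left in each pack

    for idx, length in enumerate(lengths):
        if length > packing_length:
            if filter_overlong:
                continue  # assumption: overlong samples are silently dropped
            raise ValueError(f"sample {idx} has length {length} > {packing_length}")

        # First fit: place the sample in the first pack with enough room.
        for pack_id, free in enumerate(remaining):
            if length <= free:
                packs[pack_id].append(idx)
                remaining[pack_id] -= length
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([idx])
            remaining.append(packing_length - length)

    return packs


lengths = [9000, 8000, 7000, 20000, 4000]
packs = first_fit_pack(lengths, packing_length=16384)
print(packs)
```

Each inner list holds the indices of samples that share one packed sequence; the 20000-token sample is dropped because it exceeds `packing_length`.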

## Run

Example launch command:

```bash
export NCCL_BLOCKING_WAIT=0
export TOKENIZERS_PARALLELISM=false

# Hugging Face setup (optional)
export HF_TOKEN="<YOUR HF_TOKEN>"
export HF_HOME="$HOME/.cache/huggingface"
export HF_HUB_ENABLE_HF_TRANSFER="1"

export NCCL_DEBUG=INFO

CONFIG=$1  # path to your YAML config

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="8000" \
    -m lmms_engine.launch.cli config_yaml="${CONFIG}"
```

## Run directly from the CLI with Hydra overrides

Instead of using a YAML config file, you can pass configuration directly via Hydra overrides on the command line. This is useful for quick experiments and parameter tuning.

### Basic Usage

Use the format `key=value` to override any configuration parameter. Hydra automatically creates the nested structure:

```bash
torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="8000" \
    -m lmms_engine.launch.cli \
    trainer_type=fsdp2_trainer \
    dataset_config.dataset_path=/path/to/video_dataset.yaml \
    dataset_config.dataset_format=yaml \
    dataset_config.dataset_type=qwen3_vl_iterable \
    dataset_config.processor_config.processor_name="Qwen/Qwen3-VL-8B-Instruct" \
    dataset_config.processor_config.processor_type=qwen3_vl \
    model_config.load_from_pretrained_path="Qwen/Qwen3-VL-8B-Instruct" \
    model_config.attn_implementation=flash_attention_2 \
    trainer_args.per_device_train_batch_size=1 \
    trainer_args.learning_rate=2.0e-04 \
    trainer_args.num_train_epochs=1 \
    trainer_args.output_dir=./output/debug \
    trainer_args.bf16=true
```
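
As a rough mental model of how dotted keys become nested config, the sketch below expands `key=value` strings into a nested dictionary. `expand_overrides` is a hypothetical helper written for illustration; Hydra's real parser also handles typing, lists, and quoting, which this sketch deliberately ignores (every value stays a string).

```python
from typing import Any, Dict, List


def expand_overrides(overrides: List[str]) -> Dict[str, Any]:
    """Turn ['a.b=1', ...] into {'a': {'b': '1'}, ...} (values kept as strings)."""
    config: Dict[str, Any] = {}
    for item in overrides:
        key, _, value = item.partition("=")  # split on the first '=' only
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


config = expand_overrides([
    "trainer_type=fsdp2_trainer",
    "dataset_config.processor_config.processor_type=qwen3_vl",
    "trainer_args.bf16=true",
])
print(config)
```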

### Common Overrides

Here are frequently used parameters you can override:

**Dataset Configuration:**
- `dataset_config.dataset_path`: Path to your YAML dataset config
- `dataset_config.dataset_format`: Format type (e.g., `yaml`, `json`)
- `dataset_config.dataset_type`: Dataset type (e.g., `vision`, `qwen3_vl_iterable`)
- `dataset_config.processor_config.processor_name`: Model name for the processor
- `dataset_config.processor_config.processor_type`: Processor type to use
- `dataset_config.packing`: Enable/disable sequence packing (e.g., `dataset_config.packing=true`)
- `dataset_config.packing_length`: Max sequence length for packing
- `dataset_config.video_backend`: Video processing backend (e.g., `qwen_vl_utils`)
- `dataset_config.video_sampling_strategy`: Video sampling method (e.g., `fps`)
- `dataset_config.video_max_frames`: Maximum frames per video

**Model Configuration:**
- `model_config.load_from_pretrained_path`: Path or HF model ID to load from
- `model_config.attn_implementation`: Attention implementation (e.g., `flash_attention_2`)

**Training Arguments:**
- `trainer_args.per_device_train_batch_size`: Batch size per device
- `trainer_args.learning_rate`: Learning rate (use float notation like `2.0e-04`)
- `trainer_args.num_train_epochs`: Number of training epochs
- `trainer_args.max_steps`: Maximum training steps
- `trainer_args.gradient_accumulation_steps`: Gradient accumulation steps
- `trainer_args.gradient_checkpointing`: Enable gradient checkpointing
- `trainer_args.output_dir`: Output directory for checkpoints
- `trainer_args.run_name`: Name for this training run
- `trainer_args.bf16`: Use bfloat16 precision
- `trainer_args.fsdp2`: Enable FSDP2 distributed training
- `trainer_args.use_liger_kernel`: Enable Liger kernel optimizations
- `trainer_args.use_rmpad`: Enable padding removal optimization
- `trainer_args.ema_enabled`: Enable EMA (default: `false`)
- `trainer_args.ema_decay`: EMA decay rate (default: `0.9999`)
- `trainer_args.ema_update_every`: Update EMA every N steps (default: `1`)
- `trainer_args.ema_start_step`: Start EMA from step N (default: `0`)
- `trainer_args.ema_requires_grad_only`: Only apply EMA to trainable parameters (default: `true`)
- `trainer_args.ema_param_filter`: Filter parameters by name (supports `mode`, `include`, `exclude`)
- `trainer_args.ema_resume_from_ema`: Resume training from EMA weights (default: `false`)
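
The EMA options can be read against the standard exponential-moving-average update, `ema = decay * ema + (1 - decay) * param`. The sketch below applies that update to a single float. How `ema_decay`, `ema_update_every`, and `ema_start_step` interact here is an assumption for illustration (in particular, copying the live value before the start step), not a copy of the engine's implementation, and `EmaTracker` is a hypothetical name.

```python
class EmaTracker:
    """Toy EMA over one scalar 'parameter'; hypothetical, for illustration only."""

    def __init__(self, value, decay=0.9999, update_every=1, start_step=0):
        self.ema = float(value)
        self.decay = decay
        self.update_every = update_every
        self.start_step = start_step

    def update(self, step, value):
        if step < self.start_step:
            # Assumption: before EMA starts, the shadow copy mirrors the live value.
            self.ema = float(value)
        elif (step - self.start_step) % self.update_every == 0:
            # Standard EMA update, applied every `update_every` steps.
            self.ema = self.decay * self.ema + (1.0 - self.decay) * value
        return self.ema


# A small decay makes the effect visible; training would use values near 0.9999.
tracker = EmaTracker(value=1.0, decay=0.5, update_every=2, start_step=2)
live_values = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
history = [tracker.update(step, v) for step, v in enumerate(live_values)]
print(history)
```

With a decay close to 1, the averaged weights change slowly and smooth out step-to-step noise in the live weights.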

### Advanced Example

See `examples/qwen3_vl/qwen3_vl_8b_train.sh` for a complete training script using Hydra overrides with comprehensive parameter configuration for multi-GPU training.

### Tips

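- Use quotes for string values: `dataset_config.processor_config.processor_name="Qwen/Qwen2.5-VL-7B-Instruct"`
- Use dot notation for nested configs: `trainer_args.learning_rate=1.0e-06`
- Boolean values: `dataset_config.packing=true` or `dataset_config.packing=false`
- For complex values (lists/arrays), use Hydra's list syntax and wrap the whole override in single quotes so the shell does not interpret the brackets or strip the inner quotes: `'trainer_args.fsdp_config.transformer_layer_cls_to_wrap=["Qwen2_5_VLDecoderLayer"]'`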
- You can mix YAML config files with CLI overrides: `config_yaml=${CONFIG} trainer_args.learning_rate=1.0e-05` (CLI overrides take precedence)
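
Because list values contain brackets and quotes that the shell would otherwise interpret, it can help to build the command programmatically. The following is a minimal sketch using only the standard library: `build_command` is a hypothetical helper, `configs/train.yaml` is a placeholder path, and the `torchrun` flags simply mirror the launch command shown earlier.

```python
import shlex
from typing import Dict, List


def build_command(config_yaml: str, overrides: Dict[str, str], nproc: int = 8) -> List[str]:
    """Assemble a torchrun argument list: a YAML config plus Hydra overrides."""
    cmd = [
        "torchrun",
        f"--nproc_per_node={nproc}",
        "-m", "lmms_engine.launch.cli",
        f"config_yaml={config_yaml}",
    ]
    # Each override is one argument, so no manual shell quoting is needed here.
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


cmd = build_command(
    "configs/train.yaml",  # placeholder path
    {
        "trainer_args.learning_rate": "1.0e-05",
        "trainer_args.fsdp_config.transformer_layer_cls_to_wrap": '["Qwen2_5_VLDecoderLayer"]',
    },
)

# shlex.join adds single quotes only where the shell needs them,
# producing a line that is safe to paste into a terminal.
print(shlex.join(cmd))
```

The list `cmd` can also be passed directly to `subprocess.run(cmd, check=True)`, which bypasses the shell and avoids quoting issues altogether.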