Qwen2.5-Omni Training

Overview

Qwen2.5-Omni is a unified multimodal model supporting image, audio, and text understanding.

Supported Features

Feature	Support
FSDP2	✅
USP	✅
Muon Optimizer	✅
Liger Kernel	✅
Packing	✅
NSA	❌
Expert Parallelism	❌

Highlights: unified multimodal training over image, audio, and text inputs.

Quick Start

See the example configuration and run script:

Example Config: examples/qwen2_5_omni/example_config.yaml
Run Script: examples/qwen2_5_omni/run.sh

Key Configuration

The key settings from the example config are shown here, grouped by section:

dataset_config:
  dataset_type: qwen_omni_iterable
  processor_config:
    processor_type: Qwen2_5OmniProcessor
    audio_max_length: 60
  video_backend: qwen_omni_utils

model_config:
  # Hugging Face Hub model ID
  load_from_pretrained_path: Qwen/Qwen2.5-Omni-7B
  attn_implementation: flash_attention_2

trainer_args:
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
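
Before launching a run, it can help to confirm that a config still contains the keys listed above. The following is a minimal sketch, not part of the training framework: `EXAMPLE_CONFIG`, `EXPECTED_KEYS`, `get_path`, and `missing_keys` are hypothetical names introduced for illustration, and the config is written out as a Python dict so the snippet has no dependencies. With PyYAML installed, the same structure could be obtained with `yaml.safe_load` on the example config file.

```python
from typing import Any, Dict, List

# Hypothetical mirror of the Key Configuration YAML as a plain dict.
EXAMPLE_CONFIG: Dict[str, Any] = {
    "dataset_config": {
        "dataset_type": "qwen_omni_iterable",
        "processor_config": {
            "processor_type": "Qwen2_5OmniProcessor",
            "audio_max_length": 60,
        },
        "video_backend": "qwen_omni_utils",
    },
    "model_config": {
        "load_from_pretrained_path": "Qwen/Qwen2.5-Omni-7B",
        "attn_implementation": "flash_attention_2",
    },
    "trainer_args": {
        "use_liger_kernel": True,
        "use_rmpad": True,
        "fsdp2": True,
    },
}

# Dotted paths of the keys that appear in the Key Configuration section.
EXPECTED_KEYS: List[str] = [
    "dataset_config.dataset_type",
    "dataset_config.processor_config.processor_type",
    "dataset_config.processor_config.audio_max_length",
    "dataset_config.video_backend",
    "model_config.load_from_pretrained_path",
    "model_config.attn_implementation",
    "trainer_args.use_liger_kernel",
    "trainer_args.use_rmpad",
    "trainer_args.fsdp2",
]


def get_path(config: Dict[str, Any], dotted: str) -> Any:
    """Return the value at a dotted key path, or raise KeyError if absent."""
    node: Any = config
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(dotted)
        node = node[part]
    return node


def missing_keys(config: Dict[str, Any]) -> List[str]:
    """List every expected dotted key path that the config does not define."""
    missing = []
    for key in EXPECTED_KEYS:
        try:
            get_path(config, key)
        except KeyError:
            missing.append(key)
    return missing
```

Calling `missing_keys(EXAMPLE_CONFIG)` returns an empty list when every key is present. The check only covers key presence; it does not validate values or reflect how the trainer itself parses the file.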