Qwen2.5-Omni Training

Overview

Qwen2.5-Omni is a unified multimodal model supporting image, audio, and text understanding.

Supported Features

Feature              Support
FSDP2
USP
Muon Optimizer
Liger Kernel
Packing
NSA
Expert Parallelism

Highlights: Unified multimodal (image, audio, text)

Quick Start

See the example configuration and run script:

Key Configuration

dataset_config:
  dataset_type: qwen_omni_iterable
  processor_config:
    processor_type: Qwen2_5OmniProcessor
    audio_max_length: 60
  video_backend: qwen_omni_utils

model_config:
  load_from_pretrained_path: Qwen/Qwen2.5-Omni-7B
  attn_implementation: flash_attention_2

trainer_args:
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
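
To make the `use_rmpad` option more concrete, the sketch below illustrates the general idea of padding removal, assuming the flag refers to the unpadding scheme commonly paired with FlashAttention variable-length kernels: padded rows are flattened into one token stream, and cumulative sequence lengths mark where each sample starts. The function name `remove_padding`, the `PAD` id, and the toy batch are hypothetical and are not taken from the training framework itself.

```python
from itertools import accumulate


def remove_padding(batch, pad_id):
    """Flatten a padded batch into one token stream plus cumulative lengths.

    Illustrative only: assumes pad_id never occurs as a real token.
    """
    flat_tokens = []
    seq_lens = []
    for row in batch:
        # Keep only the real tokens of this sample.
        tokens = [t for t in row if t != pad_id]
        flat_tokens.extend(tokens)
        seq_lens.append(len(tokens))
    # cu_seqlens[i] is the offset where sample i starts;
    # the final entry is the total number of real tokens.
    cu_seqlens = [0, *accumulate(seq_lens)]
    return flat_tokens, cu_seqlens


PAD = 0  # hypothetical padding id
batch = [
    [11, 12, 13, PAD, PAD],
    [21, 22, PAD, PAD, PAD],
    [31, 32, 33, 34, 35],
]

flat_tokens, cu_seqlens = remove_padding(batch, PAD)
print(flat_tokens)  # 10 real tokens instead of 15 padded slots
print(cu_seqlens)   # sample boundaries within the flat stream
```

In this toy batch a third of the slots are padding, so the flattened stream does proportionally less work per step. The boundary offsets are what allow attention to stay confined to each sample after flattening.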