Qwen3-Omni MoE Training

Overview

Qwen3-Omni MoE is a multimodal Mixture-of-Experts (MoE) model that handles image, audio, and text inputs. Training supports Expert Parallelism (EP).

Supported Features

Feature                    Support
FSDP2
USP
Muon Optimizer
Liger Kernel
Packing
NSA
Expert Parallelism (EP)

Highlights: Multimodal MoE with EP (image, audio, text)

Quick Start

See the example configuration:

Key Configuration

dataset_config:
  dataset_type: qwen_omni_iterable
  processor_config:
    processor_type: Qwen2_5OmniProcessor
  video_backend: qwen_omni_utils

model_config:
  attn_implementation: flash_attention_2
  monkey_patch_kwargs:
    patch_type: ["liger"]

trainer_args:
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
  ep_degree: 2  # Expert Parallelism degree

Expert Parallelism

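Expert Parallelism (EP) shards the MoE experts across GPUs, so each GPU stores and computes only a subset of the experts. Set ep_degree under trainer_args to the number of GPUs the experts are split across; the configuration above uses ep_degree: 2.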
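
To make the choice of ep_degree concrete, here is a minimal Python sketch of how experts could be divided among EP ranks. It is an illustration under stated assumptions, not the framework's actual placement logic. The function name expert_placement, the contiguous block layout, and the values num_experts=8 and world_size=4 are all hypothetical. The divisibility checks reflect the usual constraint in EP implementations and are not confirmed by this page.

```python
def expert_placement(num_experts: int, world_size: int, ep_degree: int) -> dict[int, list[int]]:
    """Map each EP rank to the expert indices it would own (hypothetical layout)."""
    # Assumption: GPUs are grouped into EP groups of size ep_degree,
    # so the total GPU count must be a multiple of ep_degree.
    if world_size % ep_degree != 0:
        raise ValueError(f"world_size={world_size} is not divisible by ep_degree={ep_degree}")
    # Assumption: experts are split evenly, so every rank owns the same number.
    if num_experts % ep_degree != 0:
        raise ValueError(f"num_experts={num_experts} is not divisible by ep_degree={ep_degree}")

    experts_per_rank = num_experts // ep_degree
    # Contiguous blocks: rank 0 gets the first block, rank 1 the next, and so on.
    return {
        rank: list(range(rank * experts_per_rank, (rank + 1) * experts_per_rank))
        for rank in range(ep_degree)
    }


# Illustrative values only: 8 experts, 4 GPUs, ep_degree: 2 as in the config above.
placement = expert_placement(num_experts=8, world_size=4, ep_degree=2)
print(placement)  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

In this sketch, a higher ep_degree means fewer experts per GPU. That lowers per-GPU expert memory but routes more tokens between GPUs. With 4 GPUs and ep_degree: 2 there would be two EP groups, each holding a full copy of the experts.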