Qwen3-MoE Training

Overview

Qwen3-MoE is a Mixture-of-Experts language model with Expert Parallelism support.

Supported Features

Feature	Support
FSDP2	✅
USP	❌
Muon Optimizer	✅
Liger Kernel	✅
Packing	✅
NSA	❌
Expert Parallelism (EP)	✅

Highlights: Mixture-of-Experts with Expert Parallelism

Quick Start

See the example configuration and run script:

Example Config: examples/qwen3_moe/qwen3_moe_ep8.yaml
Run Script: examples/qwen3_moe/run.sh

Key Configuration

dataset_config:
  dataset_type: vision_iterable
  processor_config:
    processor_type: qwen2

model_config:
  load_from_pretrained_path: "Qwen/Qwen3-VL-30B-A3B-Instruct"
  attn_implementation: flash_attention_2
  monkey_patch_kwargs:
    patch_type: ["liger"]
    fused_linear_cross_entropy: true

trainer_args:
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen3MoeDecoderLayer"]
  ep_degree: 8  # Expert Parallelism degree

Expert Parallelism

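Expert Parallelism (EP) shards the experts of each MoE layer across GPUs, so each GPU stores and runs only a subset of the experts. Set ep_degree to match the number of GPUs you train on (e.g., 2, 4, or 8); the configuration above uses ep_degree: 8.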
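
As a rough illustration of what ep_degree controls, the sketch below splits the experts of one MoE layer into equal contiguous blocks, one block per EP rank. The helper name experts_for_rank, the contiguous layout, the divisibility requirement, and the expert count of 128 are assumptions made for illustration; they are not taken from this framework, whose actual expert placement may differ.

```python
def experts_for_rank(num_experts: int, ep_degree: int, ep_rank: int) -> range:
    """Return the expert indices hosted by one EP rank (hypothetical helper).

    Assumes an even, contiguous split: rank r holds experts
    [r * per_rank, (r + 1) * per_rank).
    """
    if ep_degree <= 0:
        raise ValueError("ep_degree must be a positive integer")
    if not 0 <= ep_rank < ep_degree:
        raise ValueError(f"ep_rank must be in [0, {ep_degree}), got {ep_rank}")
    if num_experts % ep_degree != 0:
        # An even split is only possible when ep_degree divides num_experts.
        raise ValueError(
            f"num_experts ({num_experts}) is not divisible by ep_degree ({ep_degree})"
        )
    per_rank = num_experts // ep_degree
    start = ep_rank * per_rank
    return range(start, start + per_rank)


if __name__ == "__main__":
    num_experts = 128  # illustrative value, not read from any model config
    for degree in (2, 4, 8):
        block = experts_for_rank(num_experts, degree, ep_rank=0)
        print(f"ep_degree={degree}: rank 0 holds {len(block)} experts "
              f"({block.start}..{block.stop - 1})")
```

Under these assumptions, doubling ep_degree halves the number of experts each GPU holds, which is why a larger ep_degree reduces per-GPU expert memory.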