Qwen3-MoE Training
Overview
Qwen3-MoE is a Mixture-of-Experts language model with Expert Parallelism support.
Supported Features
| Feature | Support |
|---|---|
| FSDP2 | ✅ |
| USP | ❌ |
| Muon Optimizer | ✅ |
| Liger Kernel | ✅ |
| Packing | ✅ |
| NSA | ❌ |
| Expert Parallelism (EP) | ✅ |
Highlights: Mixture-of-Experts with Expert Parallelism
Quick Start
See the example configuration and run script:
Example Config: examples/qwen3_moe/qwen3_moe_ep8.yaml
Run Script: examples/qwen3_moe/run.sh
Key Configuration
dataset_config:
  dataset_type: vision_iterable
  processor_config:
    processor_type: qwen2
model_config:
  load_from_pretrained_path: "Qwen/Qwen3-VL-30B-A3B-Instruct"
  attn_implementation: flash_attention_2
  monkey_patch_kwargs:
    patch_type: ["liger"]
    fused_linear_cross_entropy: true
trainer_args:
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen3MoeDecoderLayer"]
  ep_degree: 8  # Expert Parallelism degree
Expert Parallelism
Expert Parallelism (EP) distributes the MoE experts across GPUs, so each GPU holds only a subset of the experts. Set ep_degree to match your GPU count (e.g., 2, 4, or 8).
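
To make the effect of ep_degree concrete, the sketch below shows one way experts could be split across EP ranks. It is not the trainer's actual sharding code: it assumes an even, contiguous split, and the helper names (`experts_for_rank`, `ep_groups`) and the expert count of 128 are hypothetical, chosen only for illustration.

```python
from typing import List


def experts_for_rank(num_experts: int, ep_degree: int, ep_rank: int) -> List[int]:
    """Return the expert indices one EP rank would hold.

    Assumes an even, contiguous split (an illustrative choice, not
    necessarily what the trainer does internally).
    """
    if ep_degree <= 0:
        raise ValueError("ep_degree must be positive")
    if not 0 <= ep_rank < ep_degree:
        raise ValueError("ep_rank must be in [0, ep_degree)")
    if num_experts % ep_degree != 0:
        raise ValueError("num_experts must be divisible by ep_degree for an even split")
    per_rank = num_experts // ep_degree
    start = ep_rank * per_rank
    return list(range(start, start + per_rank))


def ep_groups(world_size: int, ep_degree: int) -> int:
    """Return how many EP groups of size ep_degree fit into world_size GPUs."""
    if ep_degree <= 0 or world_size <= 0:
        raise ValueError("world_size and ep_degree must be positive")
    if world_size % ep_degree != 0:
        raise ValueError("world_size must be divisible by ep_degree")
    return world_size // ep_degree


if __name__ == "__main__":
    NUM_EXPERTS = 128  # hypothetical expert count, for illustration only
    EP_DEGREE = 8      # mirrors ep_degree: 8 from the config above
    for rank in range(EP_DEGREE):
        local = experts_for_rank(NUM_EXPERTS, EP_DEGREE, rank)
        print(f"rank {rank}: experts {local[0]}..{local[-1]} ({len(local)} per GPU)")
    print("EP groups on 8 GPUs:", ep_groups(8, EP_DEGREE))
```

Under these assumptions, raising ep_degree lowers the number of experts each GPU has to hold, which is the main reason to scale it with the GPU count.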