Qwen3-VL MoE Training
Overview
Qwen3-VL MoE is a Mixture-of-Experts vision-language model that accepts image, video, and text inputs and can be trained with Expert Parallelism (EP).
Supported Features
| Feature | Support |
|---|---|
| FSDP2 | ✅ |
| USP | ✅ |
| Muon Optimizer | ✅ |
| Liger Kernel | ✅ |
| Packing | ✅ |
| NSA | ❌ |
| Expert Parallelism (EP) | ✅ |
Highlights: Vision-Language MoE with EP (image, video, text)
Quick Start
See the example configuration and run script:
Example Config: examples/qwen3_vl_moe/qwen3_vl_moe_ep8.yaml
Run Script: examples/qwen3_vl_moe/run.sh
Key Configuration
```yaml
dataset_config:
  dataset_type: qwen3_vl_iterable
  processor_config:
    processor_type: qwen3_vl

model_config:
  load_from_pretrained_path: "Qwen/Qwen3-VL-30B-A3B-Instruct"
  attn_implementation: flash_attention_2
  monkey_patch_kwargs:
    patch_type: ["liger"]
    fused_linear_cross_entropy: true

trainer_args:
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen3VLMoeTextDecoderLayer"]
  ep_degree: 8  # Expert Parallelism degree
```
Expert Parallelism
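Expert Parallelism (EP) shards the MoE experts across GPUs so that each GPU holds only a subset of them, which lets training scale to more devices. Set ep_degree to the number of GPUs the experts should be split across (e.g., 2, 4, or 8); the example config uses 8.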
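
To make the effect of ep_degree concrete, here is a minimal Python sketch of how experts might be divided among EP ranks. It is an illustration only, not the framework's implementation: the helper name `expert_placement`, the contiguous layout, the divisibility checks, and the expert count of 128 are all assumptions made for this example. Check the model's config for the real number of experts.

```python
def expert_placement(num_experts: int, ep_degree: int, world_size: int) -> dict[int, range]:
    """Map each EP rank to the slice of expert indices it would hold.

    Hypothetical helper: assumes experts are split evenly and contiguously,
    and that ep_degree must divide both the GPU count and the expert count.
    """
    if world_size % ep_degree != 0:
        raise ValueError(f"ep_degree={ep_degree} does not divide world_size={world_size}")
    if num_experts % ep_degree != 0:
        raise ValueError(f"ep_degree={ep_degree} does not divide num_experts={num_experts}")
    per_rank = num_experts // ep_degree
    return {rank: range(rank * per_rank, (rank + 1) * per_rank) for rank in range(ep_degree)}


# Illustrative values: 128 experts (assumed), 8 GPUs, ep_degree: 8 as in the config above.
placement = expert_placement(num_experts=128, ep_degree=8, world_size=8)
for rank, experts in placement.items():
    print(f"EP rank {rank}: experts {experts.start}..{experts.stop - 1}")
```

Under these assumptions, halving ep_degree doubles the number of experts each rank holds, so a smaller ep_degree trades lower communication fan-out for higher per-GPU expert memory.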