# Qwen3-MoE Training

## Overview

Qwen3-MoE is a Mixture-of-Experts language model with Expert Parallelism support.

## Supported Features

| Feature | Support |
|---------|---------|
| **FSDP2** | ✅ |
| **USP** | ❌ |
| **Muon Optimizer** | ✅ |
| **Liger Kernel** | ✅ |
| **Packing** | ✅ |
| **NSA** | ❌ |
| **Expert Parallelism (EP)** | ✅ |

**Highlights**: Mixture-of-Experts with Expert Parallelism

## Quick Start

See the example configuration and run script:

- **Example Config**: [examples/qwen3_moe/qwen3_moe_ep8.yaml](../../examples/qwen3_moe/qwen3_moe_ep8.yaml)
- **Run Script**: [examples/qwen3_moe/run.sh](../../examples/qwen3_moe/run.sh)

## Key Configuration

```yaml
dataset_config:
  dataset_type: vision_iterable
  processor_config:
    processor_type: qwen2

model_config:
  load_from_pretrained_path: "Qwen/Qwen3-VL-30B-A3B-Instruct"
  attn_implementation: flash_attention_2
  monkey_patch_kwargs:
    patch_type: ["liger"]
    fused_linear_cross_entropy: true

trainer_args:
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen3MoeDecoderLayer"]
  ep_degree: 8  # Expert Parallelism degree
```

## Expert Parallelism

Expert Parallelism (EP) distributes MoE experts across GPUs. Configure `ep_degree` to match your GPU count (e.g., 2, 4, 8).
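
To make the effect of `ep_degree` concrete, here is a minimal Python sketch of how experts might be split across EP ranks. It is illustrative only: the helper names `experts_for_rank` and `ep_groups` are hypothetical, and the contiguous, even sharding shown here is an assumption rather than a description of what the trainer does internally. The expert count used in the usage note is likewise just an example value.

```python
from typing import List


def experts_for_rank(num_experts: int, ep_degree: int, ep_rank: int) -> range:
    """Return the expert indices owned by one EP rank.

    Assumes contiguous, even sharding (hypothetical; the real
    framework may assign experts differently).
    """
    if ep_degree <= 0:
        raise ValueError("ep_degree must be positive")
    if num_experts % ep_degree != 0:
        raise ValueError("num_experts must be divisible by ep_degree")
    if not 0 <= ep_rank < ep_degree:
        raise ValueError("ep_rank must be in [0, ep_degree)")
    per_rank = num_experts // ep_degree
    return range(ep_rank * per_rank, (ep_rank + 1) * per_rank)


def ep_groups(world_size: int, ep_degree: int) -> List[List[int]]:
    """Group global ranks into EP groups of size ep_degree.

    Assumes consecutive global ranks form one EP group (hypothetical).
    """
    if ep_degree <= 0:
        raise ValueError("ep_degree must be positive")
    if world_size % ep_degree != 0:
        raise ValueError("world_size must be divisible by ep_degree")
    return [
        list(range(start, start + ep_degree))
        for start in range(0, world_size, ep_degree)
    ]
```

Under these assumptions, a model with 128 experts per MoE layer and `ep_degree: 8` would place 16 experts on each rank, e.g. `experts_for_rank(128, 8, 1)` covers experts 16 through 31. The divisibility checks reflect a common constraint in EP setups: a degree that does not evenly divide the expert count or the GPU count leaves some ranks with an uneven or empty share.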