# Qwen3-VL MoE Training

## Overview

Qwen3-VL MoE is a Mixture-of-Experts (MoE) vision-language model that accepts image, video, and text inputs. Training supports Expert Parallelism (EP).

## Supported Features

| Feature | Support |
|---------|---------|
| **FSDP2** | ✅ |
| **USP** | ✅ |
| **Muon Optimizer** | ✅ |
| **Liger Kernel** | ✅ |
| **Packing** | ✅ |
| **NSA** | ❌ |
| **Expert Parallelism (EP)** | ✅ |

**Highlights**: Vision-language MoE training with Expert Parallelism across image, video, and text inputs.

## Quick Start

See the example configuration and run script:
- **Example Config**: [examples/qwen3_vl_moe/qwen3_vl_moe_ep8.yaml](../../examples/qwen3_vl_moe/qwen3_vl_moe_ep8.yaml)
- **Run Script**: [examples/qwen3_vl_moe/run.sh](../../examples/qwen3_vl_moe/run.sh)

## Key Configuration

```yaml
dataset_config:
  dataset_type: qwen3_vl_iterable
  processor_config:
    processor_type: qwen3_vl

model_config:
  load_from_pretrained_path: "Qwen/Qwen3-VL-30B-A3B-Instruct"
  attn_implementation: flash_attention_2
  monkey_patch_kwargs:
    patch_type: ["liger"]
    fused_linear_cross_entropy: true

trainer_args:
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen3VLMoeTextDecoderLayer"]
  ep_degree: 8  # Expert Parallelism degree
```

## Expert Parallelism

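Expert Parallelism (EP) distributes the MoE experts across GPUs, so each GPU holds only a subset of the experts and training can scale to larger models. Set `ep_degree` to match the number of GPUs available (e.g., 2, 4, or 8). The example configuration above uses `ep_degree: 8`.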
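
To build intuition for what `ep_degree` controls, the sketch below shows one plausible way experts could be divided among EP ranks. It is an illustration only, not this framework's implementation. The helper name `expert_slice`, the contiguous split, the requirement that the expert count divide evenly by `ep_degree`, and the expert count of 128 are all assumptions. Check the checkpoint's own config for the real number of experts.

```python
def expert_slice(num_experts: int, ep_degree: int, ep_rank: int) -> range:
    """Return the expert indices one EP rank would own under a contiguous split.

    Hypothetical helper for illustration; the actual expert placement used
    by the training framework may differ.
    """
    if ep_degree <= 0:
        raise ValueError("ep_degree must be a positive integer")
    if not 0 <= ep_rank < ep_degree:
        raise ValueError("ep_rank must be in [0, ep_degree)")
    # Assumption: experts are split evenly, so the count must be divisible.
    if num_experts % ep_degree != 0:
        raise ValueError(
            f"num_experts={num_experts} is not divisible by ep_degree={ep_degree}"
        )
    per_rank = num_experts // ep_degree
    start = ep_rank * per_rank
    return range(start, start + per_rank)


if __name__ == "__main__":
    NUM_EXPERTS = 128  # assumed value for illustration only
    EP_DEGREE = 8      # matches `ep_degree: 8` in the example config
    for rank in range(EP_DEGREE):
        owned = expert_slice(NUM_EXPERTS, EP_DEGREE, rank)
        print(f"EP rank {rank}: experts {owned.start}..{owned.stop - 1}")
```

Under these assumptions, a larger `ep_degree` means fewer experts per GPU, which is why the setting is tied to how many GPUs are available.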