Qwen3-VL MoE Training

Overview

Qwen3-VL MoE is a Mixture-of-Experts (MoE) vision-language model that accepts image, video, and text inputs. It can be trained with Expert Parallelism (EP).

Supported Features

Feature                    Support
FSDP2
USP
Muon Optimizer
Liger Kernel
Packing
NSA
Expert Parallelism (EP)

Highlights: Vision-Language MoE with EP (image, video, text)

Quick Start

See the example configuration and run script:

Key Configuration

dataset_config:
  dataset_type: qwen3_vl_iterable
  processor_config:
    processor_type: qwen3_vl

model_config:
  load_from_pretrained_path: "Qwen/Qwen3-VL-30B-A3B-Instruct"
  attn_implementation: flash_attention_2
  monkey_patch_kwargs:
    patch_type: ["liger"]
    fused_linear_cross_entropy: true

trainer_args:
  use_liger_kernel: true
  use_rmpad: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen3VLMoeTextDecoderLayer"]
  ep_degree: 8  # Expert Parallelism degree

Expert Parallelism

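Expert Parallelism (EP) distributes the MoE experts across GPUs, so each GPU holds only a subset of the experts and training can scale to larger models. Set ep_degree under trainer_args to a value that fits your available GPUs (e.g., 2, 4, or 8); the configuration above uses 8.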
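
The sketch below illustrates how a choice of ep_degree could be sanity-checked before launching a run. It is not part of the training framework: the helper names (`plan_expert_parallel`, `experts_on_rank`, `EPLayout`) are hypothetical, the expert count of 128 is only an illustrative value, and the sketch assumes that experts are split evenly and contiguously across the ranks of an EP group, which may differ from what the trainer actually does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EPLayout:
    """Hypothetical summary of an expert-parallel layout (illustration only)."""
    ep_degree: int         # number of GPUs the experts are split across
    num_ep_groups: int     # how many such groups fit into the world size
    experts_per_rank: int  # experts held by each GPU in a group


def plan_expert_parallel(world_size: int, ep_degree: int, num_experts: int) -> EPLayout:
    """Check an ep_degree choice under the even-split assumption."""
    if ep_degree < 1:
        raise ValueError("ep_degree must be a positive integer")
    if world_size % ep_degree != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by ep_degree={ep_degree}"
        )
    if num_experts % ep_degree != 0:
        raise ValueError(
            f"num_experts={num_experts} is not divisible by ep_degree={ep_degree}"
        )
    return EPLayout(
        ep_degree=ep_degree,
        num_ep_groups=world_size // ep_degree,
        experts_per_rank=num_experts // ep_degree,
    )


def experts_on_rank(rank: int, ep_degree: int, num_experts: int) -> range:
    """Expert indices held by a global rank, assuming contiguous assignment."""
    per_rank = num_experts // ep_degree
    ep_rank = rank % ep_degree  # position of this rank inside its EP group
    return range(ep_rank * per_rank, (ep_rank + 1) * per_rank)


if __name__ == "__main__":
    # 8 GPUs with ep_degree: 8, as in the configuration above.
    # num_experts=128 is an assumed value for illustration.
    layout = plan_expert_parallel(world_size=8, ep_degree=8, num_experts=128)
    print(layout)
    print(list(experts_on_rank(rank=3, ep_degree=8, num_experts=128)))
```

Under these assumptions, a smaller ep_degree leaves more experts on each GPU (higher memory use per GPU) but forms more EP groups within the same number of GPUs.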