Qwen-VL Model Training Guide

Qwen-VL models are state-of-the-art multimodal models that support image and video understanding. This guide covers training both Qwen2.5-VL and Qwen3-VL models using the LMMS Engine.

Overview

Qwen2.5-VL

  • Architecture: Advanced vision-language model with M-RoPE (Multimodal Rotary Position Embedding)

  • Position Encoding: 3D RoPE for temporal (T), height (H), width (W) dimensions

  • Modalities: Image and Video understanding

  • Context Length: Up to 128K tokens

  • Key Features: 3D M-RoPE, Dynamic resolution ViT, Flash Attention 2, Liger Kernel, RMPad, Sequence Parallelism

Qwen3-VL

  • Architecture: Latest generation with Interleaved-MRoPE and DeepStack visual feature fusion

  • Position Encoding: Interleaved-MRoPE with enhanced text-timestamp alignment

  • Unique Feature: DeepStack - multi-layer visual embeddings fused into early language model layers

  • Modalities: Image and Video understanding (optimized for long videos)

  • Context Length: 256K tokens (native), extendable to 1M tokens

  • Key Features: DeepStack fusion, Interleaved 3D M-RoPE, Long video support (>1 hour), Flash Attention 2, Sequence Parallelism

Prerequisites

  • LMMS Engine installation

  • CUDA-compatible GPU with sufficient memory

  • PyTorch with FSDP2 support

  • Flash Attention 2 (recommended)

Install Flash Attention

uv pip install flash-attn --no-build-isolation

If you encounter symbol errors:

uv pip install --no-build-isolation --no-cache-dir flash-attn

Quick Start

1. Prepare Your Dataset

Prepare your dataset in OpenAI chat messages format with image/video/audio content types. See Data Preparation Guide for details.

Example data structure:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "path/to/image.jpg"}},
        {"type": "text", "text": "Describe this image"}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "This image shows..."}
      ]
    }
  ]
}

2. Configure Training

Create a YAML configuration file for your model.

Training Configuration (Example)

Qwen2.5-VL Configuration

- type: trainer
  config:
    trainer_type: fsdp2_trainer
    
    # Dataset configuration
    dataset_config:
      dataset_type: vision                    # Or vision_audio for audio support
      dataset_format: yaml
      
      datasets:
        - path: "path/to/your/dataset.parquet"
          data_folder: ""
          data_type: parquet
      
      # Processor configuration
      processor_config:
        processor_name: "Qwen/Qwen2.5-VL-7B-Instruct"  # Or 3B/72B variants
        processor_type: "qwen2_5_vl"
      
      # Packing configuration
      packing: true
      packing_strategy: first_fit
      packing_length: 16384
      
      # Video configuration
      video_backend: qwen_vl_utils
      video_sampling_strategy: fps
      video_max_pixels: 50176                 # 224 * 224
      video_max_frames: 512
      fps: 1
    
    # Model configuration
    model_config:
      load_from_pretrained_path: "Qwen/Qwen2.5-VL-7B-Instruct"
      attn_implementation: "flash_attention_2"
    
    # Training hyperparameters
    per_device_train_batch_size: 1
    learning_rate: 1.0e-06
    weight_decay: 0.0
    gradient_accumulation_steps: 1
    gradient_checkpointing: true
    num_train_epochs: 1
    save_steps: 100
    save_total_limit: 1
    report_to: "wandb"
    output_dir: "./output/qwen2_5_vl"
    warmup_ratio: 0.0
    run_name: "qwen2_5_vl_training"
    eval_strategy: "no"
    logging_steps: 1
    group_by_length: true
    dataloader_num_workers: 8
    bf16: true
    lr_scheduler_type: "cosine"
    
    # Optional: Freeze vision encoder
    freeze_modules: ["visual"]
    
    # Performance optimizations
    use_liger_kernel: true
    use_rmpad: true
    
    # FSDP2 configuration
    fsdp2: true
    fsdp_config:
      transformer_layer_cls_to_wrap: ["Qwen2_5_VLDecoderLayer"]
      reshard_after_forward: false
    
    # Optional: Sequence parallelism
    sp_ulysses_degree: 1                       # Set to 2, 4, 8 for sequence parallel

Qwen3-VL Configuration

- type: trainer
  config:
    trainer_type: fsdp2_trainer
    
    # Dataset configuration
    dataset_config:
      dataset_type: qwen3_vl_iterable           # Use iterable dataset for Qwen3-VL
      dataset_format: yaml
      
      datasets:
        - path: "path/to/your/dataset.parquet"
          data_folder: ""
          data_type: parquet
      
      # Processor configuration
      processor_config:
        processor_name: "Qwen/Qwen3-VL-8B-Instruct"  # Or 4B variant
        processor_type: "qwen3_vl"
      
      # Packing configuration
      packing: false                             # Note: packing for Qwen3-VL
      packing_length: 51200
      filter_overlong: true
      
      # Video configuration - Qwen3-VL optimized
      video_backend: qwen_vl_utils
      video_sampling_strategy: fps
      video_max_pixels: 50176                    # 224 * 224
      video_max_frames: 512
      fps: 1
    
    # Model configuration
    model_config:
      load_from_pretrained_path: "Qwen/Qwen3-VL-8B-Instruct"
      attn_implementation: "flash_attention_2"
    
    # Training hyperparameters
    per_device_train_batch_size: 1
    learning_rate: 2.0e-04                       # Slightly higher for Qwen3-VL
    weight_decay: 0.0
    gradient_accumulation_steps: 1
    gradient_checkpointing: true
    max_steps: 1000                              # Use max_steps for iterable dataset
    save_steps: 1000
    save_total_limit: 1
    report_to: "wandb"
    output_dir: "./output/qwen3_vl"
    warmup_ratio: 0.1
    run_name: "qwen3_vl_training"
    eval_strategy: "no"
    logging_steps: 1
    dataloader_num_workers: 8
    bf16: true
    lr_scheduler_type: "cosine"
    
    # Performance optimizations
    use_liger_kernel: true
    use_rmpad: true
    
    # FSDP2 configuration
    fsdp2: true
    fsdp_config:
      transformer_layer_cls_to_wrap: ["Qwen3VLTextDecoderLayer"]
      reshard_after_forward: false
    
    # Optional: Sequence parallelism
    sp_ulysses_degree: 1

Key Configuration Parameters

Dataset Type (Example)

Model

dataset_type

Description

Qwen2.5-VL

vision

Map-style dataset, supports packing

Qwen3-VL

qwen3_vl_iterable

Streaming dataset optimized for Qwen3-VL

Processor Configuration

  • processor_name: HuggingFace model identifier

    • Qwen2.5-VL: Qwen/Qwen2.5-VL-3B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen2.5-VL-72B-Instruct

    • Qwen3-VL: Qwen/Qwen3-VL-4B-Instruct, Qwen/Qwen3-VL-8B-Instruct

  • processor_type: Must match the model series

    • Qwen2.5-VL: "qwen2_5_vl"

    • Qwen3-VL: "qwen3_vl"

FSDP2 Configuration

FSDP2 (Fully Sharded Data Parallel v2) is recommended for training large Qwen-VL models:

fsdp2: true
fsdp_config:
  # Qwen2.5-VL
  transformer_layer_cls_to_wrap: ["Qwen2_5_VLDecoderLayer"] # include "Qwen3VLVisionBlock" to wrap ViT layers
  
  # Qwen3-VL
  # transformer_layer_cls_to_wrap: ["Qwen3VLTextDecoderLayer"]
  
  reshard_after_forward: false # If true, reshard parameters after each forward pass (saves memory but increases communication)

Advanced Features

Sequence Parallelism

Both Qwen2.5-VL and Qwen3-VL support Ulysses-style sequence parallelism for long context training:

trainer_args:
  sp_ulysses_degree: 2  # Sequence parallel degree (1, 2, 4, 8)

Benefits:

  • Enables training with longer sequences

  • Reduces memory per GPU

  • Scales efficiently across GPUs

Requirements:

  • Flash Attention 2 must be installed

  • use_rmpad: true recommended

  • Number of attention heads must be divisible by sp_ulysses_degree

Liger Kernel

Liger Kernel provides fused kernels for efficient training:

trainer_args:
  use_liger_kernel: true

Optimizations:

  • Fused CrossEntropy kernel (~30% memory reduction)

  • Fused RMSNorm

  • Fused RoPE

  • Fused SwiGLU

RMPad (Remove Padding)

RMPad removes padding tokens for more efficient computation:

trainer_args:
  use_rmpad: true

Benefits:

  • ~15-25% speedup by removing pad token computation

  • Works seamlessly with Flash Attention 2

  • Essential for packing efficiency

Freezing Modules

Freeze the vision encoder for faster training when only fine-tuning language understanding:

trainer_args:
  freeze_modules: ["visual"]

Mixed Precision Training

  • bf16: Recommended for stability and performance

  • fp16: Alternative if bf16 not supported

trainer_args:
  bf16: true          # Preferred
  # fp16: true        # Alternative

Gradient Checkpointing

Reduces memory at the cost of computation:

trainer_args:
  gradient_checkpointing: true

Run Training

Launch Command

export NCCL_BLOCKING_WAIT=0
export TOKENIZERS_PARALLELISM=false

# Optional: HuggingFace setup
export HF_TOKEN="<YOUR HF_TOKEN>"
export HF_HOME="$HOME/.cache/huggingface"
export HF_HUB_ENABLE_HF_TRANSFER="1"

export NCCL_DEBUG=INFO

CONFIG="your_config.yaml"

torchrun --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=127.0.0.1 \
    --master_port=8000 \
    -m lmms_engine.launch.cli config_yaml=${CONFIG}

Multi-Node Training

# Node 0
torchrun --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=<MASTER_NODE_IP> \
    --master_port=8000 \
    -m lmms_engine.launch.cli config_yaml=${CONFIG}

# Node 1
torchrun --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr=<MASTER_NODE_IP> \
    --master_port=8000 \
    -m lmms_engine.launch.cli config_yaml=${CONFIG} \
    hydra.output_subdir=null hydra/job_logging=disabled

In multi-node training, simultaneous starts cause Hydra conflicts due to timestamped working directories. Use hydra.output_subdir=null and hydra/job_logging=disabled to fix this.

Model Architecture Details

Qwen2.5-VL Architecture

Core Components:

  • Language Model: Qwen2.5 decoder architecture (e.g., 3B/7B/72B variants)

  • Vision Encoder: ViT-based encoder with dynamic resolution support

  • Position Encoding: M-RoPE (Multimodal Rotary Position Embedding)

    • Separate position encodings for temporal (T), height (H), width (W) dimensions

    • Enables better alignment of visual tokens with text sequences

    • Uses mrope_section parameter to split RoPE across 3 dimensions

    • Computed via apply_multimodal_rotary_pos_emb with RoPE deltas

  • Video Processing:

    • Temporal-aware processing using RoPE deltas

    • Supports temporal grid (T, H, W) for video frames

    • Native video token integration in language model

  • Context Length: Up to 128K tokens

  • Modality Support: Image, Video, and optional Audio (via audio encoder)

Key Features:

  • Dynamic resolution ViT allows variable image sizes

  • M-RoPE provides fine-grained spatial-temporal position encoding

  • Unified multimodal token processing in language model

Qwen3-VL Architecture

Core Components:

  • Language Model: Qwen3 decoder architecture (e.g., 4B/8B variants) with efficiency improvements

  • Vision Encoder: Enhanced ViT with multi-layer feature extraction

  • Position Encoding: Interleaved-MRoPE

    • Improved version of M-RoPE with better text-timestamp alignment

    • Optimized for long video processing with second-level indexing

    • Enhanced temporal understanding for video sequences

  • DeepStack Feature (Unique to Qwen3-VL):

    • Extracts visual features from multiple vision encoder layers

    • Fuses multi-layer visual embeddings into language model’s early layers

    • Provides fine-grained visual-language alignment

    • Reference: DeepStack Paper

  • Video Processing:

    • Optimized for long videos (supports >1 hour)

    • Second-level timestamp alignment with text

    • Enhanced temporal reasoning capabilities

  • Context Length: Native support for 256K tokens, extendable to 1M tokens

  • Modality Support: Image and Video (optimized for long-form video understanding)

Key Features:

  • DeepStack multi-layer visual feature fusion

  • Interleaved-MRoPE for superior temporal alignment

  • Extended context length for long videos and documents

  • Improved efficiency in video token processing

Architecture Comparison

Feature

Qwen2.5-VL

Qwen3-VL

Position Encoding

M-RoPE (3D: T, H, W)

Interleaved-MRoPE

Visual Feature Fusion

Single-layer fusion

DeepStack multi-layer fusion

Video Temporal Alignment

RoPE deltas

Second-level timestamp alignment

Context Length

128K tokens

256K-1M tokens

Long Video Support

Good

Excellent (>1 hour)

Model Sizes

3B, 7B, 72B

4B, 8B

Primary Use Case

General multimodal

Long-form video & document understanding

Model Selection Guide

Choose Qwen2.5-VL if you:

  • Need audio understanding capabilities

  • Want larger model options (72B for best performance)

  • Require general-purpose multimodal understanding

  • Work with images, short-medium videos, and audio

  • Need mature, well-tested architecture

Choose Qwen3-VL if you:

  • Focus on long video understanding (>1 hour)

  • Need extended context length (>128K tokens)

  • Require fine-grained visual-language alignment (DeepStack)

  • Work primarily with video analysis and temporal reasoning

  • Want improved efficiency with smaller model sizes

  • Need second-level timestamp alignment for videos

Performance Considerations:

  • Qwen2.5-VL 7B: Balanced choice for most multimodal tasks

  • Qwen2.5-VL 72B: Best performance, requires significant compute

  • Qwen3-VL 8B: Optimal for long video understanding with moderate compute

  • Qwen3-VL 4B: Efficient choice for video tasks with limited resources

Troubleshooting

Common Issues

1. Out of Memory (OOM)

Solutions:

  • Reduce per_device_train_batch_size

  • Enable gradient_checkpointing: true

  • Reduce video_max_pixels or video_max_frames

  • Increase gradient_accumulation_steps

  • Enable sequence parallelism with sp_ulysses_degree: 2

2. Flash Attention Installation Issues

Problem: Symbol not found or compilation errors

Solution:

# Clear cache and reinstall
pip uninstall flash-attn -y
uv pip install --no-build-isolation --no-cache-dir flash-attn

3. Slow Training Speed

Optimizations:

  • Enable use_liger_kernel: true

  • Enable use_rmpad: true

  • Enable group_by_length: true for better batching

  • Increase dataloader_num_workers

  • Use bf16 instead of fp16

  • Enable packing for Qwen2.5-VL: packing: true

4. Video Loading Errors

Problem: Video cannot be loaded or processed

Solutions:

  • Ensure qwen-vl-utils is installed: pip install qwen-vl-utils

  • Check video file format compatibility

  • Reduce video_max_frames if videos are too long

  • Verify video_backend: qwen_vl_utils is set

5. Qwen3-VL Dataset Length Unknown

Problem: Can’t calculate steps per epoch with iterable dataset

Solution: Always use max_steps instead of num_train_epochs:

trainer_args:
  max_steps: 1000                # Required for iterable datasets
  # num_train_epochs: 1          # Required for map-style datasets

Performance Tips

Optimizing Training Speed

  1. Use appropriate batch size:

    • Start with per_device_train_batch_size: 1

    • Increase gradient_accumulation_steps to simulate larger batches

  2. Enable all optimizations:

    use_liger_kernel: true
    use_rmpad: true
    group_by_length: true
    bf16: true
    
  3. Video preprocessing:

    • Use lower fps for faster loading (e.g., fps: 0.5 for 1 frame per 2 seconds)

    • Reduce video_max_frames if full video not needed

  4. Sequence parallelism for long sequences:

    • Set sp_ulysses_degree: 2 or higher for sequences > 32K tokens

Memory Management

  1. Estimate memory usage:

    • 7B model with batch_size=1: ~40GB

    • 72B model with batch_size=1: ~150GB

  2. Reduce memory footprint:

    • Enable gradient checkpointing

    • Use FSDP2 for multi-GPU training

    • Freeze visual encoder if only training language understanding

Best Practices

  1. Start with pretrained models: Always use official Qwen checkpoints from HuggingFace

  2. Use BF16 training: More stable than FP16 for these models

  3. Enable packing for Qwen2.5-VL: Significantly improves throughput

  4. Monitor training metrics: Use WandB or TensorBoard for tracking

  5. Save checkpoints frequently: Set reasonable save_steps values

  6. Test with small dataset first: Verify configuration before full training

Model Variants

Qwen2.5-VL

Model

Parameters

Context Length

Recommended Use

Qwen2.5-VL-3B-Instruct

3B

128K

Fast inference, limited resources

Qwen2.5-VL-7B-Instruct

7B

128K

Balanced performance and efficiency

Qwen2.5-VL-72B-Instruct

72B

128K

Best performance, requires significant resources

Qwen3-VL

Model

Parameters

Context Length

Recommended Use

Qwen3-VL-4B-Instruct

4B

Extended

Efficient training and inference

Qwen3-VL-8B-Instruct

8B

Extended

Enhanced performance with DeepStack

Additional Resources

Official Documentation

Technical Papers

LMMS Engine Guides

Community Resources