# Qwen-VL Model Training Guide

Qwen-VL models are state-of-the-art multimodal models that support image and video understanding. This guide covers training both Qwen2.5-VL and Qwen3-VL models using the LMMS Engine.

## Overview

### Qwen2.5-VL

- **Architecture**: Advanced vision-language model with M-RoPE (Multimodal Rotary Position Embedding)
- **Position Encoding**: 3D RoPE for temporal (T), height (H), width (W) dimensions
- **Modalities**: Image and Video understanding
- **Context Length**: Up to 128K tokens
- **Key Features**: 3D M-RoPE, Dynamic resolution ViT, Flash Attention 2, Liger Kernel, RMPad, Sequence Parallelism

### Qwen3-VL

- **Architecture**: Latest generation with Interleaved-MRoPE and DeepStack visual feature fusion
- **Position Encoding**: Interleaved-MRoPE with enhanced text-timestamp alignment
- **Unique Feature**: DeepStack - multi-layer visual embeddings fused into early language model layers
- **Modalities**: Image and Video understanding (optimized for long videos)
- **Context Length**: 256K tokens (native), extendable to 1M tokens
- **Key Features**: DeepStack fusion, Interleaved 3D M-RoPE, Long video support (>1 hour), Flash Attention 2, Sequence Parallelism

## Prerequisites

- LMMS Engine installation
- CUDA-compatible GPU with sufficient memory
- PyTorch with FSDP2 support
- Flash Attention 2 (recommended)

### Install Flash Attention

```bash
uv pip install flash-attn --no-build-isolation
```

If you encounter symbol errors:

```bash
uv pip install --no-build-isolation --no-cache-dir flash-attn
```

## Quick Start

### 1. Prepare Your Dataset

Prepare your dataset in OpenAI chat messages format with image/video/audio content types. See [Data Preparation Guide](../user_guide/data_prep.md) for details.

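
A record in this format is just nested dictionaries and lists, so it can be assembled and sanity-checked programmatically. The following is a minimal sketch in Python and is not part of LMMS Engine: `make_record` and `check_record` are hypothetical helper names, and only the `text` and `image_url` content types are covered. How records are stored on disk is described in the Data Preparation Guide.

```python
import json


def make_record(image_path: str, question: str, answer: str) -> dict:
    """Build one single-turn sample in the OpenAI chat messages layout."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_path}},
                    {"type": "text", "text": question},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": answer}],
            },
        ]
    }


def check_record(record: dict) -> None:
    """Raise ValueError if the record lacks the basic messages layout."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("record needs a non-empty 'messages' list")
    for message in messages:
        if "role" not in message or not isinstance(message.get("content"), list):
            raise ValueError("each message needs a 'role' and a list-valued 'content'")
        for item in message["content"]:
            kind = item.get("type")
            if kind == "text" and "text" not in item:
                raise ValueError("text items need a 'text' field")
            if kind == "image_url" and "url" not in item.get("image_url", {}):
                raise ValueError("image_url items need an 'image_url.url' field")


record = make_record("path/to/image.jpg", "Describe this image", "This image shows...")
check_record(record)

# Serialize to a JSON string; the record survives a round trip unchanged.
line = json.dumps(record)
```

Running a check like this over a few samples before training catches malformed entries early, before any images or videos are loaded.
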
Example data structure:

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "path/to/image.jpg"}},
        {"type": "text", "text": "Describe this image"}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "This image shows..."}
      ]
    }
  ]
}
```

### 2. Configure Training

Create a YAML configuration file for your model.

## Training Configuration (Example)

### Qwen2.5-VL Configuration

```yaml
- type: trainer
  config:
    trainer_type: fsdp2_trainer

    # Dataset configuration
    dataset_config:
      dataset_type: vision  # Or vision_audio for audio support
      dataset_format: yaml
      datasets:
        - path: "path/to/your/dataset.parquet"
          data_folder: ""
          data_type: parquet

      # Processor configuration
      processor_config:
        processor_name: "Qwen/Qwen2.5-VL-7B-Instruct"  # Or 3B/72B variants
        processor_type: "qwen2_5_vl"

      # Packing configuration
      packing: true
      packing_strategy: first_fit
      packing_length: 16384

      # Video configuration
      video_backend: qwen_vl_utils
      video_sampling_strategy: fps
      video_max_pixels: 50176  # 224 * 224
      video_max_frames: 512
      fps: 1

    # Model configuration
    model_config:
      load_from_pretrained_path: "Qwen/Qwen2.5-VL-7B-Instruct"
      attn_implementation: "flash_attention_2"

    # Training hyperparameters
    trainer_args:
      per_device_train_batch_size: 1
      learning_rate: 1.0e-06
      weight_decay: 0.0
      gradient_accumulation_steps: 1
      gradient_checkpointing: true
      num_train_epochs: 1
      save_steps: 100
      save_total_limit: 1
      report_to: "wandb"
      output_dir: "./output/qwen2_5_vl"
      warmup_ratio: 0.0
      run_name: "qwen2_5_vl_training"
      eval_strategy: "no"
      logging_steps: 1
      group_by_length: true
      dataloader_num_workers: 8
      bf16: true
      lr_scheduler_type: "cosine"

      # Optional: Freeze vision encoder
      freeze_modules: ["visual"]

      # Performance optimizations
      use_liger_kernel: true
      use_rmpad: true

      # FSDP2 configuration
      fsdp2: true
      fsdp_config:
        transformer_layer_cls_to_wrap: ["Qwen2_5_VLDecoderLayer"]
        reshard_after_forward: false

      # Optional: Sequence parallelism
      sp_ulysses_degree: 1  # Set to 2, 4, 8 for sequence parallel
```

### Qwen3-VL Configuration

```yaml
- type: trainer
  config:
    trainer_type: fsdp2_trainer

    # Dataset configuration
    dataset_config:
      dataset_type: qwen3_vl_iterable  # Use iterable dataset for Qwen3-VL
      dataset_format: yaml
      datasets:
        - path: "path/to/your/dataset.parquet"
          data_folder: ""
          data_type: parquet

      # Processor configuration
      processor_config:
        processor_name: "Qwen/Qwen3-VL-8B-Instruct"  # Or 4B variant
        processor_type: "qwen3_vl"

      # Packing configuration
      packing: false  # Packing is disabled for Qwen3-VL in this example
      packing_length: 51200
      filter_overlong: true

      # Video configuration - Qwen3-VL optimized
      video_backend: qwen_vl_utils
      video_sampling_strategy: fps
      video_max_pixels: 50176  # 224 * 224
      video_max_frames: 512
      fps: 1

    # Model configuration
    model_config:
      load_from_pretrained_path: "Qwen/Qwen3-VL-8B-Instruct"
      attn_implementation: "flash_attention_2"

    # Training hyperparameters
    trainer_args:
      per_device_train_batch_size: 1
      learning_rate: 2.0e-04  # Higher than the Qwen2.5-VL example (1.0e-06)
      weight_decay: 0.0
      gradient_accumulation_steps: 1
      gradient_checkpointing: true
      max_steps: 1000  # Use max_steps for iterable dataset
      save_steps: 1000
      save_total_limit: 1
      report_to: "wandb"
      output_dir: "./output/qwen3_vl"
      warmup_ratio: 0.1
      run_name: "qwen3_vl_training"
      eval_strategy: "no"
      logging_steps: 1
      dataloader_num_workers: 8
      bf16: true
      lr_scheduler_type: "cosine"

      # Performance optimizations
      use_liger_kernel: true
      use_rmpad: true

      # FSDP2 configuration
      fsdp2: true
      fsdp_config:
        transformer_layer_cls_to_wrap: ["Qwen3VLTextDecoderLayer"]
        reshard_after_forward: false

      # Optional: Sequence parallelism
      sp_ulysses_degree: 1
```

## Key Configuration Parameters

### Dataset Type (Example)

| Model | `dataset_type` | Description |
|-------|----------------|-------------|
| **Qwen2.5-VL** | `vision` | Map-style dataset, supports packing |
| **Qwen3-VL** | `qwen3_vl_iterable` | Streaming dataset optimized for Qwen3-VL |

### Processor Configuration

- **processor_name**: HuggingFace model identifier
  - Qwen2.5-VL:
    `Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen2.5-VL-7B-Instruct`, `Qwen/Qwen2.5-VL-72B-Instruct`
  - Qwen3-VL: `Qwen/Qwen3-VL-4B-Instruct`, `Qwen/Qwen3-VL-8B-Instruct`
- **processor_type**: Must match the model series
  - Qwen2.5-VL: `"qwen2_5_vl"`
  - Qwen3-VL: `"qwen3_vl"`

### FSDP2 Configuration

FSDP2 (Fully Sharded Data Parallel v2) is recommended for training large Qwen-VL models:

```yaml
fsdp2: true
fsdp_config:
  # Qwen2.5-VL (add "Qwen2_5_VLVisionBlock" to also wrap ViT layers)
  transformer_layer_cls_to_wrap: ["Qwen2_5_VLDecoderLayer"]
  # Qwen3-VL (add "Qwen3VLVisionBlock" to also wrap ViT layers)
  # transformer_layer_cls_to_wrap: ["Qwen3VLTextDecoderLayer"]
  reshard_after_forward: false  # If true, reshard parameters after each forward pass (saves memory but increases communication)
```

## Advanced Features

### Sequence Parallelism

Both Qwen2.5-VL and Qwen3-VL support Ulysses-style sequence parallelism for long context training:

```yaml
trainer_args:
  sp_ulysses_degree: 2  # Sequence parallel degree (1, 2, 4, 8)
```

**Benefits**:

- Enables training with longer sequences
- Reduces memory per GPU
- Scales efficiently across GPUs

**Requirements**:

- Flash Attention 2 must be installed
- `use_rmpad: true` recommended
- Number of attention heads must be divisible by `sp_ulysses_degree`

### Liger Kernel

[Liger Kernel](https://github.com/linkedin/Liger-Kernel) provides fused kernels for efficient training:

```yaml
trainer_args:
  use_liger_kernel: true
```

**Optimizations**:

- Fused CrossEntropy kernel (~30% memory reduction)
- Fused RMSNorm
- Fused RoPE
- Fused SwiGLU

### RMPad (Remove Padding)

RMPad removes padding tokens for more efficient computation:

```yaml
trainer_args:
  use_rmpad: true
```

**Benefits**:

- ~15-25% speedup by removing pad token computation
- Works seamlessly with Flash Attention 2
- Essential for packing efficiency

### Freezing Modules

Freeze the vision encoder for faster training when only fine-tuning language understanding:

```yaml
trainer_args:
  freeze_modules: ["visual"]
```

### Mixed Precision Training

-
  **bf16**: Recommended for stability and performance
- **fp16**: Alternative if bf16 is not supported

```yaml
trainer_args:
  bf16: true  # Preferred
  # fp16: true  # Alternative
```

### Gradient Checkpointing

Reduces memory at the cost of computation:

```yaml
trainer_args:
  gradient_checkpointing: true
```

## Run Training

### Launch Command

```bash
export NCCL_BLOCKING_WAIT=0
export TOKENIZERS_PARALLELISM=false

# Optional: HuggingFace setup
export HF_TOKEN=""  # Set to your HuggingFace access token
export HF_HOME="$HOME/.cache/huggingface"
export HF_HUB_ENABLE_HF_TRANSFER="1"
export NCCL_DEBUG=INFO

CONFIG="your_config.yaml"

torchrun --nproc_per_node=8 \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=127.0.0.1 \
  --master_port=8000 \
  -m lmms_engine.launch.cli config_yaml=${CONFIG}
```

### Multi-Node Training

```bash
# Set --master_addr to the address of node 0 on every node (left blank here).

# Node 0
torchrun --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=0 \
  --master_addr= \
  --master_port=8000 \
  -m lmms_engine.launch.cli config_yaml=${CONFIG}

# Node 1
torchrun --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=1 \
  --master_addr= \
  --master_port=8000 \
  -m lmms_engine.launch.cli config_yaml=${CONFIG} \
  hydra.output_subdir=null hydra/job_logging=disabled
```

In multi-node training, nodes that start at the same time can conflict over Hydra's [timestamped working directories](https://hydra.cc/docs/configure_hydra/workdir/). Pass `hydra.output_subdir=null` and `hydra/job_logging=disabled` to avoid this.

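
The launch flags and the trainer settings together determine how many sequences contribute to each optimizer step. The arithmetic can be sketched as follows. The helper names `effective_batch_size` and `heads_divisible` are hypothetical and not part of LMMS Engine. The sketch also assumes that the GPUs in one Ulysses sequence-parallel group work on the same sequences, so that the number of data-parallel replicas is the world size divided by `sp_ulysses_degree`. Verify that assumption against the engine before relying on the numbers.

```python
def effective_batch_size(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    nproc_per_node: int,
    nnodes: int,
    sp_ulysses_degree: int = 1,
) -> int:
    """Return the number of sequences consumed per optimizer step.

    Assumption (not confirmed by this guide): GPUs in one sequence-parallel
    group share the same sequences, so the data-parallel size is
    world_size / sp_ulysses_degree.
    """
    world_size = nproc_per_node * nnodes
    if sp_ulysses_degree < 1 or world_size % sp_ulysses_degree != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"sp_ulysses_degree={sp_ulysses_degree}"
        )
    data_parallel_size = world_size // sp_ulysses_degree
    return (
        per_device_train_batch_size
        * gradient_accumulation_steps
        * data_parallel_size
    )


def heads_divisible(num_attention_heads: int, sp_ulysses_degree: int) -> bool:
    """Check that attention heads split evenly across the sequence-parallel group."""
    return num_attention_heads % sp_ulysses_degree == 0


# Single node with 8 GPUs, batch size 1, no gradient accumulation.
single_node = effective_batch_size(1, 1, nproc_per_node=8, nnodes=1)

# Two nodes with 8 GPUs each, gradient accumulation 4, sequence parallel degree 2.
two_nodes_sp = effective_batch_size(
    1, 4, nproc_per_node=8, nnodes=2, sp_ulysses_degree=2
)

print(single_node, two_nodes_sp)  # prints: 8 32
```

With packing enabled, each "sequence" here is one packed sequence of up to `packing_length` tokens, so the number of original samples per step can be higher. For `heads_divisible`, read the head count from the model's own configuration instead of assuming a value.
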
## Model Architecture Details

### Qwen2.5-VL Architecture

**Core Components**:

- **Language Model**: Qwen2.5 decoder architecture (e.g., 3B/7B/72B variants)
- **Vision Encoder**: ViT-based encoder with dynamic resolution support
- **Position Encoding**: **M-RoPE (Multimodal Rotary Position Embedding)**
  - Separate position encodings for temporal (T), height (H), width (W) dimensions
  - Enables better alignment of visual tokens with text sequences
  - Uses `mrope_section` parameter to split RoPE across 3 dimensions
  - Computed via `apply_multimodal_rotary_pos_emb` with RoPE deltas
- **Video Processing**:
  - Temporal-aware processing using RoPE deltas
  - Supports temporal grid (T, H, W) for video frames
  - Native video token integration in language model
- **Context Length**: Up to 128K tokens
- **Modality Support**: Image, Video, and optional Audio (via audio encoder)

**Key Features**:

- Dynamic resolution ViT allows variable image sizes
- M-RoPE provides fine-grained spatial-temporal position encoding
- Unified multimodal token processing in language model

### Qwen3-VL Architecture

**Core Components**:

- **Language Model**: Qwen3 decoder architecture (e.g., 4B/8B variants) with efficiency improvements
- **Vision Encoder**: Enhanced ViT with multi-layer feature extraction
- **Position Encoding**: **Interleaved-MRoPE**
  - Improved version of M-RoPE with better text-timestamp alignment
  - Optimized for long video processing with second-level indexing
  - Enhanced temporal understanding for video sequences
- **DeepStack Feature** (Unique to Qwen3-VL):
  - Extracts visual features from multiple vision encoder layers
  - Fuses multi-layer visual embeddings into language model's early layers
  - Provides fine-grained visual-language alignment
  - Reference: [DeepStack Paper](https://arxiv.org/abs/2406.04334)
- **Video Processing**:
  - Optimized for long videos (supports >1 hour)
  - Second-level timestamp alignment with text
  - Enhanced temporal reasoning capabilities
- **Context Length**:
  Native support for 256K tokens, extendable to 1M tokens
- **Modality Support**: Image and Video (optimized for long-form video understanding)

**Key Features**:

- DeepStack multi-layer visual feature fusion
- Interleaved-MRoPE for superior temporal alignment
- Extended context length for long videos and documents
- Improved efficiency in video token processing

### Architecture Comparison

| Feature | Qwen2.5-VL | Qwen3-VL |
|---------|-----------|----------|
| **Position Encoding** | M-RoPE (3D: T, H, W) | Interleaved-MRoPE |
| **Visual Feature Fusion** | Single-layer fusion | DeepStack multi-layer fusion |
| **Video Temporal Alignment** | RoPE deltas | Second-level timestamp alignment |
| **Context Length** | 128K tokens | 256K-1M tokens |
| **Long Video Support** | Good | Excellent (>1 hour) |
| **Model Sizes** | 3B, 7B, 72B | 4B, 8B |
| **Primary Use Case** | General multimodal | Long-form video & document understanding |

### Model Selection Guide

**Choose Qwen2.5-VL if you:**

- Need audio understanding capabilities
- Want larger model options (72B for best performance)
- Require general-purpose multimodal understanding
- Work with images, short-medium videos, and audio
- Need mature, well-tested architecture

**Choose Qwen3-VL if you:**

- Focus on long video understanding (>1 hour)
- Need extended context length (>128K tokens)
- Require fine-grained visual-language alignment (DeepStack)
- Work primarily with video analysis and temporal reasoning
- Want improved efficiency with smaller model sizes
- Need second-level timestamp alignment for videos

**Performance Considerations**:

- **Qwen2.5-VL 7B**: Balanced choice for most multimodal tasks
- **Qwen2.5-VL 72B**: Best performance, requires significant compute
- **Qwen3-VL 8B**: Optimal for long video understanding with moderate compute
- **Qwen3-VL 4B**: Efficient choice for video tasks with limited resources

## Troubleshooting

### Common Issues

#### 1.
Out of Memory (OOM)

**Solutions**:

- Reduce `per_device_train_batch_size`
- Enable `gradient_checkpointing: true`
- Reduce `video_max_pixels` or `video_max_frames`
- Increase `gradient_accumulation_steps` to keep the effective batch size after lowering the per-device batch size
- Enable sequence parallelism with `sp_ulysses_degree: 2`

#### 2. Flash Attention Installation Issues

**Problem**: Symbol not found or compilation errors

**Solution**:

```bash
# Clear cache and reinstall
pip uninstall flash-attn -y
uv pip install --no-build-isolation --no-cache-dir flash-attn
```

#### 3. Slow Training Speed

**Optimizations**:

- Enable `use_liger_kernel: true`
- Enable `use_rmpad: true`
- Enable `group_by_length: true` for better batching
- Increase `dataloader_num_workers`
- Use `bf16` instead of `fp16`
- Enable packing for Qwen2.5-VL: `packing: true`

#### 4. Video Loading Errors

**Problem**: Video cannot be loaded or processed

**Solutions**:

- Ensure `qwen-vl-utils` is installed: `pip install qwen-vl-utils`
- Check video file format compatibility
- Reduce `video_max_frames` if videos are too long
- Verify `video_backend: qwen_vl_utils` is set

#### 5. Qwen3-VL Dataset Length Unknown

**Problem**: Steps per epoch cannot be computed for an iterable dataset, because its length is unknown

**Solution**: Always use `max_steps` instead of `num_train_epochs`:

```yaml
trainer_args:
  max_steps: 1000  # Required for iterable datasets
  # num_train_epochs: 1  # Use with map-style datasets instead
```

## Performance Tips

### Optimizing Training Speed

1. **Use appropriate batch size**:
   - Start with `per_device_train_batch_size: 1`
   - Increase `gradient_accumulation_steps` to simulate larger batches

2. **Enable all optimizations**:

   ```yaml
   use_liger_kernel: true
   use_rmpad: true
   group_by_length: true
   bf16: true
   ```

3. **Video preprocessing**:
   - Use lower `fps` for faster loading (e.g., `fps: 0.5` for 1 frame per 2 seconds)
   - Reduce `video_max_frames` if the full video is not needed

4.
   **Sequence parallelism for long sequences**:
   - Set `sp_ulysses_degree: 2` or higher for sequences > 32K tokens

### Memory Management

1. **Estimate memory usage**:
   - 7B model with batch_size=1: ~40GB
   - 72B model with batch_size=1: ~150GB

2. **Reduce memory footprint**:
   - Enable gradient checkpointing
   - Use FSDP2 for multi-GPU training
   - Freeze visual encoder if only training language understanding

## Best Practices

1. **Start with pretrained models**: Always use official Qwen checkpoints from HuggingFace
2. **Use BF16 training**: More stable than FP16 for these models
3. **Enable packing for Qwen2.5-VL**: Significantly improves throughput
4. **Monitor training metrics**: Use WandB or TensorBoard for tracking
5. **Save checkpoints frequently**: Set reasonable `save_steps` values
6. **Test with small dataset first**: Verify configuration before full training

## Model Variants

### Qwen2.5-VL

| Model | Parameters | Context Length | Recommended Use |
|-------|-----------|----------------|-----------------|
| Qwen2.5-VL-3B-Instruct | 3B | 128K | Fast inference, limited resources |
| Qwen2.5-VL-7B-Instruct | 7B | 128K | Balanced performance and efficiency |
| Qwen2.5-VL-72B-Instruct | 72B | 128K | Best performance, requires significant resources |

### Qwen3-VL

| Model | Parameters | Context Length | Recommended Use |
|-------|-----------|----------------|-----------------|
| Qwen3-VL-4B-Instruct | 4B | 256K (extendable to 1M) | Efficient training and inference |
| Qwen3-VL-8B-Instruct | 8B | 256K (extendable to 1M) | Enhanced performance with DeepStack |

## Additional Resources

### Official Documentation

- [Qwen2.5-VL Blog](https://qwenlm.github.io/blog/qwen2-vl/)
- [Qwen3-VL Announcement](https://qwenlm.github.io/)
- [Qwen2.5-VL HuggingFace](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- [Qwen3-VL HuggingFace](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)

### Technical Papers

- [DeepStack: Multi-Layer Visual Feature Fusion](https://arxiv.org/abs/2406.04334) - The paper behind Qwen3-VL's
  unique architecture
- [M-RoPE: Multimodal Rotary Position Embedding](https://arxiv.org/abs/2308.10882) - Position encoding for multimodal models

### LMMS Engine Guides

- [Data Preparation Guide](../user_guide/data_prep.md)
- [Dataset Configuration](../user_guide/datasets.md)
- [Video Configuration Reference](../reference/video_configuration.md)
- [Design Principles](../reference/design_principle.md)

### Community Resources

- [LMMS Engine GitHub](https://github.com/EvolvingLMMs-Lab/lmms-engine)
- [Qwen GitHub](https://github.com/QwenLM/Qwen2-VL)