Video Configuration Guide

This guide explains the video processing configuration options available in LMMs Engine and provides migration instructions for users upgrading from older versions.

Video Configuration Parameters

The following parameters can be configured in the dataset_config section of your training configuration:

Basic Video Parameters

video_backend (Optional[str], default: “qwen_vl_utils”)
- Specifies the backend to use for video loading
- Available options: "decord", "qwen_vl_utils", "qwen_omni_utils"
- Note: The "torchvision" backend has been removed. See Migration Guide below.
video_sampling_strategy (Optional[str], default: “fps”)
- Determines how frames are sampled from videos
- Options:
  - "fps": Sample frames based on frames per second
  - "frame_num": Sample a fixed number of frames

Frame Sampling Parameters

fps (Optional[int], default: 1)
- Frames per second to sample when using video_sampling_strategy: "fps"
- Must be a positive integer
frame_num (Optional[int], default: 64)
- Number of frames to sample when using video_sampling_strategy: "frame_num"
- Must be a positive integer

Video Size Limits

video_max_pixels (Optional[int], default: 768 * 28 * 28)
- Maximum number of pixels per video frame
- Helps control memory usage during training
- Must be a positive integer
video_max_frames (Optional[int], default: 768)
- Maximum number of frames to load from a video
- Prevents loading excessively long videos
- Must be a positive integer

Filtering Options

filter_overlong (Optional[bool], default: True)
- When packing is enabled, filter out samples that exceed packing_length
- Set to False to keep all samples regardless of length

Example Configuration

dataset_config:
  dataset_type: "vision"
  dataset_format: "json"
  
  # Video configuration
  video_backend: "qwen_vl_utils"
  video_sampling_strategy: "fps"
  fps: 2
  video_max_pixels: 602112  # 768 * 28 * 28
  video_max_frames: 512
  
  # Packing configuration
  packing: true
  packing_length: 32000
  filter_overlong: true

Processor Configuration for Video

When using the Qwen2.5-VL processor, you can also configure video-specific parameters through extra_kwargs:

processor_config:
  processor_name: "Qwen/Qwen2.5-VL-Instruct"
  processor_type: "qwen2_5_vl"
  extra_kwargs:
    video_max_pixels: 602112
    video_min_pixels: 28800

Migration from Torchvision Backend

The torchvision video backend has been removed since it was implemented as a fallback in qwen-vl-utils

Migration Steps

Update your configuration file:

# Old configuration
video_backend: "torchvision"

# New configuration (recommended)
video_backend: "qwen_vl_utils"

Install the new backend:

# For decord backend
uv pip install decord

# For qwen_vl_utils backend
uv pip install qwen-vl-utils

Verify compatibility:
- decord naive decord video loading, used in load from cloud storage
- qwen_vl_utils is optimized for Qwen models and provides additional features
- qwen_omni_utils supports audio extraction from videos for Qwen Omni variants

Training Performance Optimization

Memory Management

The torch_empty_cache_steps parameter in the trainer configuration helps manage GPU memory:

# Clear CUDA cache every 100 steps
torch_empty_cache_steps: 100

This periodically clears the CUDA memory cache to prevent fragmentation during long training runs.

Troubleshooting

Common Issues

Video loading failures:
- Check that the video file exists and is readable
- Verify the video backend is properly installed
- Review error logs for specific failure reasons
Out of memory errors:
- Reduce video_max_pixels or video_max_frames
- Enable filter_overlong to skip oversized samples
- Use torch_empty_cache_steps to clear memory periodically
Validation errors:
- Ensure all numeric parameters are positive integers
- Check that video_backend is one of the supported options

Best Practices

Start with conservative limits: Begin with smaller video_max_pixels and video_max_frames values, then increase as needed.
Monitor memory usage: Use tools like nvidia-smi to track GPU memory during training.
Choose the right backend:
- We recommend to use qwen_vl_utils as it has much more features to config video loading
- If you want to load from the cloud storage, decord is the only option now and is currently not configurable for video options
Optimize sampling strategy:
- Use "fps" for videos with consistent motion
- Use "frame_num" when you need exactly N frames regardless of video length
Audio extraction from videos:
- Use video_backend: "qwen_omni_utils" with use_audio_in_video: true in processor config to extract audio from video files

Audio from Video Extraction

When training Qwen Omni models, you can extract audio tracks from video files automatically.

Configuration

dataset_config:
  dataset_type: vision_audio
  video_backend: "qwen_omni_utils"
  video_sampling_strategy: "fps"
  fps: 1
  video_max_frames: 60

  processor_config:
    processor_name: "Qwen/Qwen2.5-Omni-7B"
    processor_type: "Qwen2_5OmniProcessor"
    extra_kwargs:
      use_audio_in_video: true
      audio_max_length: 60