Video Configuration Guide
This guide explains the video processing configuration options available in LMMs Engine and provides migration instructions for users upgrading from older versions.
Video Configuration Parameters
The following parameters can be configured in the dataset_config section of your training configuration:
Basic Video Parameters
video_backend (Optional[str], default: "qwen_vl_utils")
Specifies the backend to use for video loading
Available options: "decord", "qwen_vl_utils", "qwen_omni_utils"
Note: The "torchvision" backend has been removed. See Migration Guide below.
video_sampling_strategy (Optional[str], default: "fps")
Determines how frames are sampled from videos
Options:
"fps": Sample frames based on frames per second
"frame_num": Sample a fixed number of frames
Frame Sampling Parameters
fps (Optional[int], default: 1)
Frames per second to sample when using video_sampling_strategy: "fps"
Must be a positive integer
frame_num (Optional[int], default: 64)
Number of frames to sample when using video_sampling_strategy: "frame_num"
Must be a positive integer
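To make the difference between the two strategies concrete, the sketch below picks evenly spaced frame indices under each one. It is only an illustration: `sample_frame_indices`, `total_frames`, and `native_fps` are hypothetical names, and the actual backends may select frames differently.

```python
def sample_frame_indices(total_frames, native_fps, strategy="fps", fps=1, frame_num=64):
    """Return evenly spaced frame indices (illustrative, not the backend's logic)."""
    if total_frames <= 0 or native_fps <= 0:
        raise ValueError("total_frames and native_fps must be positive")
    if strategy == "fps":
        # Number of frames grows with video duration.
        duration_seconds = total_frames / native_fps
        wanted = max(1, round(duration_seconds * fps))
    elif strategy == "frame_num":
        # Number of frames is fixed regardless of duration.
        wanted = frame_num
    else:
        raise ValueError(f"unknown video_sampling_strategy: {strategy!r}")
    # Never request more frames than the video contains.
    wanted = min(wanted, total_frames)
    step = total_frames / wanted
    return [int(i * step) for i in range(wanted)]


# A 10-second clip recorded at 30 frames per second.
by_rate = sample_frame_indices(300, 30.0, strategy="fps", fps=2)
by_count = sample_frame_indices(300, 30.0, strategy="frame_num", frame_num=64)
print(len(by_rate), len(by_count))
```

With "fps", a longer video yields more frames; with "frame_num", the count stays the same and only the spacing between frames changes.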
Video Size Limits
video_max_pixels (Optional[int], default: 768 * 28 * 28)
Maximum number of pixels per video frame
Helps control memory usage during training
Must be a positive integer
video_max_frames (Optional[int], default: 768)
Maximum number of frames to load from a video
Prevents loading excessively long videos
Must be a positive integer
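As a rough way to reason about the pixel limit, the sketch below scales a frame down, preserving aspect ratio, until it fits within the budget. `fit_within_pixel_budget` is a hypothetical helper and not the resize logic of any backend; real backends may additionally round dimensions to a patch size.

```python
import math

DEFAULT_VIDEO_MAX_PIXELS = 768 * 28 * 28  # 602112, the documented default


def fit_within_pixel_budget(width, height, max_pixels=DEFAULT_VIDEO_MAX_PIXELS):
    """Return (width, height) scaled down so that width * height <= max_pixels."""
    if width <= 0 or height <= 0 or max_pixels <= 0:
        raise ValueError("width, height and max_pixels must be positive")
    if width * height <= max_pixels:
        return width, height  # already within budget
    # Scale both sides by the same factor to preserve the aspect ratio.
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


w, h = fit_within_pixel_budget(1920, 1080)
print(w, h, w * h <= DEFAULT_VIDEO_MAX_PIXELS)
```

Lowering video_max_pixels shrinks every frame, while lowering video_max_frames reduces how many frames are kept, so the two limits bound memory along different axes.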
Filtering Options
filter_overlong (Optional[bool], default: True)
When packing is enabled, filter out samples that exceed packing_length
Set to False to keep all samples regardless of length
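The filtering rule can be sketched as follows. This is a minimal illustration that assumes each sample's token length is already known; `filter_overlong_samples` and `sample_lengths` are hypothetical names rather than part of the engine.

```python
def filter_overlong_samples(sample_lengths, packing_length, filter_overlong=True):
    """Return the indices of samples that would be kept."""
    if not filter_overlong:
        # Keep everything, regardless of length.
        return list(range(len(sample_lengths)))
    # A sample is dropped only when it exceeds packing_length.
    return [i for i, length in enumerate(sample_lengths) if length <= packing_length]


lengths = [1000, 40000, 32000, 32001]
print(filter_overlong_samples(lengths, packing_length=32000))
print(filter_overlong_samples(lengths, packing_length=32000, filter_overlong=False))
```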
Example Configuration
dataset_config:
  dataset_type: "vision"
  dataset_format: "json"
  # Video configuration
  video_backend: "qwen_vl_utils"
  video_sampling_strategy: "fps"
  fps: 2
  video_max_pixels: 602112 # 768 * 28 * 28
  video_max_frames: 512
  # Packing configuration
  packing: true
  packing_length: 32000
  filter_overlong: true
Processor Configuration for Video
When using the Qwen2.5-VL processor, you can also configure video-specific parameters through extra_kwargs:
processor_config:
  processor_name: "Qwen/Qwen2.5-VL-Instruct"
  processor_type: "qwen2_5_vl"
  extra_kwargs:
    video_max_pixels: 602112
    video_min_pixels: 28800
Migration from Torchvision Backend
The torchvision video backend has been removed as a standalone option because it is already implemented as a fallback in qwen-vl-utils.
Migration Steps
Update your configuration file:
# Old configuration
video_backend: "torchvision"
# New configuration (recommended)
video_backend: "qwen_vl_utils"
Install the new backend:
# For decord backend
uv pip install decord
# For qwen_vl_utils backend
uv pip install qwen-vl-utils
Verify compatibility:
decord provides plain decord video loading and is used when loading from cloud storage
qwen_vl_utils is optimized for Qwen models and provides additional features
qwen_omni_utils supports audio extraction from videos for Qwen Omni variants
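If you maintain many configuration files, the first migration step can be automated once the configuration is loaded into a dictionary. The sketch below assumes that layout; `migrate_video_backend` is a hypothetical helper, not a utility shipped with the engine.

```python
import copy


def migrate_video_backend(config, replacement="qwen_vl_utils"):
    """Return a copy of config with the removed "torchvision" backend replaced."""
    migrated = copy.deepcopy(config)  # leave the caller's config untouched
    dataset_config = migrated.get("dataset_config", {})
    if dataset_config.get("video_backend") == "torchvision":
        dataset_config["video_backend"] = replacement
    return migrated


old_config = {"dataset_config": {"video_backend": "torchvision", "fps": 1}}
new_config = migrate_video_backend(old_config)
print(new_config["dataset_config"]["video_backend"])
```

Configurations that already use a supported backend pass through unchanged.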
Training Performance Optimization
Memory Management
The torch_empty_cache_steps parameter in the trainer configuration helps manage GPU memory:
# Clear CUDA cache every 100 steps
torch_empty_cache_steps: 100
This periodically clears the CUDA memory cache to prevent fragmentation during long training runs.
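Conceptually, the setting behaves like a counter inside the training loop. The sketch below illustrates that idea only and is not the trainer's implementation; `run_training_steps` is a hypothetical function, and the cache-clearing callable is passed in so the example runs without a GPU.

```python
def run_training_steps(num_steps, empty_cache_steps, empty_cache):
    """Call empty_cache() after every `empty_cache_steps` completed steps."""
    cleared_at = []
    for step in range(1, num_steps + 1):
        # A real loop would run the forward pass, backward pass,
        # and optimizer update here.
        if empty_cache_steps and step % empty_cache_steps == 0:
            empty_cache()
            cleared_at.append(step)
    return cleared_at


calls = []
print(run_training_steps(250, 100, lambda: calls.append("cleared")))
```

In an actual PyTorch run the callable would typically be `torch.cuda.empty_cache`. Clearing the cache has a cost, so very small values may slow training.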
Troubleshooting
Common Issues
Video loading failures:
Check that the video file exists and is readable
Verify the video backend is properly installed
Review error logs for specific failure reasons
Out of memory errors:
Reduce video_max_pixels or video_max_frames
Enable filter_overlong to skip oversized samples
Use torch_empty_cache_steps to clear memory periodically
Validation errors:
Ensure all numeric parameters are positive integers
Check that video_backend is one of the supported options
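Several of the checks above can be run before training starts. The sketch below is one possible pre-flight check, not the engine's own validation. `check_video_config` is a hypothetical helper, and it assumes that each backend's importable module name matches its configuration value.

```python
import importlib.util

SUPPORTED_BACKENDS = ("decord", "qwen_vl_utils", "qwen_omni_utils")
POSITIVE_INT_KEYS = ("fps", "frame_num", "video_max_pixels", "video_max_frames")


def check_video_config(dataset_config, check_installed=False):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    backend = dataset_config.get("video_backend", "qwen_vl_utils")
    if backend not in SUPPORTED_BACKENDS:
        problems.append(f"unsupported video_backend: {backend!r}")
    elif check_installed and importlib.util.find_spec(backend) is None:
        # Assumption: the module name equals the backend name.
        problems.append(f"backend {backend!r} does not appear to be installed")
    strategy = dataset_config.get("video_sampling_strategy", "fps")
    if strategy not in ("fps", "frame_num"):
        problems.append(f"unsupported video_sampling_strategy: {strategy!r}")
    for key in POSITIVE_INT_KEYS:
        value = dataset_config.get(key)
        if value is None:
            continue  # unset keys fall back to their defaults
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer, got {value!r}")
    return problems


for problem in check_video_config({"video_backend": "torchvision", "fps": 0}):
    print(problem)
```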
Best Practices
Start with conservative limits: Begin with smaller video_max_pixels and video_max_frames values, then increase as needed.
Monitor memory usage: Use tools like nvidia-smi to track GPU memory during training.
Choose the right backend:
We recommend using qwen_vl_utils, as it offers many more options for configuring video loading
If you want to load from cloud storage, decord is currently the only option, and its video options are not configurable
Optimize sampling strategy:
Use "fps" for videos with consistent motion
Use "frame_num" when you need exactly N frames regardless of video length
Audio extraction from videos:
Use video_backend: "qwen_omni_utils" with use_audio_in_video: true in processor config to extract audio from video files
Audio from Video Extraction
When training Qwen Omni models, you can extract audio tracks from video files automatically.
Configuration
dataset_config:
  dataset_type: vision_audio
  video_backend: "qwen_omni_utils"
  video_sampling_strategy: "fps"
  fps: 1
  video_max_frames: 60
processor_config:
  processor_name: "Qwen/Qwen2.5-Omni-7B"
  processor_type: "Qwen2_5OmniProcessor"
  extra_kwargs:
    use_audio_in_video: true
    audio_max_length: 60