BAGEL Model Training Guide

BAGEL is a multimodal model that combines visual understanding and generation capabilities. This guide covers how to train BAGEL models using the LMMS Engine.

Overview

BAGEL integrates:

Language Model: Qwen2-based architecture with MoT (Mixture of Tokens) support
Vision Understanding: SigLIP vision transformer for image comprehension
Visual Generation: VAE (Variational Autoencoder) for image generation
Unified Training: Joint training for both understanding and generation tasks

Prerequisites

LMMS Engine installation
CUDA-compatible GPU with sufficient memory
PyTorch with FSDP2 support

QuickStart

1. Prepare Your Dataset

An example dataset is available on https://huggingface.co/datasets/kcz358/bagel-example

2. Overwrite config or load from converted weight

Since the original Bagel config is not hf compatible and can’t be used in AutoConfig, we prepare a config here https://huggingface.co/kcz358/bagel_hf/blob/main/config.json. Please overwrite this json file to the original config.json in the Bagel model path

You can also load from https://huggingface.co/lmms-lab/BAGEL-7B-MoT-ver.LE where we converted bagel into hf safetensor weight. The weight are the same with the original weights.

3. Configure Training

Create a YAML configuration file based on the template above, adjusting:

Dataset paths and format
Model checkpoint path
Training hyperparameters
Output directory

Basic Training Configuration

- type: trainer
  config:
    trainer_type: fsdp2_trainer
    
    # Dataset configuration
    dataset_config:
      dataset_type: bagel_iterable
      dataset_format: parquet  # Supports: parquet, arrow, json, jsonl, yaml
      
      datasets:
        - path: "parquet path"  # Hugging Face dataset ID
          data_folder: ""
          data_type: parquet
      
      # Processor configuration
      processor_config:
        processor_name: "your-model-checkpoint-path"
        processor_type: "bagel"
      
      # Packing configuration (recommended)
      packing: true
      packing_strategy: first_fit
      packing_length: 4096
      video_backend: qwen_vl_utils
    
    # Model configuration
    model_config:
      load_from_pretrained_path: "your-model-checkpoint-path"
      attn_implementation: "eager"  # or "sdpa"
      extra_kwargs:
        visual_und: false  # Enable/disable visual understanding
      # Optional: Enable Native Sparse Attention
      # monkey_patch_kwargs:
      #   patch_type: ["nsa"]
    
    # Training hyperparameters
    per_device_train_batch_size: 1
    learning_rate: 1.0e-06
    weight_decay: 0.0
    gradient_accumulation_steps: 1
    gradient_checkpointing: true
    max_steps: 1000
    save_steps: 500
    output_dir: "./output/bagel-training"
    bf16: true
    
    # FSDP2 configuration
    fsdp2: true
    fsdp_config:
      transformer_layer_cls_to_wrap: ["Qwen2MoTDecoderLayer"]
      reshard_after_forward: false

Dataset Format

BAGEL expects datasets with the following structure:

Required Fields

messages: Conversation format with roles (system, user, assistant)

Example Dataset Entry

{
  "messages": [
    {
      "role": "user", 
      "content": [
        { "type": "image_url", "image_url": { "url": "path/to/image.jpg" } },
        { "type": "text", "text": "Describe this image" }
      ]
    },
    {
      "role": "assistant",
      "content": [
        { "type": "text", "text": "This image shows..." }
      ]
    }
  ]
}

Key Configuration Options

To overwrite the arguments in Bagel training, we suggest the user to use the extra_kwargs in the config and get the parameters from that. You can check how it is being used in src/lmms_engine/datasets/processor/bagel_processor.py and src/lmms_engine/models/bagel/bagel.py in the from_pretrained method

FSDP2 Configuration

FSDP2 (Fully Sharded Data Parallel v2) is recommended for training large BAGEL models:

fsdp2: true
fsdp_config:
  transformer_layer_cls_to_wrap: ["Qwen2MoTDecoderLayer"]
  reshard_after_forward: false

Advanced Features

Native Sparse Attention (NSA) Support

We supports Native Sparse Attention training on BAGEL through monkey patching to improve memory efficiency and training speed for long sequences. NSA replaces the standard attention mechanism with a sparse variant that reduces computational complexity.

Prerequisites

Install the native sparse attention library:

pip install git+https://github.com/XunhaoLai/native-sparse-attention-triton.git

Configuration

Enable NSA by adding the monkey patch configuration to your model config:

model_config:
  load_from_pretrained_path: "your-model-checkpoint-path"
  attn_implementation: "eager/sdpa"
  extra_kwargs:
    visual_und: false
  monkey_patch_kwargs:
    patch_type: ["nsa"]
    # NSA configuration parameters (all optional with defaults shown)
    block_size: 64              # Size of attention blocks
    compress_type: "weightedpool"  # Options: weightedpool, linear, avgpool
    kernel_size: 32             # Compression kernel size
    kernel_stride: 16           # Compression kernel stride
    topk: 16                    # Number of top-k blocks to keep
    init_blocks: 1              # Number of initial blocks to always include
    local_blocks: 2             # Number of local blocks around current position
    window_size: 512            # Local attention window size

NSA Parameters

We recommend you to find out the meaning for parameters from here

Usage Notes

NSA is most beneficial for longer sequence
The sparse attention pattern is learned during training and adapts to the data
All NSA parameters can be tuned based on your specific use case and hardware constraints

Sequence Packing

BAGEL supports efficient sequence packing to maximize GPU utilization:

first_fit: Pack sequences to minimize padding
Configurable packing_length for optimal memory usage

Mixed Precision Training

bf16: Recommended for stability and performance
Automatic loss scaling for gradient stability

Gradient Checkpointing

Reduces memory usage at the cost of computation
Essential for training large models

Model Architecture Details

Components

Language Model: Qwen2 architecture with MoT extensions
Vision Encoder: SigLIP for image understanding
VAE: Autoencoder for image generation
Connectors: MLPs bridging different modalities

Training Objectives

Cross-entropy loss for language modeling
MSE loss for visual generation
Configurable loss weighting