Merging FSDP Checkpoints

LMMs Engine provides multiple ways to merge Fully Sharded Data Parallel (FSDP) model checkpoints into a single consolidated checkpoint. This is particularly useful after training large models in a distributed setup, where the weights are saved as per-rank shards.

Legacy: Using merge_fsdp.py Tool

The merge_fsdp.py script is a legacy utility. It can still be used, but we recommend the built-in merger instead.

python tools/merge_fsdp.py --input_dir <path_to_checkpoints> --model_name_or_class <model_name> --type <hf|fsdp2> [--output_dir <output_path>] [--step <checkpoint_step>] [--state_dict_dirname <dirname>] [--merge]

Arguments:

  • --input_dir: Directory containing the FSDP shards to merge

  • --model_name_or_class: The name or class of the model to load

  • --type: Type of checkpoint (hf or fsdp2)

  • --output_dir (optional): Directory to save the merged checkpoint

  • --step (optional): Specific checkpoint step to merge

  • --state_dict_dirname (optional): Subfolder name containing shards

  • --merge (optional): Merge all checkpoints by averaging weights
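To make the --merge option concrete, here is a minimal sketch of what averaging weights across checkpoints can mean. It is an illustration only, not the script's actual implementation: the function name average_state_dicts is hypothetical, and plain Python lists stand in for tensors so the example runs without any dependencies.

```python
def average_state_dicts(state_dicts):
    """Return the element-wise mean of several checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats (a stand-in for a tensor in this sketch).
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint to average")

    # Every checkpoint must describe the same set of parameters.
    keys = set(state_dicts[0])
    for sd in state_dicts[1:]:
        if set(sd) != keys:
            raise ValueError("checkpoints do not share the same parameter names")

    count = len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        # zip(*...) groups the i-th element of this parameter across checkpoints.
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


checkpoint_a = {"layer.weight": [1.0, 3.0]}
checkpoint_b = {"layer.weight": [3.0, 5.0]}
print(average_state_dicts([checkpoint_a, checkpoint_b]))
# {'layer.weight': [2.0, 4.0]}
```

Averaging only makes sense when all checkpoints share the same parameter names and shapes, which is why the sketch rejects mismatched inputs instead of silently skipping keys.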

Examples:

# Hugging Face FSDP checkpoints
python tools/merge_fsdp.py --input_dir ./checkpoints --model_name_or_class Qwen/Qwen2.5-VL-7B-Instruct --type hf --output_dir ./merged_checkpoint

# FSDP version 2 checkpoints
python tools/merge_fsdp.py --input_dir ./checkpoints --type fsdp2

# EMA checkpoints
python tools/merge_fsdp.py --input_dir ./checkpoints --type fsdp2 --state_dict_dirname pytorch_ema_model_fsdp_0
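If you call the script from Python (for example in a post-training job), a small wrapper can assemble the command from the flags documented above. This is a sketch under assumptions: build_merge_command is a hypothetical helper, it only uses flags listed in this section, and it treats --model_name_or_class as optional because the FSDP2 examples omit it.

```python
import subprocess


def build_merge_command(input_dir, ckpt_type, model_name_or_class=None,
                        output_dir=None, step=None, state_dict_dirname=None,
                        merge=False):
    """Build the argv list for tools/merge_fsdp.py from the documented flags."""
    if ckpt_type not in ("hf", "fsdp2"):
        raise ValueError("--type must be 'hf' or 'fsdp2'")

    cmd = ["python", "tools/merge_fsdp.py",
           "--input_dir", str(input_dir), "--type", ckpt_type]

    # Optional flags are only added when a value was provided.
    optional = {
        "--model_name_or_class": model_name_or_class,
        "--output_dir": output_dir,
        "--step": step,
        "--state_dict_dirname": state_dict_dirname,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    if merge:
        cmd.append("--merge")
    return cmd


cmd = build_merge_command("./checkpoints", "fsdp2",
                          state_dict_dirname="pytorch_ema_model_fsdp_0")
print(" ".join(cmd))
# To actually run the merge from the repository root:
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.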

Prerequisites

  • Ensure you have Python installed along with the required dependencies

  • Make sure the FSDP checkpoints are available in the specified directory

  • For FSDP2 checkpoints, the checkpoint directory should contain:

    • pytorch_model_fsdp_0/ for regular checkpoints

    • pytorch_ema_model_fsdp_0/ for EMA checkpoints (if EMA is enabled)
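Before merging, it can help to confirm that the expected shard subfolders are present. The check below is a minimal sketch: find_shard_dirs is a hypothetical helper, and it assumes the two subfolder names listed above sit directly inside the directory you pass in.

```python
from pathlib import Path

REGULAR_DIRNAME = "pytorch_model_fsdp_0"
EMA_DIRNAME = "pytorch_ema_model_fsdp_0"


def find_shard_dirs(checkpoint_dir):
    """Report which of the documented shard subfolders exist."""
    root = Path(checkpoint_dir)
    return {
        "regular": (root / REGULAR_DIRNAME).is_dir(),
        "ema": (root / EMA_DIRNAME).is_dir(),
    }


found = find_shard_dirs("./checkpoints")
if not found["regular"]:
    print(f"{REGULAR_DIRNAME}/ not found; check the checkpoint path")
if not found["ema"]:
    print(f"{EMA_DIRNAME}/ not found; expected only if EMA was enabled")
```

A missing EMA folder is not an error on its own, since that folder only exists when EMA is enabled during training.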

Evaluation

Manual Evaluation

After merging the checkpoints, you can evaluate the model using the lmms-eval tool. Refer to the lmms-eval repository for detailed instructions on setting up and running evaluations.

Automatic Evaluation During Training

LMMs Engine also supports asynchronous evaluation during training, which automatically merges FSDP2 checkpoints and evaluates them without interrupting training. See Asynchronous Checkpoint Evaluation for details.

How Automatic Merging Works

When using asynchronous evaluation, the system:

  1. Detects FSDP2 Checkpoints: Automatically identifies FSDP2-sharded checkpoints during training

  2. Merges Before Evaluation: The LMMS-Eval server handles merging of FSDP2 checkpoints using lmms_engine_kwargs

  3. Evaluates Merged Checkpoints: Runs evaluation on the merged checkpoint

  4. Returns Results: Evaluation results are polled and logged back to your training run
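The polling in step 4 can be pictured with the loop below. This is a sketch only, not the trainer's implementation: poll_until_done, the fetch_status callable, and the "state" and "results" keys are all hypothetical, and the real server API may look different. The poll_interval argument mirrors the poll_interval setting in eval_config.

```python
import time


def poll_until_done(fetch_status, poll_interval=10.0, max_polls=None,
                    sleep=time.sleep):
    """Call fetch_status() until it reports completion, then return results.

    fetch_status is any callable returning a dict such as
    {"state": "pending"} or {"state": "done", "results": {...}}.
    Returns None if max_polls is reached first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        status = fetch_status()
        if status.get("state") == "done":
            return status.get("results")
        polls += 1
        sleep(poll_interval)  # wait before asking the server again
    return None


# Demo with a fake server that finishes on the third poll (placeholder data).
fake_statuses = iter([
    {"state": "pending"},
    {"state": "pending"},
    {"state": "done", "results": {"example_task": "placeholder"}},
])
results = poll_until_done(lambda: next(fake_statuses),
                          poll_interval=10.0, sleep=lambda seconds: None)
print(results)
```

Injecting the sleep function keeps the loop easy to test, and in the real system this waiting happens off the training loop so training is not blocked.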

Configuration

Enable automatic merging and evaluation:

trainer_args:
  eval_strategy: "steps"
  eval_steps: 500
  save_steps: 500  # Must match eval_steps
  
  eval_config:
    server_url: "http://192.168.8.249:8000"
    poll_interval: 10.0
    checkpoint_key: "model"
    model: "qwen_vl"
    tasks:
      - "mmmu_val"
      - "textvqa_val"
    model_args:
      num_gpus: 8
      batch_size: 256
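Because save_steps must match eval_steps, a quick sanity check before launching a run can catch a mismatch early. The snippet is a minimal sketch: check_step_alignment is a hypothetical helper, and it assumes the trainer_args section has already been loaded into a Python dict.

```python
def check_step_alignment(trainer_args):
    """Raise ValueError unless save_steps and eval_steps are set and equal."""
    eval_steps = trainer_args.get("eval_steps")
    save_steps = trainer_args.get("save_steps")
    if eval_steps is None or save_steps is None:
        raise ValueError("both eval_steps and save_steps must be set")
    if eval_steps != save_steps:
        raise ValueError(
            f"save_steps ({save_steps}) must match eval_steps ({eval_steps})"
        )
    return True


trainer_args = {"eval_strategy": "steps", "eval_steps": 500, "save_steps": 500}
print(check_step_alignment(trainer_args))
# True
```

The two values need to agree so that a saved checkpoint exists at every step where an evaluation is requested.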

The LMMS-Eval server will automatically:

  • Detect FSDP2 checkpoint format

  • Merge the shards using the appropriate method

  • Evaluate the merged checkpoint

  • Return results to your training run

Benefits

  • No Manual Merging: Checkpoints are merged automatically as part of evaluation

  • Non-Blocking: Training continues while merging and evaluation happen in the background

  • Distributed Evaluation: Merge and evaluation run on a separate server, freeing training resources

  • Automatic Tracking: Evaluation results are logged with the correct training step

See Asynchronous Checkpoint Evaluation for complete documentation.