Asynchronous Checkpoint Evaluation During Training

LMMs Engine supports asynchronous evaluation of model checkpoints during training. This allows you to evaluate your model without interrupting the training process, by submitting evaluation jobs to a separate LMMS-Eval server.

Overview

When enabled, the training system:

  1. Submits evaluation jobs to an LMMS-Eval server when checkpoints are saved

  2. Continues training while evaluations run in the background

  3. Polls for evaluation results periodically

  4. Logs evaluation metrics when they become available

Prerequisites

Start the LMMS-Eval Server

You need to run the LMMS-Eval server before starting training. The server will handle evaluation requests and return results.

# Start the LMMS-Eval server on your evaluation machine
python -m lmms_eval.entrypoints.server --port 8000

The server will listen for evaluation requests and perform evaluations asynchronously.

Configuration

Enable asynchronous evaluation in your training configuration YAML:

trainer_args:
  # Enable evaluation at specific intervals
  eval_strategy: "steps"  # Options: "steps", "epoch", "no"
  eval_steps: 500  # Evaluate every N steps (when eval_strategy="steps")
  
  # Evaluation configuration
  eval_config:
    # Server configuration
    server_url: "http://192.168.8.249:8000"
    poll_interval: 10.0  # Poll server every 10 seconds
    
    # Model configuration
    model: "qwen_vl"  # Model name recognized by LMMS-Eval
    checkpoint_key: "model"  # Key to use in model_args for checkpoint path
    
    # Tasks to evaluate
    tasks:
      - "mmmu_val"
      - "textvqa_val"
      - "docvqa_val"
    
    # Model arguments passed to LMMS-Eval
    model_args:
      num_gpus: 8
      batch_size: 256
      max_length: 2048
      # Additional model-specific arguments

Configuration Parameters

eval_strategy

  • "steps": Evaluate every eval_steps training steps

  • "epoch": Evaluate at the end of each epoch

  • "no": Disable evaluation (default)

eval_config Parameters

Parameter

Type

Description

server_url

string

URL of the LMMS-Eval server (e.g., "http://localhost:8000")

poll_interval

float

Interval (seconds) to poll for evaluation results (default: 10.0)

model

string

Model name recognized by LMMS-Eval (e.g., "qwen_vl")

tasks

list

List of evaluation tasks (e.g., ["mmmu_val", "textvqa_val"])

checkpoint_key

string

Key used in model_args to specify checkpoint path

model_args

dict

Additional arguments passed to the model (e.g., num_gpus, batch_size)

How It Works

1. Checkpoint Saving

When a checkpoint is saved (according to save_steps), the trainer:

  • Determines the checkpoint path (e.g., ./output/checkpoint-500)

  • Creates an evaluation output directory (e.g., ./output/checkpoint-500/eval)

  • Submits an evaluation job to the LMMS-Eval server

2. Background Polling

A background thread:

  • Polls the LMMS-Eval server every poll_interval seconds

  • Checks if evaluation jobs are completed

  • Retrieves results when available

3. Metric Logging

When evaluation results are available:

  • Metrics are logged to your tracking system (e.g., W&B, TensorBoard)

  • Metrics include global_step to associate results with the training step

  • Example logged metrics: eval/mmmu_val/accuracy, eval/textvqa_val/accuracy

4. Training Completion

At the end of training:

  • The trainer waits for all pending evaluation jobs to complete

  • All remaining evaluation results are logged

  • Training exits only after all evaluations are finished

Example Configuration

Here’s a complete example with asynchronous evaluation enabled:

trainer_type: fsdp2_trainer

dataset_config:
  dataset_type: vision
  dataset_format: yaml
  datasets:
    - path: data/your_dataset
      data_folder: ""
      data_type: arrow
  
  processor_config:
    processor_name: "Qwen/Qwen3-VL-8B-Instruct"
    processor_type: "qwen3_vl"
  
  packing: true
  packing_strategy: first_fit
  packing_length: 16384

model_config:
  load_from_pretrained_path: "Qwen/Qwen3-VL-8B-Instruct"
  attn_implementation: "flash_attention_2"

trainer_args:
  per_device_train_batch_size: 1
  learning_rate: 1.0e-06
  num_train_epochs: 1
  save_steps: 500
  eval_steps: 500  # Must equal save_steps for consistent evaluation
  eval_strategy: "steps"
  save_total_limit: 2
  
  # Evaluation configuration
  eval_config:
    server_url: "http://192.168.8.249:8000"
    poll_interval: 10.0
    checkpoint_key: "model"
    model: "qwen_vl"
    tasks:
      - "mmmu_val"
      - "textvqa_val"
    model_args:
      num_gpus: 8
      batch_size: 256
  
  report_to: "wandb"
  output_dir: "./output/qwen3_vl"
  bf16: true
  gradient_checkpointing: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen3VLDecoderLayer"]
    reshard_after_forward: false

EMA Checkpoint Evaluation

If you have EMA (Exponential Moving Average) enabled, the system will automatically evaluate both regular and EMA checkpoints:

trainer_args:
  ema_enabled: true
  ema_decay: 0.9999
  ema_update_every: 1
  
  eval_config:
    server_url: "http://192.168.8.249:8000"
    # ... other config

The trainer will:

  • Evaluate regular checkpoints with checkpoint_type: "regular"

  • Evaluate EMA checkpoints with checkpoint_type: "ema"

  • Log both sets of metrics separately

Distributed Training

In distributed training (e.g., with torchrun), only rank 0:

  • Submits evaluation jobs

  • Polls for results

  • Logs evaluation metrics

This avoids duplicate submissions and redundant logging.

Monitoring Evaluation Progress

Check W&B/TensorBoard

Evaluation metrics appear in your tracking dashboard:

  • eval/mmmu_val/accuracy

  • eval/textvqa_val/accuracy

  • eval/textvqa_val/anls

  • etc.

Each metric is associated with the training step via global_step.

Check Evaluation Server Logs

The LMMS-Eval server logs:

  • Received evaluation requests

  • Evaluation progress

  • Completed evaluations

Check Training Logs

The training process logs:

  • When evaluation jobs are submitted

  • When results are received

  • Any errors during polling or logging

Troubleshooting

Evaluations Not Starting

  1. Verify the LMMS-Eval server is running at server_url

  2. Check network connectivity from training machine to evaluation server

  3. Verify the checkpoint path exists and contains valid weights

Evaluation Results Not Appearing

  1. Check poll_interval - increase if network is slow

  2. Check LMMS-Eval server logs for errors

  3. Verify task names are correct and supported by LMMS-Eval

Duplicate Evaluations

Ensure eval_steps matches save_steps or adjust evaluation frequency to match checkpoint saving frequency.

Best Practices

  1. Network Bandwidth: Use a dedicated evaluation machine if network bandwidth is limited

  2. Resource Allocation: Allocate sufficient GPUs for evaluation in model_args.num_gpus

  3. Checkpoint Frequency: Balance between save_steps and evaluation frequency

  4. Task Selection: Choose representative tasks that don’t take too long

  5. Poll Interval: Adjust poll_interval based on your network and evaluation speed

  6. Output Management: Use save_total_limit to manage disk space for checkpoints

Additional Resources