# Asynchronous Checkpoint Evaluation During Training

LMMs Engine supports asynchronous evaluation of model checkpoints during training. This allows you to evaluate your model without interrupting the training process, by submitting evaluation jobs to a separate LMMS-Eval server.

## Overview

When enabled, the training system:
1. Submits evaluation jobs to an LMMS-Eval server when checkpoints are saved
2. Continues training while evaluations run in the background
3. Polls for evaluation results periodically
4. Logs evaluation metrics when they become available

## Prerequisites

### Start the LMMS-Eval Server

You need to run the LMMS-Eval server before starting training. The server will handle evaluation requests and return results.

```bash
# Start the LMMS-Eval server on your evaluation machine
python -m lmms_eval.entrypoints.server --port 8000
```

The server will listen for evaluation requests and perform evaluations asynchronously.

## Configuration

Enable asynchronous evaluation in your training configuration YAML:

```yaml
trainer_args:
  # Enable evaluation at specific intervals
  eval_strategy: "steps"  # Options: "steps", "epoch", "no"
  eval_steps: 500  # Evaluate every N steps (when eval_strategy="steps")
  
  # Evaluation configuration
  eval_config:
    # Server configuration
    server_url: "http://192.168.8.249:8000"
    poll_interval: 10.0  # Poll server every 10 seconds
    
    # Model configuration
    model: "qwen_vl"  # Model name recognized by LMMS-Eval
    checkpoint_key: "model"  # Key to use in model_args for checkpoint path
    
    # Tasks to evaluate
    tasks:
      - "mmmu_val"
      - "textvqa_val"
      - "docvqa_val"
    
    # Model arguments passed to LMMS-Eval
    model_args:
      num_gpus: 8
      batch_size: 256
      max_length: 2048
      # Additional model-specific arguments
```

### Configuration Parameters

#### `eval_strategy`

- `"steps"`: Evaluate every `eval_steps` training steps
- `"epoch"`: Evaluate at the end of each epoch
- `"no"`: Disable evaluation (default)

#### `eval_config` Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `server_url` | string | URL of the LMMS-Eval server (e.g., `"http://localhost:8000"`) |
| `poll_interval` | float | Interval (seconds) to poll for evaluation results (default: `10.0`) |
| `model` | string | Model name recognized by LMMS-Eval (e.g., `"qwen_vl"`) |
| `tasks` | list | List of evaluation tasks (e.g., `["mmmu_val", "textvqa_val"]`) |
| `checkpoint_key` | string | Key used in model_args to specify checkpoint path |
| `model_args` | dict | Additional arguments passed to the model (e.g., `num_gpus`, `batch_size`) |

## How It Works

### 1. Checkpoint Saving

When a checkpoint is saved (according to `save_steps`), the trainer:
- Determines the checkpoint path (e.g., `./output/checkpoint-500`)
- Creates an evaluation output directory (e.g., `./output/checkpoint-500/eval`)
- Submits an evaluation job to the LMMS-Eval server

### 2. Background Polling

A background thread:
- Polls the LMMS-Eval server every `poll_interval` seconds
- Checks if evaluation jobs are completed
- Retrieves results when available

### 3. Metric Logging

When evaluation results are available:
- Metrics are logged to your tracking system (e.g., W&B, TensorBoard)
- Metrics include `global_step` to associate results with the training step
- Example logged metrics: `eval/mmmu_val/accuracy`, `eval/textvqa_val/accuracy`

### 4. Training Completion

At the end of training:
- The trainer waits for all pending evaluation jobs to complete
- All remaining evaluation results are logged
- Training exits only after all evaluations are finished

## Example Configuration

Here's a complete example with asynchronous evaluation enabled:

```yaml
trainer_type: fsdp2_trainer

dataset_config:
  dataset_type: vision
  dataset_format: yaml
  datasets:
    - path: data/your_dataset
      data_folder: ""
      data_type: arrow
  
  processor_config:
    processor_name: "Qwen/Qwen3-VL-8B-Instruct"
    processor_type: "qwen3_vl"
  
  packing: true
  packing_strategy: first_fit
  packing_length: 16384

model_config:
  load_from_pretrained_path: "Qwen/Qwen3-VL-8B-Instruct"
  attn_implementation: "flash_attention_2"

trainer_args:
  per_device_train_batch_size: 1
  learning_rate: 1.0e-06
  num_train_epochs: 1
  save_steps: 500
  eval_steps: 500  # Must equal save_steps for consistent evaluation
  eval_strategy: "steps"
  save_total_limit: 2
  
  # Evaluation configuration
  eval_config:
    server_url: "http://192.168.8.249:8000"
    poll_interval: 10.0
    checkpoint_key: "model"
    model: "qwen_vl"
    tasks:
      - "mmmu_val"
      - "textvqa_val"
    model_args:
      num_gpus: 8
      batch_size: 256
  
  report_to: "wandb"
  output_dir: "./output/qwen3_vl"
  bf16: true
  gradient_checkpointing: true
  fsdp2: true
  fsdp_config:
    transformer_layer_cls_to_wrap: ["Qwen3VLDecoderLayer"]
    reshard_after_forward: false
```

## EMA Checkpoint Evaluation

If you have EMA (Exponential Moving Average) enabled, the system will automatically evaluate both regular and EMA checkpoints:

```yaml
trainer_args:
  ema_enabled: true
  ema_decay: 0.9999
  ema_update_every: 1
  
  eval_config:
    server_url: "http://192.168.8.249:8000"
    # ... other config
```

The trainer will:
- Evaluate regular checkpoints with `checkpoint_type: "regular"`
- Evaluate EMA checkpoints with `checkpoint_type: "ema"`
- Log both sets of metrics separately

## Distributed Training

In distributed training (e.g., with `torchrun`), only rank 0:
- Submits evaluation jobs
- Polls for results
- Logs evaluation metrics

This avoids duplicate submissions and redundant logging.

## Monitoring Evaluation Progress

### Check W&B/TensorBoard

Evaluation metrics appear in your tracking dashboard:
- `eval/mmmu_val/accuracy`
- `eval/textvqa_val/accuracy`
- `eval/textvqa_val/anls`
- etc.

Each metric is associated with the training step via `global_step`.

### Check Evaluation Server Logs

The LMMS-Eval server logs:
- Received evaluation requests
- Evaluation progress
- Completed evaluations

### Check Training Logs

The training process logs:
- When evaluation jobs are submitted
- When results are received
- Any errors during polling or logging

## Troubleshooting

### Evaluations Not Starting

1. Verify the LMMS-Eval server is running at `server_url`
2. Check network connectivity from training machine to evaluation server
3. Verify the checkpoint path exists and contains valid weights

### Evaluation Results Not Appearing

1. Check `poll_interval` - increase if network is slow
2. Check LMMS-Eval server logs for errors
3. Verify task names are correct and supported by LMMS-Eval

### Duplicate Evaluations

Ensure `eval_steps` matches `save_steps` or adjust evaluation frequency to match checkpoint saving frequency.

## Best Practices

1. **Network Bandwidth**: Use a dedicated evaluation machine if network bandwidth is limited
2. **Resource Allocation**: Allocate sufficient GPUs for evaluation in `model_args.num_gpus`
3. **Checkpoint Frequency**: Balance between `save_steps` and evaluation frequency
4. **Task Selection**: Choose representative tasks that don't take too long
5. **Poll Interval**: Adjust `poll_interval` based on your network and evaluation speed
6. **Output Management**: Use `save_total_limit` to manage disk space for checkpoints

## Additional Resources

- [LMMS-Eval Repository](https://github.com/EvolvingLMMs-Lab/lmms-eval)
- [Merge FSDP Checkpoints](merge_fsdp.md)
- [Training Guide](../getting_started/train.md)