dLLM (Diffusion Language Model) Training

Overview

dLLM (Diffusion Language Model) is a masked diffusion language model adapted from the Qwen3 architecture.
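
To make "masked diffusion" concrete, here is a minimal sketch of the corruption step commonly used in this family of models: sample a mask ratio, replace that fraction of tokens with a mask token, and train the model to recover the originals. It is a generic illustration, not necessarily what the dllm collator does; `MASK_ID` and `mask_tokens` are hypothetical names, and the real mask token id depends on the tokenizer.

```python
import random

MASK_ID = -1  # hypothetical mask token id; the real id depends on the tokenizer


def mask_tokens(token_ids, mask_ratio, rng):
    """Independently replace each token with MASK_ID with probability mask_ratio.

    Returns (noisy, targets): targets holds the original token at masked
    positions and None elsewhere, so the loss covers masked positions only.
    """
    noisy, targets = [], []
    for tok in token_ids:
        if rng.random() < mask_ratio:
            noisy.append(MASK_ID)
            targets.append(tok)
        else:
            noisy.append(tok)
            targets.append(None)
    return noisy, targets


rng = random.Random(0)                 # seeded for reproducibility
tokens = [11, 22, 33, 44, 55, 66]
mask_ratio = rng.uniform(0.0, 1.0)     # one noise level per sequence
noisy, targets = mask_tokens(tokens, mask_ratio, rng)
print(noisy, targets)
```

A mask ratio near 0 leaves the sequence almost intact, while a ratio near 1 hides almost everything, so sampling the ratio per sequence exposes the model to every noise level.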

Supported Features

Feature	Support
FSDP2	✅
USP	❌
Muon Optimizer	✅
Liger Kernel	❌
Packing	❌
NSA	❌
Expert Parallelism	❌

Highlights: Masked diffusion language model

Quick Start

See the example configurations, run script, and documentation:

Example Configs: examples/diffusion_language_model/
Run Script: examples/diffusion_language_model/run.sh
Documentation: examples/diffusion_language_model/README.md

Available Configurations

dllm_train_muon_single_gpu.yaml: Single GPU with Muon optimizer
dllm_train_muon_multi_gpu_fsdp2.yaml: Multi-GPU FSDP2 with Muon
dllm_train_adam_multi_gpu_deepspeed.yaml: Multi-GPU DeepSpeed with Adam

Key Configuration

trainer_type: dllm_trainer

dataset_config:
  dataset_type: fineweb_edu
  dataset_format: hf_dataset
  dataset_path: HuggingFaceFW/fineweb-edu
  extra_kwargs:
    collator_type: dllm

model_config:
  load_from_config:
    model_type: qwen3_dllm
    config:
      vocab_size: 151936
      hidden_size: 1024
      num_hidden_layers: 24

trainer_args:
  use_muon: true
  fsdp2: true
  bf16: true
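
As a rough sanity check on the model size implied by this config, the sketch below computes the size of the token embedding table alone. The config above does not give the intermediate size or the number of attention heads, so a full parameter count is not attempted; the 2-bytes-per-parameter figure assumes the weights are held in bfloat16.

```python
# Values copied from model_config above.
config = {"vocab_size": 151936, "hidden_size": 1024, "num_hidden_layers": 24}

# One hidden_size-wide vector per vocabulary entry.
embedding_params = config["vocab_size"] * config["hidden_size"]

# Assumption: 2 bytes per parameter (bfloat16 storage).
embedding_mib = embedding_params * 2 / 2**20

print(f"embedding parameters: {embedding_params:,}")   # 155,582,464
print(f"embedding size (bf16): {embedding_mib} MiB")   # 296.75 MiB
```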

Architecture

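dLLM uses non-causal (bidirectional) attention: every position attends to tokens on both its left and its right, which lets the model predict masked tokens from their full context, as masked diffusion language modeling requires.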
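
The difference from a standard autoregressive model can be shown with the attention masks themselves. This is a minimal sketch in plain Python, not the model's actual implementation; `causal_mask` and `bidirectional_mask` are hypothetical helper names, and `True` at row `i`, column `j` means position `i` may attend to position `j`.

```python
def causal_mask(n):
    """Lower-triangular mask: position i sees only positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Full mask: every position sees every other position."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as rows of 1 (visible) and 0 (hidden)."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in mask)


print("causal:")
print(show(causal_mask(4)))
print("non-causal:")
print(show(bidirectional_mask(4)))
```

With the causal mask, a masked token in the middle of a sequence could only use the tokens before it; with the full mask it can also use the unmasked tokens that follow.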