dLLM (Diffusion Language Model) Training
Overview
dLLM (Diffusion Language Model) is a masked diffusion language model adapted from the Qwen3 architecture.
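As a rough illustration of what "masked diffusion" refers to, the sketch below shows a forward corruption step in the style commonly used by masked diffusion language models: pick a masking ratio `t`, then independently replace each token with a mask id with probability `t`. The `MASK_ID` value and the `corrupt` helper are hypothetical names chosen for this example, not taken from the dLLM code.

```python
import random

MASK_ID = -1  # hypothetical mask token id, not the real tokenizer's value


def corrupt(tokens, t, rng):
    """Independently replace each token with MASK_ID with probability t.

    Returns the corrupted sequence and the list of masked positions.
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("t must be in (0, 1]")
    noisy, masked = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < t:
            noisy.append(MASK_ID)
            masked.append(i)
        else:
            noisy.append(tok)
    return noisy, masked


rng = random.Random(0)
tokens = [11, 22, 33, 44, 55, 66, 77, 88]
t = rng.uniform(0.05, 1.0)  # masking ratio sampled per sequence
noisy, masked = corrupt(tokens, t, rng)
print(noisy, masked)
```

In a training loop built on this idea, the model would see `noisy` and the loss would typically be computed only at the positions listed in `masked`.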
Supported Features
| Feature | Support |
|---|---|
| FSDP2 | ✅ |
| USP | ❌ |
| Muon Optimizer | ✅ |
| Liger Kernel | ❌ |
| Packing | ❌ |
| NSA | ❌ |
| Expert Parallelism | ❌ |
Highlights: Masked diffusion language model
Quick Start
See the example configurations, run script, and README:
Example Configs: examples/diffusion_language_model/
Run Script: examples/diffusion_language_model/run.sh
Documentation: examples/diffusion_language_model/README.md
Available Configurations
dllm_train_muon_single_gpu.yaml: Single GPU with Muon optimizer
dllm_train_muon_multi_gpu_fsdp2.yaml: Multi-GPU FSDP2 with Muon
dllm_train_adam_multi_gpu_deepspeed.yaml: Multi-GPU DeepSpeed with Adam
Key Configuration
trainer_type: dllm_trainer
dataset_config:
  dataset_type: fineweb_edu
  dataset_format: hf_dataset
  dataset_path: HuggingFaceFW/fineweb-edu
  extra_kwargs:
    collator_type: dllm
model_config:
  load_from_config:
    model_type: qwen3_dllm
    config:
      vocab_size: 151936
      hidden_size: 1024
      num_hidden_layers: 24
trainer_args:
  use_muon: true
  fsdp2: true
  bf16: true
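To give a feel for the sizes this configuration implies, the short calculation below works out the input embedding table from `vocab_size` and `hidden_size`. It is only a back-of-the-envelope sketch: whether the output projection shares this table is not stated here, and the total parameter count depends on fields the fragment does not show.

```python
# Values copied from the configuration fragment above.
vocab_size = 151936
hidden_size = 1024

BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits

# One hidden_size-dimensional vector per vocabulary entry.
embedding_params = vocab_size * hidden_size
embedding_mib = embedding_params * BYTES_PER_BF16 / 2**20

print(f"embedding parameters: {embedding_params:,}")  # 155,582,464
print(f"bf16 weight size: {embedding_mib:.2f} MiB")   # 296.75 MiB
```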
Architecture
dLLM uses non-causal attention for masked diffusion language modeling, enabling bidirectional context understanding.
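The difference between causal and non-causal attention can be shown with a toy visibility mask. This is a minimal sketch for illustration only; the `attention_mask` and `visible` helpers are hypothetical and do not reflect how the model builds its masks internally.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len boolean matrix.

    Entry [i][j] is True when query position i may attend to key position j.
    """
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]


def visible(mask, i):
    """List the key positions that query position i can attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


causal = attention_mask(4, causal=True)
bidirectional = attention_mask(4, causal=False)

print(visible(causal, 1))         # [0, 1]
print(visible(bidirectional, 1))  # [0, 1, 2, 3]
```

With the non-causal mask, a masked token at position 1 can draw on the tokens at positions 2 and 3 as well as position 0, which is what lets the model fill in masked positions using context from both sides.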