dLLM (Diffusion Language Model) Training

Overview

dLLM (Diffusion Language Model) is a masked-diffusion language model adapted from the Qwen3 architecture.
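As a rough illustration of what "masked diffusion" means for training data, the sketch below corrupts a token sequence by masking each position independently with probability t. The function name `corrupt_tokens`, the `MASK_ID` value, and the independent per-token masking rule are assumptions made for illustration (one common formulation of masked diffusion), not this trainer's actual collator.

```python
import random

MASK_ID = -1  # hypothetical placeholder; the real mask token id depends on the tokenizer


def corrupt_tokens(tokens, t, rng, mask_id=MASK_ID):
    """Mask each token independently with probability t (0 <= t <= 1).

    Returns the noisy sequence and a parallel list of flags marking
    which positions were masked.
    """
    noisy, masked = [], []
    for tok in tokens:
        hit = rng.random() < t  # rng.random() is in [0.0, 1.0)
        noisy.append(mask_id if hit else tok)
        masked.append(hit)
    return noisy, masked


rng = random.Random(0)
tokens = [101, 7592, 2088, 2003, 2307, 102]  # illustrative ids, not real tokenizer output
t = rng.random()  # one noise level drawn per sequence
noisy, masked = corrupt_tokens(tokens, t, rng)
print(noisy, masked)
```

In this formulation the model would be trained to recover the original tokens at the masked positions. A noise level t near 1 hides almost the whole sequence, while t near 0 leaves it nearly intact.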

Supported Features

Feature

Support

FSDP2

USP

Muon Optimizer

Liger Kernel

Packing

NSA

Expert Parallelism

Highlights: Masked diffusion language model

Quick Start

See the example configuration and run script:

Available Configurations

  • dllm_train_muon_single_gpu.yaml: Single GPU with Muon optimizer

  • dllm_train_muon_multi_gpu_fsdp2.yaml: Multi-GPU FSDP2 with Muon

  • dllm_train_adam_multi_gpu_deepspeed.yaml: Multi-GPU DeepSpeed with Adam

Key Configuration

trainer_type: dllm_trainer

dataset_config:
  dataset_type: fineweb_edu
  dataset_format: hf_dataset
  dataset_path: HuggingFaceFW/fineweb-edu
  extra_kwargs:
    collator_type: dllm

model_config:
  load_from_config:
    model_type: qwen3_dllm
    config:
      vocab_size: 151936
      hidden_size: 1024
      num_hidden_layers: 24

trainer_args:
  use_muon: true   # use the Muon optimizer
  fsdp2: true      # shard with FSDP2
  bf16: true       # train in bfloat16
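To give a feel for the scale these settings imply, the arithmetic below estimates the size of the token embedding table alone from `vocab_size` and `hidden_size`. This is only a partial estimate: the transformer layers depend on fields not shown in this excerpt, and whether the output head shares weights with the embedding is not stated here.

```python
# Values copied from the model_config block above.
vocab_size = 151936
hidden_size = 1024
BYTES_PER_BF16 = 2  # bfloat16 is a 16-bit format

# One hidden_size-dimensional vector per vocabulary entry.
embedding_params = vocab_size * hidden_size
embedding_mib = embedding_params * BYTES_PER_BF16 / 2**20

print(f"embedding parameters: {embedding_params:,}")  # 155,582,464
print(f"bf16 storage: {embedding_mib:.2f} MiB")        # 296.75 MiB
```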

Architecture

dLLM uses non-causal (bidirectional) attention for masked diffusion language modeling: every position can attend to every other position, so a masked token is predicted from both its left and right context.
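The difference from a causal decoder can be sketched with a boolean attention mask. The helper names `attention_mask` and `visible_positions` are hypothetical and written in plain Python for illustration. They do not reflect how the model code builds its masks.

```python
def attention_mask(seq_len, causal):
    """Return mask[i][j] == True when query position i may attend to key position j."""
    return [[(j <= i) or not causal for j in range(seq_len)] for i in range(seq_len)]


def visible_positions(mask, i):
    """List the key positions that query position i can see."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


causal = attention_mask(4, causal=True)
bidirectional = attention_mask(4, causal=False)

print(visible_positions(causal, 1))         # [0, 1]
print(visible_positions(bidirectional, 1))  # [0, 1, 2, 3]
```

With the causal mask, position 1 sees only itself and position 0. With the non-causal mask it also sees positions 2 and 3, which is what lets a masked token use context on both sides.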