SiT (Scalable Interpolant Transformer) Training

Overview

SiT is a diffusion transformer for image generation built on the interpolant framework. This recipe supports ImageNet-1K training with Classifier-Free Guidance (CFG).

Supported Features

| Feature            | Support |
|--------------------|---------|
| FSDP2              |         |
| USP                |         |
| Muon Optimizer     |         |
| Liger Kernel       |         |
| Packing            |         |
| NSA                |         |
| Expert Parallelism |         |

Highlights: Interpolant Transformer, CFG, ImageNet-1K

Quick Start

See the example configuration and run script:

Model Variants

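| Model    | Parameters | Hidden Size | Depth | Heads |
|----------|------------|-------------|-------|-------|
| SiT-S/2  | ~33M       | 384         | 12    | 6     |
| SiT-B/2  | ~130M      | 768         | 12    | 12    |
| SiT-L/2  | ~458M      | 1024        | 24    | 16    |
| SiT-XL/2 | ~675M      | 1152        | 28    | 16    |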
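As a sanity check on the table above, the parameter counts can be roughly reproduced from hidden size and depth alone. The sketch below assumes DiT-style blocks (MLP ratio 4 plus adaLN modulation), which gives about 18 × hidden_size² weights per block; that block layout is an assumption, not something stated in this document, and biases, embeddings, and the final layer are ignored. The helper names are hypothetical.

```python
# Rough parameter estimate for the SiT variants in the table.
# Assumption: each block holds about 4*h^2 (attention) + 8*h^2 (MLP, ratio 4)
# + 6*h^2 (adaLN modulation) weights. Biases and embeddings are ignored.

VARIANTS = {
    # name: (hidden_size, depth, num_heads, listed_params_in_millions)
    "SiT-S/2": (384, 12, 6, 33),
    "SiT-B/2": (768, 12, 12, 130),
    "SiT-L/2": (1024, 24, 16, 458),
    "SiT-XL/2": (1152, 28, 16, 675),
}


def estimate_params_millions(hidden_size: int, depth: int) -> float:
    """Approximate transformer-block parameters, in millions."""
    per_block = 18 * hidden_size ** 2
    return depth * per_block / 1e6


def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head width; hidden_size must divide evenly across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


if __name__ == "__main__":
    for name, (h, d, n, listed) in VARIANTS.items():
        est = estimate_params_millions(h, d)
        print(f"{name}: ~{est:.0f}M estimated vs ~{listed}M listed, "
              f"head_dim={head_dim(h, n)}")
```

Under these assumptions the estimates land within about 4% of the listed figures for all four variants, with the remainder plausibly coming from the ignored embeddings and output layer.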

Key Configuration

model_config:
  load_from_config:
    model_type: "sit"
    hidden_size: 1152      # XL model
    depth: 28              # XL model
    num_heads: 16          # XL model
    vae_path: "stabilityai/sd-vae-ft-ema"
    path_type: "Linear"    # one of: Linear, GVP, VP
    prediction: "velocity"
    cfg_scale: 1.0

trainer_args:
  bf16: true
  fsdp2: true
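To make the role of `vae_path` and the `/2` suffix concrete, here is a small sketch of how an input image maps to the token sequence the transformer sees. It assumes a 256×256 input (the resolution is not stated in this document), the 8× spatial downsampling and 4 latent channels of the Stable Diffusion VAE, and a patch size of 2 as suggested by the model names. The function name is hypothetical.

```python
# Sketch: image -> VAE latent -> patch tokens.
# Assumptions (not from the config above): 256x256 images, a VAE that
# downsamples by 8 into 4 latent channels, and patch size 2.

def latent_and_tokens(image_size=256, vae_downsample=8,
                      latent_channels=4, patch_size=2):
    """Return (latent_shape, num_tokens, features_per_token)."""
    if image_size % vae_downsample != 0:
        raise ValueError("image_size must be divisible by vae_downsample")
    latent_size = image_size // vae_downsample
    if latent_size % patch_size != 0:
        raise ValueError("latent size must be divisible by patch_size")
    grid = latent_size // patch_size
    latent_shape = (latent_channels, latent_size, latent_size)
    num_tokens = grid * grid
    features_per_token = latent_channels * patch_size ** 2
    return latent_shape, num_tokens, features_per_token


if __name__ == "__main__":
    shape, tokens, feats = latent_and_tokens()
    print(f"latent shape {shape}, {tokens} tokens, {feats} features per token")
```

With these defaults the latent is 4×32×32 and the sequence length is 256 tokens; doubling the image size to 512 quadruples the sequence length to 1024.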

Features

  • Interpolant Paths: Linear, GVP, VP

  • EMA: Exponential Moving Average for stable generation

  • CFG: Classifier-Free Guidance support

  • VAE: Stable Diffusion VAE for latent space encoding
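The features listed above can be illustrated with a minimal pure-Python sketch. It assumes the interpolant x_t = alpha_t * x + sigma_t * eps with t = 0 at data and t = 1 at noise; the training code may use the opposite time direction. Only the Linear and GVP paths are shown, since the VP path depends on a noise schedule that this document does not specify. The CFG combination and the EMA update use their standard textbook forms, and all function names and the default decay value are hypothetical.

```python
import math


def path_coefficients(path_type: str, t: float):
    """Return (alpha, sigma, d_alpha/dt, d_sigma/dt) at time t in [0, 1]."""
    if path_type == "Linear":
        return 1.0 - t, t, -1.0, 1.0
    if path_type == "GVP":
        half_pi = math.pi / 2
        return (math.cos(half_pi * t), math.sin(half_pi * t),
                -half_pi * math.sin(half_pi * t),
                half_pi * math.cos(half_pi * t))
    raise ValueError(f"unsupported path_type in this sketch: {path_type}")


def interpolate(path_type, t, data, noise):
    """Build the noisy sample x_t and the velocity target for training."""
    alpha, sigma, d_alpha, d_sigma = path_coefficients(path_type, t)
    x_t = [alpha * x + sigma * e for x, e in zip(data, noise)]
    velocity = [d_alpha * x + d_sigma * e for x, e in zip(data, noise)]
    return x_t, velocity


def apply_cfg(cond, uncond, cfg_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one. cfg_scale = 1.0 returns the conditional
    prediction unchanged."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


def ema_update(ema, params, decay=0.9999):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


if __name__ == "__main__":
    x_t, v = interpolate("Linear", 0.25, [1.0, 2.0], [0.0, 4.0])
    print("x_t:", x_t, "velocity target:", v)
    print("guided:", apply_cfg([2.0, 3.0], [1.0, 1.0], cfg_scale=4.0))
```

Note that with `cfg_scale: 1.0`, as in the configuration shown earlier, this guidance formula leaves the conditional prediction untouched; values above 1.0 strengthen class conditioning at sampling time.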