SiT (Scalable Interpolant Transformer) Training
Overview
SiT is a diffusion transformer for image generation built on the interpolant framework. It supports ImageNet-1K training with Classifier-Free Guidance (CFG).
Supported Features
| Feature | Support |
|---|---|
| FSDP2 | ✅ |
| USP | ❌ |
| Muon Optimizer | ✅ |
| Liger Kernel | ❌ |
| Packing | ❌ |
| NSA | ❌ |
| Expert Parallelism | ❌ |
Highlights: Interpolant Transformer, CFG, ImageNet-1K
Quick Start
See the example configuration and run script:
Example Config: examples/scalable_interpolant_transformer/sit_xl_2.yaml
Run Script: examples/scalable_interpolant_transformer/run.sh
Documentation: examples/scalable_interpolant_transformer/README.md
Model Variants
| Model | Parameters | Hidden Size | Depth | Heads |
|---|---|---|---|---|
| SiT-S/2 | ~33M | 384 | 12 | 6 |
| SiT-B/2 | ~130M | 768 | 12 | 12 |
| SiT-L/2 | ~458M | 1024 | 24 | 16 |
| SiT-XL/2 | ~675M | 1152 | 28 | 16 |
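As a rough sanity check on the table, the parameter counts can be approximated from hidden size and depth alone. The sketch below assumes a DiT-style block (self-attention, an MLP with expansion ratio 4, and an adaLN modulation layer that maps the conditioning vector to six per-block vectors) and ignores biases, embeddings, and the final layer, so it lands slightly below the listed totals. The function name `estimate_block_params` and the block layout are illustrative assumptions, not taken from this repository.

```python
# Variant table from this document: name -> (hidden_size, depth, num_heads)
VARIANTS = {
    "SiT-S/2": (384, 12, 6),
    "SiT-B/2": (768, 12, 12),
    "SiT-L/2": (1024, 24, 16),
    "SiT-XL/2": (1152, 28, 16),
}


def estimate_block_params(hidden_size: int, depth: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of the transformer blocks only.

    Assumed (hypothetical) block layout: attention + MLP + adaLN modulation.
    Biases, patch/label/timestep embeddings, and the output layer are ignored.
    """
    attn = 4 * hidden_size**2              # QKV projection (3h^2) + output projection (h^2)
    mlp = 2 * mlp_ratio * hidden_size**2   # two linear layers: h -> 4h -> h
    adaln = 6 * hidden_size**2             # assumed modulation layer: h -> 6h
    return depth * (attn + mlp + adaln)


for name, (hidden, depth, heads) in VARIANTS.items():
    # Each variant's hidden size must split evenly across attention heads.
    assert hidden % heads == 0
    millions = estimate_block_params(hidden, depth) / 1e6
    print(f"{name}: ~{millions:.0f}M block weights, head_dim={hidden // heads}")
```

Under these assumptions the estimates come out at roughly 32M, 127M, 453M, and 669M, which is within a few percent of the table; the remainder would be the components the sketch leaves out.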
Key Configuration
model_config:
  load_from_config:
    model_type: "sit"
    hidden_size: 1152  # XL model
    depth: 28          # XL model
    num_heads: 16
    vae_path: "stabilityai/sd-vae-ft-ema"
    path_type: "Linear"
    prediction: "velocity"
    cfg_scale: 1.0
trainer_args:
  bf16: true
  fsdp2: true
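To illustrate what `path_type: "Linear"` combined with `prediction: "velocity"` means for the training target, here is a minimal sketch in plain Python. It assumes the convention x_t = alpha(t) * data + sigma(t) * noise, with t = 0 being pure noise and t = 1 being pure data; the trainer in this repository may use the opposite time direction. The function names are hypothetical, and the VP path is omitted because its schedule depends on extra constants not given here.

```python
import math


def path_coefficients(path_type: str, t: float):
    """Return (alpha, sigma, d_alpha/dt, d_sigma/dt) at time t in [0, 1].

    Assumed convention: x_t = alpha * data + sigma * noise.
    """
    if path_type == "Linear":
        return t, 1.0 - t, 1.0, -1.0
    if path_type == "GVP":
        # Variance-preserving trigonometric path: alpha^2 + sigma^2 = 1.
        alpha = math.sin(0.5 * math.pi * t)
        sigma = math.cos(0.5 * math.pi * t)
        return alpha, sigma, 0.5 * math.pi * sigma, -0.5 * math.pi * alpha
    raise ValueError(f"path type not covered by this sketch: {path_type}")


def interpolate(data, noise, t, path_type="Linear"):
    """Build the noisy input x_t and the velocity target d(x_t)/dt."""
    alpha, sigma, d_alpha, d_sigma = path_coefficients(path_type, t)
    x_t = [alpha * x + sigma * e for x, e in zip(data, noise)]
    velocity = [d_alpha * x + d_sigma * e for x, e in zip(data, noise)]
    return x_t, velocity


def velocity_loss(predicted, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


# Toy two-element "latent" in place of a VAE-encoded image.
x_t, v = interpolate(data=[1.0, -2.0], noise=[0.5, 0.5], t=0.25)
print(x_t, v)
```

For the linear path the velocity target reduces to `data - noise`, independent of t, which is one reason this path is a common default.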
Features
Interpolant Paths: Linear, GVP, VP
EMA: Exponential Moving Average of model weights for more stable generation
CFG: Classifier-Free Guidance support
VAE: Stable Diffusion VAE for latent space encoding
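The CFG and EMA features above can be sketched in a few lines. This is a generic illustration of the two techniques rather than this repository's implementation: the function names are hypothetical, the lists stand in for tensors, and the default decay of 0.9999 is an assumed value.

```python
def cfg_combine(cond, uncond, cfg_scale):
    """Classifier-free guidance: uncond + cfg_scale * (cond - uncond).

    With cfg_scale = 1.0 (the value in the config above) the result equals
    the conditional prediction, so guidance is effectively off.
    """
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


def ema_update(ema_params, params, decay=0.9999):
    """Move the EMA copy a small step toward the current training weights."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


cond = [1.0, 2.0]      # model output given the class label
uncond = [0.0, 1.0]    # model output given the null (dropped) label
print(cfg_combine(cond, uncond, cfg_scale=1.0))
print(cfg_combine(cond, uncond, cfg_scale=4.0))
print(ema_update([0.0], [2.0], decay=0.5))
```

Values of `cfg_scale` above 1.0 push samples further toward the conditional prediction, typically trading diversity for class fidelity. Sampling usually draws from the EMA copy of the weights rather than the raw training weights.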