# FLA Models (DGN) Training

## Overview

FLA (Flash Linear Attention) models, including Gated DeltaNet (DGN), are efficient linear-attention architectures. Pretraining on FineWeb-Edu is supported.

## Supported Features

| Feature | Support |
|---------|---------|
| **FSDP2** | ✅ |
| **USP** | ❌ |
| **Muon Optimizer** | ✅ |
| **Liger Kernel** | ❌ |
| **Packing** | ✅ |
| **NSA** | ❌ |
| **Expert Parallelism** | ❌ |

**Highlights**: Efficient architecture, FineWeb-Edu pretraining

## Quick Start

See the example configuration and run script:

- **Example Config**: [examples/dgn/train_dgn_1b.yaml](../../examples/dgn/train_dgn_1b.yaml)
- **Run Script**: [examples/dgn/run.sh](../../examples/dgn/run.sh)

## Key Configuration

```yaml
dataset_config:
  dataset_type: fineweb_edu
  dataset_format: hf_dataset
  dataset_path: HuggingFaceFW/fineweb-edu
  packing_length: 2048

model_config:
  load_from_config:
    model_type: gated_deltanet
    config:
      vocab_size: 151936
      hidden_size: 1024
      intermediate_size: 4096
      num_hidden_layers: 24

trainer_args:
  use_muon: true
  fsdp2: true
  bf16: true
```

## About FLA

FLA models replace standard softmax attention with linear attention mechanisms. Their cost grows linearly with sequence length rather than quadratically, which makes them more efficient on long sequences.
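
To illustrate why linear attention scales better on long sequences, here is a minimal pure-Python sketch of plain causal linear attention (no softmax, no normalization). It is not the FLA library's implementation, and it leaves out the gating and delta-rule update that Gated DeltaNet adds; the function names are hypothetical and chosen for this example only. The recurrent form keeps a fixed-size state and visits each token once, while the quadratic form revisits every earlier token, yet both produce the same outputs.

```python
def linear_attention_recurrent(q, k, v):
    """Causal linear attention with a running state: one pass over the sequence.

    q, k: lists of T vectors of size d_k; v: list of T vectors of size d_v.
    The state is a d_k x d_v matrix holding the sum of outer products k_s v_s^T.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        # Fold the current token into the state: state += outer(k_t, v_t).
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k_t[i] * v_t[j]
        # Read out: o_t = q_t^T @ state.
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def linear_attention_quadratic(q, k, v):
    """Same computation written as masked attention: O(T^2) token pairs."""
    d_v = len(v[0])
    outputs = []
    for t, q_t in enumerate(q):
        out_t = [0.0] * d_v
        for s in range(t + 1):  # causal mask: only positions s <= t
            score = sum(a * b for a, b in zip(q_t, k[s]))
            for j in range(d_v):
                out_t[j] += score * v[s][j]
        outputs.append(out_t)
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [1.0, 1.0]]
    v = [[2.0], [3.0]]
    print(linear_attention_recurrent(q, k, v))
    print(linear_attention_quadratic(q, k, v))
```

The state size depends only on `d_k` and `d_v`, not on the sequence length, which is where the memory and compute savings on long sequences come from.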