CuFlash-Attn: From-Scratch CUDA FlashAttention
Technical Whitepaper · O(N) Memory · FP32/FP16 · Forward & Backward
Handle 16K+ token sequences on a single GPU via FlashAttention tiling. No O(N²) attention matrices stored in HBM. (See Algorithm Details.)
Pure CUDA C++ with no PyTorch, no Cutlass, no Triton. Understand every line. Modify every detail. (See Kernel Deep Dive.)
Forward and backward passes with gradient recomputation. FP32 and FP16 with numerically-safe accumulation. (See API Reference.)
Optimized kernels for Volta through Hopper (sm_70 → sm_90). V100, A100, H100, and consumer GPUs. (See Benchmarks.)
Stable C ABI for easy integration with Python, Rust, or any language supporting FFI. (See C API Docs.)
Docs, workflows, and repository structure stay intentionally minimal and aligned with the actual library. (See Project Status.)

FlashAttention reduces memory from O(N²) to O(N), enabling training on much longer sequences.
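To make the O(N) claim concrete, here is a minimal pure-Python sketch of the online-softmax idea behind FlashAttention tiling, for a single query row. It is not the library's CUDA kernel: the function names, the block size, and the sample data are illustrative assumptions. The tiled version visits keys and values one block at a time and keeps only a running maximum, a running softmax denominator, and an output accumulator, so the full row of N scores is never held at once.

```python
import math


def naive_attention_row(q, keys, values, scale):
    """Reference: computes all N scores for one query row before the softmax."""
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    row_max = max(scores)
    weights = [math.exp(s - row_max) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention_row(q, keys, values, scale, block=2):
    """Streams over key/value blocks with a running max and denominator."""
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim  # unnormalized output accumulator
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


# Illustrative data: 5 keys/values with head dimension 2.
q = [0.1, 0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
scale = 1.0 / math.sqrt(len(q))

reference = naive_attention_row(q, keys, values, scale)
tiled = tiled_attention_row(q, keys, values, scale, block=2)
```

Both functions should agree up to floating-point rounding for any block size. The per-block rescaling step is what lets the tiled version avoid a second pass over the scores.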
| Sequence Length | Standard Attention | FlashAttention | Memory Saved |
|---|---|---|---|
| 1,024 | 4 MB | 8 KB | 99.8% |
| 4,096 | 64 MB | 32 KB | 99.95% |
| 16,384 | 1 GB | 128 KB | 99.99% |
| 65,536 | 16 GB | 512 KB | 99.997% |
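The sizes in the table are consistent with a simple model: one N×N score matrix at 4 bytes per entry for standard attention, against 8 bytes of per-row state for FlashAttention. Both per-element sizes are inferred from the table rather than stated in the source, and the figures presumably count a single batch element and head. A short sketch under those assumptions:

```python
def attention_matrix_bytes(seq_len, bytes_per_score=4):
    """Bytes for a full N x N score matrix (assumed 4 bytes per score)."""
    return seq_len * seq_len * bytes_per_score


def row_state_bytes(seq_len, bytes_per_row=8):
    """Bytes for per-row softmax state (assumed 8 bytes per row)."""
    return seq_len * bytes_per_row


def saved_percent(seq_len):
    """Share of memory avoided by not storing the score matrix."""
    full = attention_matrix_bytes(seq_len)
    return 100.0 * (1.0 - row_state_bytes(seq_len) / full)


for n in (1024, 4096, 16384, 65536):
    print(n, attention_matrix_bytes(n), row_state_bytes(n),
          round(saved_percent(n), 3))
```

Under this model the saving is 1 − 2/N, so it grows with sequence length.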
Measured on NVIDIA A100 80GB with FP16 precision and causal masking.
| Configuration | FlashAttention | Standard | Speedup |
|---|---|---|---|
| Batch=1, Seq=1024 | 45.2 tok/s | 12.1 tok/s | 3.7x |
| Batch=8, Seq=1024 | 312.5 tok/s | 45.3 tok/s | 6.9x |
| Batch=32, Seq=1024 | 892.1 tok/s | 98.7 tok/s | 9.0x |
Build and run in under 5 minutes:

    git clone https://github.com/AICL-Lab/cuflash-attn.git
    cd cuflash-attn
    cmake --preset release
    cmake --build --preset release
    ctest --preset release --output-on-failure

C++:

    #include "cuflash/flash_attention.h"

    auto err = cuflash::flash_attention_forward(
        d_Q, d_K, d_V, d_O, d_L,
        batch_size, num_heads, seq_len, head_dim,
        scale, true, stream
    );

Python (ctypes):

    import ctypes

    lib = ctypes.CDLL("./build/release/libcuflash_attn.so")
    lib.cuflash_attention_forward_f32(
        q_ptr, k_ptr, v_ptr, o_ptr, l_ptr,
        B, H, N, D, scale, True, None
    )
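When calling through ctypes, it usually helps to declare the function's argument and return types first. Without `argtypes`, ctypes cannot convert a Python `float` and passes plain Python integers as C `int`, which truncates 64-bit device addresses. The sketch below shows one way to do that. Every C type in it is an assumption inferred from the call shape above (five pointers, four integer dimensions, a float scale, a boolean flag that presumably selects causal masking, and a stream handle), and `declare_forward_f32` is a hypothetical helper, so check the types against the C API docs before relying on them.

```python
import ctypes


def declare_forward_f32(lib):
    """Attach assumed argument/return types to the FP32 forward entry point."""
    fn = lib.cuflash_attention_forward_f32
    fn.argtypes = (
        [ctypes.c_void_p] * 5   # Q, K, V, O, L device pointers (assumed)
        + [ctypes.c_int] * 4    # B, H, N, D (assumed int)
        + [
            ctypes.c_float,     # scale (assumed float)
            ctypes.c_bool,      # boolean flag, presumably causal masking
            ctypes.c_void_p,    # stream handle; None is passed as NULL
        ]
    )
    fn.restype = ctypes.c_int   # assumed: integer status code
    return fn
```

Typical use would be `forward = declare_forward_f32(ctypes.CDLL("./build/release/libcuflash_attn.so"))`, followed by the same call as above through `forward`.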