CuFlash-Attn

From-Scratch CUDA FlashAttention

Technical Whitepaper · O(N) Memory · FP32/FP16 · Forward & Backward


v0.3.0 Stable

99.9% Memory Saved

8.9x Max Speedup

0 Dependencies

⚡

O(N) Memory

Handle 16K+ token sequences on a single GPU via FlashAttention tiling. No O(N²) attention matrices stored in HBM.

Algorithm Details →

📦

Zero Dependencies

Pure CUDA C++ with no PyTorch, no Cutlass, no Triton. Understand every line. Modify every detail.

Kernel Deep Dive →

🔄

Full Training Support

Forward and backward passes with gradient recomputation. FP32 and FP16 with numerically-safe accumulation.

API Reference →

🎯

Multi-Architecture

Optimized kernels for Volta through Hopper (sm_70 → sm_90). V100, A100, H100, and consumer GPUs.

Benchmarks →

📐

C ABI Stable

Stable C ABI for easy integration with Python, Rust, or any language supporting FFI.

C API Docs →

🔬

Lean Maintenance

Docs, workflows, and repository structure stay intentionally minimal and aligned with the actual library.

Project Status →
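The tiling and numerically safe accumulation named in the feature cards above can be illustrated with a small pure-Python sketch of online-softmax attention. This is not the library's kernel: the function names, the tile size, and the toy inputs are all hypothetical, there is no causal masking, and the per-row value `L` is assumed to be the log-sum-exp statistic that FlashAttention-style implementations keep so the backward pass can recompute probabilities.

```python
import math

def naive_attention(Q, K, V, scale):
    """Reference: builds every full score row, O(N^2) scores in total."""
    out = []
    for q in Q:
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(s)
        p = [math.exp(x - m) for x in s]
        z = sum(p)
        out.append([sum(p[j] * V[j][d] for j in range(len(V))) / z
                    for d in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, scale, block=2):
    """Online-softmax attention over key/value tiles of size `block`.

    Per query row only a running max, a running sum, and an unnormalized
    output accumulator are kept; a full score row is never stored.
    Returns (O, L) where L[i] is the log-sum-exp of row i's scores.
    """
    d_v = len(V[0])
    O, L = [], []
    for q in Q:
        m = -math.inf        # running row maximum
        l = 0.0              # running sum of exp(score - m)
        acc = [0.0] * d_v    # running unnormalized output
        for start in range(0, len(K), block):
            k_tile = K[start:start + block]
            v_tile = V[start:start + block]
            s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
            m_new = max(m, max(s))
            alpha = math.exp(m - m_new)  # rescales old statistics to the new max
            p = [math.exp(x - m_new) for x in s]
            l = alpha * l + sum(p)
            acc = [alpha * a + sum(p[j] * v_tile[j][d] for j in range(len(p)))
                   for d, a in enumerate(acc)]
            m = m_new
        O.append([a / l for a in acc])
        L.append(m + math.log(l))
    return O, L

# Toy inputs (hypothetical): 3 queries, 5 keys/values, head_dim = 2.
Q = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.5]]
K = [[0.5, -0.1], [0.2, 0.7], [-0.3, 0.9], [1.2, 0.4], [0.0, -1.0]]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 4.0]]
scale = 1.0 / math.sqrt(2)

O_tiled, L = tiled_attention(Q, K, V, scale)
O_ref = naive_attention(Q, K, V, scale)
```

Both functions produce the same output up to floating-point rounding. Given `L[i]`, any probability in row `i` can be rebuilt as `exp(score - L[i])` without storing the score matrix, which is the idea behind recomputation in the backward pass.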

⚡ Memory Efficiency

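FlashAttention reduces attention memory from O(N²) to O(N) in the sequence length N, enabling training on much longer sequences. The Standard Attention figures below are consistent with one N × N score matrix stored in FP32 (4 bytes per score).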

Sequence Length	Standard Attention	FlashAttention	Memory Saved
1,024	4 MB	8 KB	99.8%
4,096	64 MB	32 KB	99.95%
16,384	1 GB	128 KB	99.99%
65,536	16 GB	512 KB	99.997%
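The figures in this table can be re-derived with a few lines of arithmetic. The sketch below assumes 4 bytes per score for the standard N × N matrix and 8 bytes of extra state per token for FlashAttention; both constants are inferred from the table values rather than stated by the project, and the function name is hypothetical.

```python
def attention_memory(seq_len, bytes_per_score=4, flash_bytes_per_token=8):
    """Return (standard_bytes, flash_bytes, percent_saved) for one attention head.

    bytes_per_score and flash_bytes_per_token are assumptions chosen to
    match the table; they are not taken from the library's source.
    """
    standard = seq_len * seq_len * bytes_per_score   # full N x N score matrix
    flash = seq_len * flash_bytes_per_token          # O(N) per-row state
    saved = 100.0 * (1.0 - flash / standard)
    return standard, flash, saved

for n in (1024, 4096, 16384, 65536):
    standard, flash, saved = attention_memory(n)
    print(f"{n:>6}  {standard / 2**20:>6.0f} MiB  {flash / 2**10:>4.0f} KiB  {saved:.3f}%")
```

Because the standard cost grows with N² and the FlashAttention cost with N, the ratio between them is N / 2 under these assumptions, so the saving approaches 100% as sequences get longer.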

🚀 Throughput Comparison

Measured on NVIDIA A100 80GB with FP16 precision and causal masking.

Configuration	FlashAttention	Standard	Speedup
Batch=1, Seq=1024	45.2 tok/s	12.1 tok/s	3.7x
Batch=8, Seq=1024	312.5 tok/s	45.3 tok/s	6.9x
Batch=32, Seq=1024	892.1 tok/s	98.7 tok/s	9.0x

Quick Start

Build and run in under 5 minutes:

Clone & Build

```bash
git clone https://github.com/AICL-Lab/cuflash-attn.git
cd cuflash-attn

cmake --preset release
cmake --build --preset release

ctest --preset release --output-on-failure
```

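C++ Usage

```cpp
#include "cuflash/flash_attention.h"

// d_Q, d_K, d_V, d_O, d_L are device pointers; scale is typically 1/sqrt(head_dim).
// The boolean is assumed here to be the causal-mask flag; see the API reference.
auto err = cuflash::flash_attention_forward(
    d_Q, d_K, d_V, d_O, d_L,
    batch_size, num_heads, seq_len, head_dim,
    scale, true, stream
);
```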

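Python Binding

```python
import ctypes

lib = ctypes.CDLL("./build/release/libcuflash_attn.so")

# Pass device pointers as ctypes.c_void_p so 64-bit addresses are not
# truncated to a C int. ctypes cannot convert a bare Python float, so scale
# is wrapped explicitly (c_float assumes the C parameter is a 32-bit float;
# check the header).
lib.cuflash_attention_forward_f32(
    q_ptr, k_ptr, v_ptr, o_ptr, l_ptr,
    B, H, N, D, ctypes.c_float(scale), True, None
)
```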

Quick Start: Preset-based build and first steps
Algorithm: Tiling, online softmax, recomputation
Kernel Deep Dive: Shared memory, warp scheduling
API Reference: Complete C++ and C ABI docs
Benchmarks: Reproducible performance data
Related Work: Papers and implementations

Core References

FlashAttention — Dao et al., NeurIPS 2022.
arXiv:2205.14135

FlashAttention-2 — Dao, ICLR 2024.
arXiv:2307.08691

Online Softmax — Milakov & Gimelshein.
arXiv:1805.02867