
CuFlash-Attn: From-Scratch CUDA FlashAttention

Technical Whitepaper · O(N) Memory · FP32/FP16 · Forward & Backward

CuFlash-Attn
v0.3.0 Stable
99.9% Memory Saved
8.9x Max Speedup
0 Dependencies

O(N) Memory

Handle 16K+ token sequences on a single GPU via FlashAttention tiling. No O(N²) attention matrices stored in HBM.

Algorithm Details →
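
As a rough illustration of the tiling idea (not the library's kernel), the pure-Python sketch below streams over K and V in blocks and keeps only a running maximum, a running sum, and an output accumulator per query row. This is the online-softmax technique from Milakov & Gimelshein. The function names, the block size, and the sample data are illustrative assumptions; causal masking and FP16 handling are left out.

```python
import math


def naive_attention(q, k, v, scale):
    """Reference: builds a full row of scores before applying softmax."""
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, scale, block=2):
    """Streams over K/V in blocks; per query row it keeps only a running
    max, a running sum, and an unnormalized output accumulator."""
    dim = len(v[0])
    out, lse = [], []
    for qi in q:
        m = -math.inf        # running row maximum
        l = 0.0              # running sum of exp(score - m)
        acc = [0.0] * dim    # unnormalized output
        for start in range(0, len(k), block):
            k_blk = k[start:start + block]
            v_blk = v[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            m_new = max(m, max(scores))
            correction = math.exp(m - m_new)  # rescale earlier partial results
            l *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, v_blk):
                p = math.exp(s - m_new)
                l += p
                acc = [a + p * x for a, x in zip(acc, vj)]
            m = m_new
        out.append([a / l for a in acc])
        lse.append(m + math.log(l))  # per-row log-sum-exp
    return out, lse


# Illustrative data: 3 queries, 5 keys/values, head_dim = 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.0], [0.0, 0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
scale = 1.0 / math.sqrt(2)

reference = naive_attention(q, k, v, scale)
tiled, row_lse = tiled_attention(q, k, v, scale, block=2)
max_diff = max(abs(a - b) for r1, r2 in zip(tiled, reference) for a, b in zip(r1, r2))
print("max abs difference vs. naive:", max_diff)
```

The per-row log-sum-exp returned alongside the output is the kind of O(N) statistic that FlashAttention-style implementations typically keep so the backward pass can recompute attention weights instead of storing them.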
📦

Zero Dependencies

Pure CUDA C++ with no PyTorch, no CUTLASS, no Triton. Understand every line. Modify every detail.

Kernel Deep Dive →
🔄

Full Training Support

Forward and backward passes with gradient recomputation. FP32 and FP16 with numerically safe accumulation.

API Reference →
🎯

Multi-Architecture

Optimized kernels for Volta through Hopper (sm_70 → sm_90). V100, A100, H100, and consumer GPUs.

Benchmarks →
📐

C ABI Stable

Stable C ABI for easy integration with Python, Rust, or any language supporting FFI.

C API Docs →
🔬

Lean Maintenance

Docs, workflows, and repository structure stay intentionally minimal and aligned with the actual library.

Project Status →

⚡ Memory Efficiency

FlashAttention reduces attention's extra memory from O(N²) to O(N) in the sequence length N, enabling training on much longer sequences.

| Sequence Length | Standard Attention | FlashAttention | Memory Saved |
|---|---|---|---|
| 1,024 | 4 MB | 8 KB | 99.8% |
| 4,096 | 64 MB | 32 KB | 99.95% |
| 16,384 | 1 GB | 128 KB | 99.99% |
| 65,536 | 16 GB | 512 KB | 99.997% |
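
The figures in the table can be reproduced with simple arithmetic. The sketch below assumes 4 bytes per score for the N × N matrix and 8 bytes of per-row bookkeeping for FlashAttention, for a single head and batch element. Both constants are inferred from the table values rather than stated by the project, and the function names are illustrative.

```python
def standard_attention_bytes(seq_len: int, bytes_per_score: int = 4) -> int:
    """Bytes needed to materialize one N x N attention score matrix."""
    return seq_len * seq_len * bytes_per_score


def flash_attention_bytes(seq_len: int, bytes_per_row: int = 8) -> int:
    """Bytes of per-row bookkeeping kept instead of the N x N matrix
    (8 bytes per row is an assumption inferred from the table)."""
    return seq_len * bytes_per_row


def memory_saved_percent(seq_len: int) -> float:
    """Relative saving of the O(N) footprint over the O(N^2) one."""
    std = standard_attention_bytes(seq_len)
    flash = flash_attention_bytes(seq_len)
    return 100.0 * (1.0 - flash / std)


for n in (1024, 4096, 16384, 65536):
    print(n,
          standard_attention_bytes(n) // 1024**2, "MB",
          flash_attention_bytes(n) // 1024, "KB",
          f"{memory_saved_percent(n):.3f}%")
```

Under these assumptions the saving is 1 − 2/N, so it keeps approaching 100% as the sequence grows.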

🚀 Throughput Comparison

Measured on NVIDIA A100 80GB with FP16 precision and causal masking.

| Configuration | FlashAttention | Standard | Speedup |
|---|---|---|---|
| Batch=1, Seq=1024 | 45.2 tok/s | 12.1 tok/s | 3.7x |
| Batch=8, Seq=1024 | 312.5 tok/s | 45.3 tok/s | 6.9x |
| Batch=32, Seq=1024 | 892.1 tok/s | 98.7 tok/s | 9.0x |

Quick Start

Build and run in under 5 minutes:

```bash
git clone https://github.com/AICL-Lab/cuflash-attn.git
cd cuflash-attn

# Configure and build with the release preset
cmake --preset release
cmake --build --preset release

# Run the test suite
ctest --preset release --output-on-failure
```
```cpp
#include "cuflash/flash_attention.h"

auto err = cuflash::flash_attention_forward(
    d_Q, d_K, d_V, d_O, d_L,
    batch_size, num_heads, seq_len, head_dim,
    scale, true, stream
);
```
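```python
import ctypes

lib = ctypes.CDLL("./build/release/libcuflash_attn.so")

# q_ptr ... l_ptr are device pointers; wrap them in ctypes.c_void_p so
# 64-bit addresses are not truncated to a C int.
# ctypes cannot convert a bare Python float, so scale is wrapped explicitly
# (c_float is an assumption for the FP32 entry point).
lib.cuflash_attention_forward_f32(
    q_ptr, k_ptr, v_ptr, o_ptr, l_ptr,
    B, H, N, D, ctypes.c_float(scale), True, None
)
```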
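
The call above takes raw device pointers and a precomputed scale, so the caller is responsible for sizing the buffers. Below is a minimal sketch of that bookkeeping. It assumes a contiguous [batch, heads, seq_len, head_dim] layout, one FP32 log-sum-exp value per query row in the L buffer, and the conventional 1/sqrt(head_dim) scale. None of these is confirmed by this page and the helper names are hypothetical, so check the header before relying on them.

```python
import math


def default_scale(head_dim: int) -> float:
    """Conventional softmax scale for scaled dot-product attention."""
    return 1.0 / math.sqrt(head_dim)


def forward_buffer_sizes(batch: int, heads: int, seq_len: int,
                         head_dim: int, elem_bytes: int = 4) -> dict:
    """Byte sizes for the five forward buffers under the assumed layout.

    Q, K, V, O: batch * heads * seq_len * head_dim elements each.
    L:          batch * heads * seq_len FP32 values (assumed).
    """
    qkvo = batch * heads * seq_len * head_dim * elem_bytes
    lse = batch * heads * seq_len * 4
    return {"q": qkvo, "k": qkvo, "v": qkvo, "o": qkvo, "l": lse}


B, H, N, D = 2, 8, 1024, 64
sizes = forward_buffer_sizes(B, H, N, D)
scale = default_scale(D)
print(sizes, scale)
```

These sizes are what a caller would pass to whatever device allocator it uses (for example `cudaMalloc`) before handing the resulting pointers to the forward call.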

Core References

FlashAttention. Dao et al., NeurIPS 2022. arXiv:2205.14135
FlashAttention-2. Dao, ICLR 2024. arXiv:2307.08691
Online Softmax. Milakov & Gimelshein. arXiv:1805.02867

Stable v0.3.0 baseline. Lean CUDA FlashAttention reference.