
Architecture Overview

This page describes the architecture of CuFlash-Attn for researchers and engineers who need to understand how the system is designed.


System Architecture


Data Flow

Forward Pass

Backward Pass


Memory Layout


Kernel Tiling Strategy

Tile Dimensions

| Parameter | Description | Typical Value |
|-----------|-------------|---------------|
| B_r | Query tile size | 128 |
| B_c | Key/Value tile size | 64 |
| D | Head dimension | 64, 128 |
| T_r | Threads per query tile | 128 |
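
The tile sizes above determine how many tiles a sequence is split into. The following is a minimal sketch, not the project's code: `TileConfig`, `TileCounts`, `ceil_div`, and `count_tiles` are hypothetical names, and it assumes self-attention where queries and keys/values share one sequence length.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical container for the parameters in the table above.
struct TileConfig {
    std::size_t br = 128;  // B_r: query rows per tile
    std::size_t bc = 64;   // B_c: key/value rows per tile
    std::size_t d = 128;   // D: head dimension
};

// Integer division rounded up, so a partial last tile is still counted.
inline std::size_t ceil_div(std::size_t a, std::size_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

struct TileCounts {
    std::size_t query_tiles;  // tiles along the query axis
    std::size_t kv_tiles;     // tiles along the key/value axis
};

// Assumes queries and keys/values have the same length (self-attention).
inline TileCounts count_tiles(std::size_t seq_len, const TileConfig& cfg) {
    return TileCounts{ceil_div(seq_len, cfg.br), ceil_div(seq_len, cfg.bc)};
}
```

With the default values, a sequence of 1000 tokens gives 8 query tiles and 16 key/value tiles; the last tile in each direction is only partially filled and would need bounds checks in a kernel.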

Memory Complexity

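$$\text{SRAM} = O(B_r \times D + B_c \times D + B_r \times B_c)$$

For typical values ($B_r = 128$, $B_c = 64$, $D = 128$):

$$128 \times 128 + 64 \times 128 + 128 \times 64 = 32{,}768 \text{ elements}$$

This is 32K elements, not bytes. The byte footprint depends on the element type: 64 KB at 2 bytes per element (FP16) or 128 KB at 4 bytes per element (FP32).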
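
The sum above counts elements, so the byte footprint scales with the element width. A small sketch of that arithmetic, assuming one query tile, one key or value tile, and one score tile are resident at a time, exactly as the formula states. The function names and the budget parameter are illustrative and not part of CuFlash-Attn.

```cpp
#include <cassert>
#include <cstddef>

// Elements held on-chip for one tile step: Q tile (Br x D), one K or V tile
// (Bc x D), and the score tile (Br x Bc). Mirrors the formula above.
constexpr std::size_t tile_elements(std::size_t br, std::size_t bc, std::size_t d) {
    return br * d + bc * d + br * bc;
}

constexpr std::size_t tile_bytes(std::size_t br, std::size_t bc, std::size_t d,
                                 std::size_t bytes_per_element) {
    return tile_elements(br, bc, d) * bytes_per_element;
}

// Worked example: Br = 128, Bc = 64, D = 128.
static_assert(tile_elements(128, 64, 128) == 32768, "32K elements");
static_assert(tile_bytes(128, 64, 128, 2) == 64 * 1024, "64 KiB at 2 bytes (FP16)");
static_assert(tile_bytes(128, 64, 128, 4) == 128 * 1024, "128 KiB at 4 bytes (FP32)");

// True when the working set fits a caller-supplied shared-memory budget.
// The budget is an input here; real limits depend on the GPU architecture.
inline bool fits_budget(std::size_t bytes, std::size_t budget_bytes) {
    assert(budget_bytes > 0);
    return bytes <= budget_bytes;
}
```

A check like `fits_budget(tile_bytes(128, 64, 128, 4), budget)` makes it clear when a tile configuration that works in FP16 needs smaller tiles in FP32.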

Directory Structure

cuflash-attn/
├── include/cuflash/          # Public API headers
│   ├── flash_attention.h     # C++ namespace API
│   └── flash_attention_c.h   # C ABI
├── src/
│   ├── api/                  # API dispatch layer
│   │   └── flash_attention_api.cu
│   ├── forward/              # Forward kernels
│   │   ├── forward_kernel_f32.cu
│   │   └── forward_kernel_f16.cu
│   ├── backward/             # Backward kernels
│   │   ├── backward_kernel_f32.cu
│   │   └── backward_kernel_f16.cu
│   └── kernels/              # Shared utilities
│       ├── softmax.cuh
│       └── memory.cuh
└── tests/
    ├── unit/                  # Unit tests
    └── integration/           # Integration tests

Error Handling Flow


Performance Characteristics

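| Operation | Memory | Compute | Bandwidth Bound |
|-----------|--------|---------|-----------------|
| Forward | $O(N)$ | $O(N^2)$ | Yes (low D) |
| Backward | $O(N)$ | $O(N^2)$ | Yes (low D) |
| Recompute | $O(1)$ | $O(N^2)$ | Yes |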

Key Insight

FlashAttention reduces memory from $O(N^2)$ to $O(N)$ by never materializing the full $N \times N$ attention matrix. The trade-off is that attention scores are recomputed during the backward pass. This adds arithmetic but avoids writing and re-reading the full matrix through global memory, which is usually the better trade on modern GPUs.
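
The way a row of attention can be computed without holding all of its scores is the online softmax recurrence. Below is a CPU sketch for one query row with scalar values, offered as an illustration rather than the kernel's actual code. The function names are hypothetical. For simplicity the scores arrive as a full vector; a real kernel would compute each tile of scores on the fly from Q and K and discard it after use. Scores are assumed to be finite.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Reference: softmax over all N scores at once, then weight the values.
inline double attend_naive(const std::vector<double>& scores,
                           const std::vector<double>& values) {
    assert(scores.size() == values.size() && !scores.empty());
    const double m = *std::max_element(scores.begin(), scores.end());
    double denom = 0.0, out = 0.0;
    for (std::size_t j = 0; j < scores.size(); ++j) {
        const double p = std::exp(scores[j] - m);
        denom += p;
        out += p * values[j];
    }
    return out / denom;
}

// Tiled: visit scores in blocks of bc and keep only three running scalars:
// the max so far (m), the softmax denominator (l), and the weighted sum (acc).
inline double attend_tiled(const std::vector<double>& scores,
                           const std::vector<double>& values, std::size_t bc) {
    assert(scores.size() == values.size() && !scores.empty() && bc > 0);
    double m = -std::numeric_limits<double>::infinity();
    double l = 0.0, acc = 0.0;
    for (std::size_t start = 0; start < scores.size(); start += bc) {
        const std::size_t end = std::min(start + bc, scores.size());
        double new_m = m;
        for (std::size_t j = start; j < end; ++j) new_m = std::max(new_m, scores[j]);
        // Rescale earlier partial results to the new max (scale is 0 on the first tile).
        const double scale = std::exp(m - new_m);
        l *= scale;
        acc *= scale;
        for (std::size_t j = start; j < end; ++j) {
            const double p = std::exp(scores[j] - new_m);
            l += p;
            acc += p * values[j];
        }
        m = new_m;
    }
    return acc / l;
}
```

Both functions return the same result up to rounding for any tile size. The tiled version's state per row is constant in N, which is what lets the kernel avoid storing the attention matrix.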

Stable v0.3.0 baseline. Lean CUDA FlashAttention reference.