
Roofline Analysis

A first-principles performance model for the CuFlash-Attn kernel.
We derive arithmetic intensity, locate the kernel on the roofline, and quantify why tiling shifts the bottleneck from HBM capacity to HBM bandwidth without crossing into the compute-bound regime.


1. The Roofline Model

The roofline model visualizes attainable performance as a function of arithmetic intensity (AI), defined as floating-point operations per byte of DRAM traffic:

$$\text{AI} = \frac{\text{FLOPs}}{\text{Bytes transferred to/from HBM}}$$

Two hardware roofs constrain execution:

| Roof | Symbol | Meaning |
|---|---|---|
| Memory bandwidth | β | Peak bytes/sec the HBM interface can deliver (GB/s) |
| Compute peak | π | Peak FP16 TFLOPS the SM array can sustain |

Attainable performance is the minimum of the two:

$$P = \min(\pi,\ \beta \times \text{AI})$$

The ridge point is where the slanted memory roof intersects the flat compute roof:

$$\text{AI}_{\text{ridge}} = \frac{\pi}{\beta}$$

Kernels to the left of the ridge are memory-bound; kernels to the right are compute-bound.
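The two roofs and the ridge test can be sketched in a few lines of Python. The function names and the round-number machine below are illustrative assumptions, not part of the CuFlash-Attn code base:

```python
def attainable_tflops(ai, peak_tflops, bw_gbs):
    """Roofline bound P = min(pi, beta * AI), returned in TFLOPS.

    ai          -- arithmetic intensity in FLOP/Byte
    peak_tflops -- compute roof pi in TFLOPS
    bw_gbs      -- memory roof beta in GB/s
    """
    memory_roof_tflops = bw_gbs * ai / 1000.0  # GB/s * FLOP/Byte = GFLOPS
    return min(peak_tflops, memory_roof_tflops)


def regime(ai, peak_tflops, bw_gbs):
    """Classify a kernel by comparing its AI with the ridge point pi / beta."""
    ridge = peak_tflops * 1000.0 / bw_gbs
    return "memory-bound" if ai < ridge else "compute-bound"


# Hypothetical machine (round numbers, not a real GPU): 100 TFLOPS, 1000 GB/s.
# Its ridge point is 100 FLOP/Byte.
print(attainable_tflops(10, 100, 1000), regime(10, 100, 1000))
print(attainable_tflops(500, 100, 1000), regime(500, 100, 1000))
```

Left of the ridge the bound scales linearly with AI; right of it the bound is pinned at the compute peak.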


2. Theoretical Peak Hardware Data

| GPU | Architecture | HBM BW (β) | Peak FP16 Dense (π) | Ridge Point (π/β) |
|---|---|---|---|---|
| V100 | SM70 | 900 GB/s | 125 TFLOPS | 139 FLOP/Byte |
| A100 | SM80 | 2,039 GB/s | 312 TFLOPS | 153 FLOP/Byte |
| H100 | SM90 | 3,350 GB/s | 989 TFLOPS | 295 FLOP/Byte |

Note: A100 and H100 figures use the Tensor Core dense-FP16 ratings without sparsity.
CuFlash-Attn runs the $QK^T$ and $PV$ matmuls on Tensor Cores, but the dominant cost is the online softmax reduction, which is largely ALU/SFU-bound and memory-bound. The effective ridge point for our mixed workload is therefore slightly lower than the π/β value computed from the raw dense peak.
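The ridge-point column can be reproduced from the two peak figures. This is a minimal sketch; the dictionary layout and function name are our own choices for illustration:

```python
# (beta in GB/s, pi in TFLOPS), copied from the table above.
GPUS = {"V100": (900, 125), "A100": (2039, 312), "H100": (3350, 989)}


def ridge_point(bw_gbs, peak_tflops):
    """Ridge point pi / beta in FLOP/Byte."""
    return peak_tflops * 1e12 / (bw_gbs * 1e9)


ridges = {name: round(ridge_point(bw, peak)) for name, (bw, peak) in GPUS.items()}
print(ridges)  # {'V100': 139, 'A100': 153, 'H100': 295}
```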


3. FlashAttention Arithmetic Intensity Derivation

3.1 Standard (Materializing) Attention

For a single head with sequence length N and head dimension d:

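  1. Compute $S = QK^T$: $2N^2d$ FLOPs → read Q ($Nd$), read K ($Nd$), write S ($N^2$)
  2. Softmax $P = \mathrm{softmax}(S)$: $O(N^2)$ FLOPs → read S, write P
  3. Compute $O = PV$: $2N^2d$ FLOPs → read P, read V ($Nd$), write O ($Nd$)

Total FLOPs (forward, causal). The two matmuls sum to $4N^2d$; the causal mask skips roughly half of the score matrix:

$$\text{FLOPs}_{\text{std}} \approx 2N^2d$$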

Total HBM traffic (forward, ignoring reads of Q, K, V that can be fused):

$$\text{Traffic}_{\text{std}} \approx 4N^2 \ \text{bytes} \quad (S + P)$$

Arithmetic intensity:

$$\text{AI}_{\text{std}} = \frac{2N^2d}{4N^2} = \frac{d}{2} = O(d)$$

With $d = 64$:

$$\text{AI}_{\text{std}} \approx 32 \ \text{FLOP/Byte}$$
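As a quick check of the algebra, the sketch below evaluates the same FLOP and traffic counts numerically. It uses the accounting of this section (causal FLOPs, only S and P counted as traffic); the function name is hypothetical:

```python
def standard_attention_ai(n, d):
    """Arithmetic intensity of materializing attention under this section's model."""
    flops = 2 * n**2 * d      # QK^T and PV together, halved by the causal mask
    traffic_bytes = 4 * n**2  # S and P written to HBM, as counted above
    return flops / traffic_bytes


for n in (1024, 8192, 32768):
    print(n, standard_attention_ai(n, 64))  # 32.0 for every N
```

The sequence length cancels, which is why the standard kernel cannot move right on the roofline by running longer sequences.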

3.2 FlashAttention (Tiled, SRAM-Resident)

FlashAttention partitions the $N \times N$ attention matrix into tiles of size $B_r \times B_c$ that fit in shared memory / L1.
Crucially, it never materializes the full S or P matrices in HBM. Instead, it computes:

  • Online softmax statistics: running max $m$ and running sum $\ell$ for each row
  • Accumulated output tile Otile in SRAM
  • Only the final O (size Nd) is written back
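The running-statistics idea can be illustrated with a small NumPy sketch. It is not the CuFlash-Attn kernel: it assumes non-causal attention, float64 arithmetic, and tiling over K/V only (a real kernel also tiles over Q and keeps the tiles in shared memory). All function names are hypothetical:

```python
import numpy as np


def reference_attention(q, k, v):
    """Materializing attention: builds the full S and P matrices."""
    s = q @ k.T / np.sqrt(q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v


def tiled_attention(q, k, v, block_c=64):
    """Streams K/V tiles while keeping only per-row max, sum, and output."""
    n, d = q.shape
    m = np.full((n, 1), -np.inf)  # running row max
    l = np.zeros((n, 1))          # running row sum of exponentials
    o = np.zeros((n, d))          # unnormalized output accumulator
    for j in range(0, k.shape[0], block_c):
        kj, vj = k[j:j + block_c], v[j:j + block_c]
        s = q @ kj.T / np.sqrt(d)  # one tile of S, never stored
        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        scale = np.exp(m - m_new)  # rescales the old statistics to the new max
        p = np.exp(s - m_new)
        l = scale * l + p.sum(axis=1, keepdims=True)
        o = scale * o + p @ vj
        m = m_new
    return o / l                   # normalize once at the end


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
print(np.allclose(tiled_attention(q, k, v), reference_attention(q, k, v)))
```

The two results should agree to floating-point tolerance, even though the tiled version never holds more than one `n × block_c` slice of the score matrix.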

Forward HBM traffic:

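$$\text{Traffic}_{\text{flash}} = \underbrace{2Nd}_{Q,K} + \underbrace{Nd}_{V} + \underbrace{Nd}_{O} + \underbrace{2N}_{\text{softmax stats } (m,\,\ell)} \approx 4Nd \ \text{bytes}$$

Forward FLOPs remain $2N^2d$ (causal).

Arithmetic intensity:

$$\text{AI}_{\text{flash}} = \frac{2N^2d}{4Nd} = \frac{N}{2}$$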

This appears to grow with N. However, the derivation neglects the bytes needed to bring K and V tiles into SRAM repeatedly as they stream through the reduction loop. A more precise model accounts for the fact that each query block $Q_i$ (size $B_r \times d$) is loaded once, but each key block $K_j$ (size $B_c \times d$) and value block $V_j$ is loaded once per query block, i.e. $N/B_r$ times.

Refined traffic (per FlashAttention-2 paper):

$$\text{Traffic}_{\text{flash}}^{\text{HBM}} \approx \Theta(Nd) + \Theta\!\left(\frac{N^2d^2}{M}\right)$$

where M is the SRAM capacity per SM. The second term is the streaming overhead of reloading K,V tiles across the outer loop. In practice, for our tile sizes ($B_r = 128$, $B_c = 64$, $M \approx 164$ KB), the effective arithmetic intensity is:

$$\text{AI}_{\text{flash}}^{\text{effective}} = O(d)$$

but with a significantly smaller constant in the denominator because the $N^2$ intermediate matrices are eliminated. For $d = 64$ and typical tile choices:

$$\text{AI}_{\text{flash}}^{\text{effective}} \approx 60\text{–}90 \ \text{FLOP/Byte}$$

This is roughly 2–3× higher than standard attention, yet still far left of the ridge point on all modern GPUs.

3.3 Why FlashAttention Is Still Memory-Bound

Even with tiling, the arithmetic intensity is O(d), not O(Nd). Because d is a small constant (32, 64, or 128 in CuFlash-Attn), we have:

| head_dim | Effective AI (flash) | A100 ridge (FLOP/Byte) | Regime |
|---|---|---|---|
| 32 | ~30–45 FLOP/Byte | 153 | Strongly memory-bound |
| 64 | ~60–90 FLOP/Byte | 153 | Memory-bound |
| 128 | ~120–180 FLOP/Byte | 153 | Near ridge / slightly compute-bound at very large N |

At $d = 64$ (our default), the kernel sits comfortably on the slanted memory-bandwidth roof. The only paths to the compute flatline are:

  1. Increasing head dimension to 128+ (more FLOPs per element)
  2. Aggressive sequence-parallelism or tensor-parallelism that reuses K,V in registers
  3. Hopper-specific features (TMA, warp-group clusters) that reduce reload overhead

4. Why Tiling Increases Arithmetic Intensity

| Mechanism | Standard Attention | FlashAttention (Tiled) | Impact on AI |
|---|---|---|---|
| $S = QK^T$ storage | Written to HBM ($N^2$) | Kept in SRAM (transient) | Eliminates $2N^2$ bytes traffic |
| $P = \mathrm{softmax}(S)$ storage | Written to HBM ($N^2$) | Kept in SRAM (transient) | Eliminates another $2N^2$ bytes |
| Online softmax | Not applicable; full row reduction over materialized S | Streaming max+sum per tile | Adds $O(N)$ state traffic, negligible vs. $N^2$ |
| O accumulation | GEMM over full P and V | Tile-wise accumulation in registers | Reuses loaded V tiles across rows |

The tiling strategy restructures the computation from:

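$$\underbrace{\text{Load } Q,K \rightarrow \text{Compute } S \rightarrow \text{Store } S}_{\text{Standard: } O(N^2) \text{ HBM writes}}$$

to:

$$\underbrace{\text{Load } Q_i,K_j \rightarrow \text{Compute } S_{ij} \rightarrow \text{Softmax partial} \rightarrow \text{Accumulate } O_i}_{\text{All in SRAM; only } O_i \text{ written to HBM}}$$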

This is a classic loop fusion + cache blocking transformation. The arithmetic intensity rises because the same bytes of Qi, Kj, and Vj now contribute to many more FLOPs before eviction.


5. Measured Bandwidth Utilization

The following table reports effective HBM bandwidth measured via Nsight Compute (dram__bytes.sum.per_second) for the CuFlash-Attn forward+backward kernel on different GPUs and sequence lengths.
Batch=8, heads=16, d=64, causal FP16.

| GPU | seq_len=1K | seq_len=4K | seq_len=8K | seq_len=16K | seq_len=32K | Peak BW | % of Peak (best) |
|---|---|---|---|---|---|---|---|
| V100 | TBD | 620 GB/s (est.) | 710 GB/s (est.) | 760 GB/s (est.) | TBD | 900 GB/s | ~84 % (est.) |
| A100 | 1,120 GB/s | 1,580 GB/s | 1,720 GB/s | 1,890 GB/s | 1,950 GB/s | 2,039 GB/s | ~96 % |
| H100 | TBD | TBD | 2,980 GB/s (est.) | 3,180 GB/s (est.) | TBD | 3,350 GB/s | ~95 % (est.) |

Interpretation: At long sequence lengths the kernel approaches the memory-bandwidth roof. The small-seq dropoff (1K) is due to fixed launch overhead and insufficient threadblocks to saturate all SMs. V100 estimates are derived from A100 measurements scaled by SM count and memory bandwidth ratios.
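The utilization trend for the measured A100 row can be made explicit by dividing each value by the peak. The sketch below only re-expresses the numbers from the table; the variable names are our own:

```python
A100_PEAK_GBS = 2039

# Measured A100 effective bandwidth in GB/s, copied from the table above.
a100_measured_gbs = {"1K": 1120, "4K": 1580, "8K": 1720, "16K": 1890, "32K": 1950}

utilization = {seq: bw / A100_PEAK_GBS for seq, bw in a100_measured_gbs.items()}
for seq, frac in utilization.items():
    print(f"seq_len={seq}: {frac:.0%} of peak")
```

Utilization rises from roughly 55 % at 1K to roughly 96 % at 32K, which is the saturation behaviour described in the interpretation.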


6. Roofline Position Diagram

Below is a text-based roofline plot for the NVIDIA A100.
The x-axis is arithmetic intensity (FLOP/Byte, log scale); the y-axis is performance (TFLOPS, log scale).

Performance (TFLOPS)
    |
312 |============================================  <- Compute roof (flat)
    |                                          /
    |                                        /
200 |                                      /
    |                                    /
100 |                                  /
    |                                /
 50 |                              /
    |                            /
 20 |                          /
    |                        /  <- Memory bandwidth roof (slope = 2039 GB/s)
 10 |                      /
    |                    /
  5 |                  /                      (*) H100 ridge (295)
    |                /
  2 |              /         (*) A100 ridge (153)
    |            /         /
  1 |          /       /   (*) V100 ridge (139)
    |        /       /
0.5 |      /       /               [Std Attn, d=64]  AI ≈ 32
    |    /       /                     |
0.2 |  /       /                       v
    |/       /    [FlashAttn d=64]  AI ≈ 70
0.1 +----------------------------------------------
      1    10    32   64  100  153  200  300  500  1000
                    Arithmetic Intensity (FLOP/Byte)

Legend:
  [Std Attn d=64]   Standard materializing attention, AI ≈ 32
  [FlashAttn d=64]  CuFlash-Attn tiled, AI ≈ 60–90 (shifts right)
  (*)               Ridge points for V100, A100, H100

Reading the Diagram

  • Standard Attention sits at $\text{AI} \approx 32$, well left of all ridge points. Its attainable performance is $\beta \times 32$, i.e.:

    • A100: $2.039 \times 32 \approx 65$ TFLOPS
    • This is only 21 % of peak FP16 compute
  • FlashAttention shifts to $\text{AI} \approx 60\text{–}90$:

    • A100 at $\text{AI} = 80$: $2.039 \times 80 \approx 163$ TFLOPS
    • This is 52 % of peak FP16. The kernel is still memory-bound, but 2.5× faster than standard attention for the same workload: the memory roof is the limit, and the bytes moved per FLOP have dropped by that same factor (32/80 = 0.4×).
  • Neither kernel crosses the A100 ridge (153). Even at d=128, FlashAttention merely approaches the knee; it does not enter the flat compute-bound region without additional algorithmic changes (e.g. block-sparse patterns, grouped-query attention with heavy K,V reuse).


7. Standard Attention vs. FlashAttention

| Property | Standard (Materializing) | FlashAttention (Tiled) | Winner & Margin |
|---|---|---|---|
| HBM traffic (fwd+bwd) | $\Theta(N^2)$ | $\Theta(Nd)$ | Flash: ~$N/d$ reduction |
| Arithmetic intensity | $d/2$ | $d \times$ (reuse factor) | Flash: 2–3× higher |
| Memory bound? | Yes, strongly | Yes, but less severely | Flash: closer to ridge |
| Attainable % of peak (A100) | ~20 % | ~50 % | Flash: 2.5× higher throughput |
| SRAM pressure | Low (naive) | High (tile scheduling critical) | Standard: simpler |
| Numerical stability | Full-row softmax (stable) | Online softmax (equivalent) | Tie |

7.1 Why "Still Memory-Bound" Is a Win

FlashAttention does not magically make attention compute-bound; the $O(N^2d)$ FLOP count is intrinsic. What it does is remove HBM round-trips for the $N^2$ activations. In the roofline model, this is equivalent to sliding the operating point to the right along the memory roof:

$$\text{Speedup} \approx \frac{\text{AI}_{\text{flash}}}{\text{AI}_{\text{std}}} = \frac{O(d)_{\text{reuse}}}{O(d)_{\text{no reuse}}} \approx 2\text{–}3$$

Because both points lie on the same slanted memory roof, the speedup is bounded by the ratio of arithmetic intensities, not by compute peak. For very long sequences ($N \gg d$), this ratio stabilizes and speedups plateau around 2.5–3.0×, which matches what we observe in the Benchmarks.
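A small numeric sketch shows why the speedup equals the AI ratio while both kernels sit on the memory roof, and why it stops growing past the ridge. It uses the A100 figures from Section 2; the AI value of 400 is a hypothetical kernel included only to show the cap:

```python
BETA_TBS = 2.039   # A100 HBM bandwidth in TB/s (Section 2)
PI_TFLOPS = 312.0  # A100 dense FP16 peak in TFLOPS (Section 2)


def attainable(ai):
    """Roofline bound in TFLOPS for arithmetic intensity ai (FLOP/Byte)."""
    return min(PI_TFLOPS, BETA_TBS * ai)


ai_std, ai_flash = 32.0, 80.0
speedup = attainable(ai_flash) / attainable(ai_std)
print(round(attainable(ai_std)), round(attainable(ai_flash)), round(speedup, 2))

# Hypothetical kernel past the ridge: the AI ratio is 12.5, the speedup is not.
capped = attainable(400.0) / attainable(ai_std)
print(round(capped, 2))
```

On the memory roof β cancels in the ratio, so the speedup is 80/32 = 2.5. Once the compute roof is reached, further AI gains no longer translate into throughput.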


8. Backward Pass Arithmetic Intensity

The backward pass recomputes S and P on the fly (the "recomputation" trick) rather than storing them from the forward pass. This keeps the memory footprint O(Nd) but adds extra FLOPs:

  • Reload Q,K,V,O,dO
  • Recompute S and P tiles
  • Compute dQ,dK,dV via chain rule

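Total backward FLOPs $\approx 5N^2d$ (causal), HBM traffic $\approx 8Nd$.

$$\text{AI}_{\text{bwd}} = \frac{5N^2d}{8Nd} = \frac{5N}{8}$$

Again, accounting for tile streaming reloads, the effective AI is:

$$\text{AI}_{\text{bwd}}^{\text{effective}} \approx 45\text{–}70 \ \text{FLOP/Byte}$$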

This is slightly lower than the forward pass because more tensors (dQ,dK,dV,dO) must be staged through HBM. Empirically, the backward pass achieves ~85 % of the forward-pass bandwidth utilization.
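To see how far the ideal (no-reload) formulas are from the effective figures, the sketch below evaluates them at one sequence length. The helper name and the choice of N = 4096 are illustrative assumptions:

```python
def ideal_ai(flops_coeff, traffic_coeff, n, d):
    """AI when every tensor crosses HBM exactly once (no tile reloads)."""
    return (flops_coeff * n**2 * d) / (traffic_coeff * n * d)


n, d = 4096, 64
ai_fwd_ideal = ideal_ai(2, 4, n, d)  # forward:  2 N^2 d / 4 N d = N / 2
ai_bwd_ideal = ideal_ai(5, 8, n, d)  # backward: 5 N^2 d / 8 N d = 5 N / 8
print(ai_fwd_ideal, ai_bwd_ideal)    # 2048.0 2560.0
```

Both ideal values are more than an order of magnitude above the effective ranges quoted in this document, so the tile-reload traffic, not the one-time tensor reads and writes, is what determines the kernel's roofline position.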


9. Head-Dimension Scaling on the Roofline

Because AI=O(d), increasing d is the most direct way to move rightward on the roofline:

| head_dim | Effective AI, fwd (FLOP/Byte) | A100 Attainable (TFLOPS) | % Peak | Regime |
|---|---|---|---|---|
| 32 | ~35 | 71 | 23 % | Deep memory-bound |
| 64 | ~70 | 143 | 46 % | Memory-bound |
| 128 | ~140 | 285 | 91 % | Approaching ridge |

At d=128 and large N, CuFlash-Attn would flirt with the A100 ridge point. This is why the official FlashAttention-2 paper reports highest efficiency at d=128 and recommends it for throughput-critical deployments.
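The attainable-performance and percent-of-peak columns follow directly from the roofline formula. A minimal sketch that recomputes them from the effective AI estimates, with A100 figures from Section 2 and names of our own choosing:

```python
BETA_TBS, PI_TFLOPS = 2.039, 312.0  # A100: TB/s and dense FP16 TFLOPS

# head_dim -> effective forward AI estimate (FLOP/Byte) from the table above.
effective_ai = {32: 35, 64: 70, 128: 140}

rows = {}
for d, ai in effective_ai.items():
    tflops = min(PI_TFLOPS, BETA_TBS * ai)
    rows[d] = (round(tflops), round(100 * tflops / PI_TFLOPS))
    print(f"d={d}: {rows[d][0]} TFLOPS, {rows[d][1]} % of peak")
```

All three rows stay below the 312 TFLOPS compute roof, so the `min` never selects the flat part of the roofline at these AI values.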

CuFlash-Attn note: Our current kernel tile sizes are optimized for d=64. Supporting d=128 efficiently requires doubling shared-memory buffers and adjusting Tensor Core MMA shapes. This is tracked as a future optimization.


10. Summary

  1. FlashAttention is fundamentally memory-bound because $\text{AI} = O(d)$ and d is small (32–128).
  2. Tiling raises arithmetic intensity by eliminating $N^2$ HBM round-trips for intermediate S and P matrices.
  3. On A100, CuFlash-Attn achieves ~50 % of theoretical FP16 peak. This is not because compute is wasted, but because the memory roof caps performance at ~163 TFLOPS for $\text{AI} \approx 80$.
  4. The speedup over standard attention (~2.5×) is the ratio of arithmetic intensities, not a compute acceleration.
  5. GPUs with higher bandwidth (H100) or larger SRAM (future architectures) will benefit disproportionately because the kernel is already bandwidth-limited.

For raw latency and speedup numbers, see Benchmarks.
For kernel-level profiling methodology, see the Nsight Compute integration in scripts/profile_roofline.py.

Stable v0.3.0 baseline. Lean CUDA FlashAttention reference.