References

This page collects the bibliographic entries cited throughout the CuFlash-Attn documentation. Entries are grouped by category and, where available, given in BibTeX form for easy inclusion in academic manuscripts, technical reports, and interview preparation materials.



Core Papers

These papers define the algorithmic and architectural foundations on which CuFlash-Attn is built.


FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Citation:

bibtex
@inproceedings{dao2022flashattention,
  title={FlashAttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
  author={Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2022},
  url={https://arxiv.org/abs/2205.14135}
}

Formatted entry:

  • Title: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
  • Authors: Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
  • Venue: NeurIPS 2022
  • Year: 2022
  • URL: https://arxiv.org/abs/2205.14135

Used in: Algorithm, Kernel Deep Dive, Related Work


FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Citation:

bibtex
@inproceedings{dao2024flashattention2,
  title={FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning},
  author={Dao, Tri},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024},
  url={https://arxiv.org/abs/2307.08691}
}

Formatted entry:

  • Title: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
  • Authors: Tri Dao
  • Venue: ICLR 2024
  • Year: 2024
  • URL: https://arxiv.org/abs/2307.08691

Used in: Related Work


Online normalizer calculation for softmax

Citation:

bibtex
@article{milakov2018online,
  title={Online normalizer calculation for softmax},
  author={Milakov, Maxim and Gimelshein, Natalia},
  journal={arXiv preprint arXiv:1805.02867},
  year={2018},
  url={https://arxiv.org/abs/1805.02867}
}

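Formatted entry:

  • Title: Online normalizer calculation for softmax
  • Authors: Maxim Milakov, Natalia Gimelshein
  • Venue: arXiv preprint
  • Year: 2018
  • URL: https://arxiv.org/abs/1805.02867
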
Used in: Algorithm, Kernel Deep Dive, Related Work


Implementation References

These papers and reports describe systems, models, and architectural patterns that contextualize the practical need for memory-efficient attention kernels.


Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Citation:

bibtex
@inproceedings{narayanan2021megatron,
  title={Efficient Large-Scale Language Model Training on {GPU} Clusters Using {Megatron-LM}},
  author={Narayanan, Deepak and Shoeybi, Mohammad and Casper, Jared and LeGresley, Patrick and Patwary, Mostofa and Korthikanti, Vijay and Vainbrand, Dmitri and Kashinkunti, Prethvi and Bernauer, Julie and Catanzaro, Bryan and Phanishayee, Amar and Zaharia, Matei},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC)},
  year={2021},
  url={https://arxiv.org/abs/2104.04473}
}

Formatted entry:

  • Title: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
  • Authors: Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia
  • Venue: SC 2021
  • Year: 2021
  • URL: https://arxiv.org/abs/2104.04473

Used in: Related Work


PagedAttention: vLLM

Citation:

bibtex
@inproceedings{kwon2023pagedattention,
  title={Efficient Memory Management for Large Language Model Serving with {PagedAttention}},
  author={Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph E. and Zhang, Hao and Stoica, Ion},
  booktitle={Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP)},
  year={2023},
  url={https://arxiv.org/abs/2309.06180}
}

Formatted entry:

  • Title: Efficient Memory Management for Large Language Model Serving with PagedAttention
  • Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
  • Venue: SOSP 2023
  • Year: 2023
  • URL: https://arxiv.org/abs/2309.06180

Used in: Related Work


Ring Attention with Blockwise Transformers for Near-Infinite Context

Citation:

bibtex
@article{liu2023ringattention,
  title={Ring Attention with Blockwise Transformers for Near-Infinite Context},
  author={Liu, Hao and Zaharia, Matei and Abbeel, Pieter},
  journal={arXiv preprint arXiv:2310.01889},
  year={2023},
  url={https://arxiv.org/abs/2310.01889}
}

Formatted entry:

  • Title: Ring Attention with Blockwise Transformers for Near-Infinite Context
  • Authors: Hao Liu, Matei Zaharia, Pieter Abbeel
  • Venue: arXiv preprint
  • Year: 2023
  • URL: https://arxiv.org/abs/2310.01889

Used in: Related Work


Multi-Query Attention

Citation:

bibtex
@article{shazeer2019multiquery,
  title={Fast Transformer Decoding: One Write-Head is All You Need},
  author={Shazeer, Noam},
  journal={arXiv preprint arXiv:1911.02150},
  year={2019},
  url={https://arxiv.org/abs/1911.02150}
}

Formatted entry:

  • Title: Fast Transformer Decoding: One Write-Head is All You Need (Multi-Query Attention)
  • Authors: Noam Shazeer
  • Venue: arXiv preprint
  • Year: 2019
  • URL: https://arxiv.org/abs/1911.02150

Used in: Related Work


Grouped-Query Attention

Citation:

bibtex
@article{ainslie2023gqa,
  title={GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints},
  author={Ainslie, Joshua and Lee-Thorp, James and de Jong, Michiel and Zemlyanskiy, Yury and Lebr{\'o}n, Federico and Sanghai, Sumit},
  journal={arXiv preprint arXiv:2305.13245},
  year={2023},
  url={https://arxiv.org/abs/2305.13245}
}

Formatted entry:

  • Title: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  • Authors: Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai
  • Venue: arXiv preprint
  • Year: 2023
  • URL: https://arxiv.org/abs/2305.13245

Used in: Related Work


Performance Optimization References

These works establish the empirical and theoretical grounding for GPU performance engineering decisions made in CuFlash-Attn.


Roofline Model for GPU Performance Analysis

While CuFlash-Attn does not cite a single canonical roofline paper, the roofline analyses in our Performance documentation are grounded in the methodology established by Williams, Waterman, and Patterson (2009), extended to GPU architectures by subsequent NVIDIA and academic works.

Recommended reading:

  • Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4), 65–76. https://doi.org/10.1145/1498765.1498785
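
As a rough illustration of the roofline methodology referenced above, the following sketch computes the attainable throughput bound, min(peak compute, arithmetic intensity × peak bandwidth), and the ridge point where a kernel moves from memory-bound to compute-bound. The function names and the device figures are hypothetical placeholders chosen for round numbers; they are not measurements of any particular GPU and are not taken from the CuFlash-Attn codebase.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound in FLOP/s.

    intensity      -- arithmetic intensity in FLOP per byte moved
    peak_flops     -- compute roof in FLOP/s
    peak_bandwidth -- memory roof in bytes/s
    """
    if intensity < 0:
        raise ValueError("arithmetic intensity must be non-negative")
    # A kernel can be limited by compute or by memory traffic,
    # whichever roof is lower at this intensity.
    return min(peak_flops, intensity * peak_bandwidth)


def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_flops / peak_bandwidth


# Hypothetical device: 100 TFLOP/s compute roof, 1 TB/s memory roof.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 1e12

ridge = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)

# Below the ridge point the kernel is memory-bound,
# above it the kernel is compute-bound.
memory_bound = attainable_flops(10.0, PEAK_FLOPS, PEAK_BANDWIDTH)
compute_bound = attainable_flops(500.0, PEAK_FLOPS, PEAK_BANDWIDTH)

print(f"ridge point: {ridge} FLOP/byte")
print(f"intensity 10 FLOP/byte:  {memory_bound:.3e} FLOP/s")
print(f"intensity 500 FLOP/byte: {compute_bound:.3e} FLOP/s")
```

Comparing a kernel's measured arithmetic intensity against the ridge point indicates which roof to optimize against; real analyses would substitute the profiled figures for the target device.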

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

Citation:

bibtex
@article{jia2018volta,
  title={Dissecting the {NVIDIA} {Volta} {GPU} Architecture via Microbenchmarking},
  author={Jia, Zhe and Maggioni, Marco and Staiger, Benjamin and Scarpazza, Daniele P.},
  journal={arXiv preprint arXiv:1804.06826},
  year={2018},
  url={https://arxiv.org/abs/1804.06826}
}

Formatted entry:

  • Title: Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
  • Authors: Zhe Jia, Marco Maggioni, Benjamin Staiger, Daniele P. Scarpazza
  • Venue: arXiv preprint
  • Year: 2018
  • URL: https://arxiv.org/abs/1804.06826

Note: The shared-memory capacity figures, warp scheduling rules, and L2 cache-line size assumptions used in CuFlash-Attn's kernel design are validated against the microbenchmarking data in this paper and its Ampere/Hopper successors.


CUDA Programming References

These are authoritative NVIDIA documents used to justify device-side programming choices in CuFlash-Attn.


CUDA C++ Programming Guide

Citation:

bibtex
@manual{nvidia2024cudaguide,
  title={{CUDA C++ Programming Guide}},
  author={{NVIDIA Corporation}},
  edition={Version 12.4},
  year={2024},
  url={https://docs.nvidia.com/cuda/cuda-c-programming-guide/}
}

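Formatted entry:

  • Title: CUDA C++ Programming Guide
  • Author: NVIDIA Corporation
  • Edition: Version 12.4
  • Year: 2024
  • URL: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
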
Relevant sections:

  • § 7.22 __launch_bounds__
  • § 7.3 Shared Memory
  • § 7.13 Warp Shuffle Functions
  • § 5.2.3 Vectorized Memory Access

NVIDIA Nsight Compute Documentation

Citation:

bibtex
@manual{nvidia2024nsight,
  title={{NVIDIA Nsight Compute: Kernel Profiling Guide}},
  author={{NVIDIA Corporation}},
  edition={Version 2024.1},
  year={2024},
  url={https://docs.nvidia.com/nsight-compute/}
}

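Formatted entry:

  • Title: NVIDIA Nsight Compute: Kernel Profiling Guide
  • Author: NVIDIA Corporation
  • Edition: Version 2024.1
  • Year: 2024
  • URL: https://docs.nvidia.com/nsight-compute/
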
Relevant counters:

  • gld_transactions, gst_transactions — coalescing verification
  • shared_load_bank_conflict — shared memory conflict detection
  • memory_throughput — bandwidth saturation analysis
  • inst_executed — instruction count reduction for causal-mask skipping

How to Cite CuFlash-Attn

If you use CuFlash-Attn in academic work, software benchmarks, or technical blogging, please cite the project as follows:

bibtex
@software{cuflashattn2024,
  title={{CuFlash-Attn}: From-Scratch {CUDA} {FlashAttention}},
  author={{AICL-Lab}},
  year={2024},
  url={https://github.com/AICL-Lab/cuflash-attn},
  note={Version 0.3.0. Educational CUDA C++ reference implementation of FlashAttention.}
}

Last updated: 2024. This reference list is maintained in sync with the Related Work page. If a paper cited in the documentation is missing from this page, please file a documentation issue.
