Papers & Citations

This page lists the academic papers and open-source projects that inform the design and implementation of TensorCraft-HPC. We encourage readers to consult the original papers for a deeper understanding.

GEMM Optimization

Foundational Papers

CUTLASS Team (NVIDIA). CUTLASS: CUDA Templates for Linear Algebra Subroutines
https://github.com/NVIDIA/cutlass

The primary reference for Tensor Core programming patterns. TensorCraft-HPC's GEMM implementation follows CUTLASS's tiling and pipelining strategies.
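
As a rough illustration of the tiling idea, the sketch below stages square tiles of A and B through shared memory before accumulating. The kernel name, 16x16 tile size, and absence of software pipelining are simplifications for clarity, not the actual TensorCraft-HPC kernels.

```cuda
// Minimal shared-memory tiled SGEMM: C = A * B, row-major.
// A simplified sketch; CUTLASS-style kernels add register tiling,
// Tensor Core MMA instructions, and multi-stage pipelining on top.
constexpr int TILE = 16;

__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Cooperatively stage one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Accumulate the partial product contributed by this tile.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```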

NVIDIA. cuBLAS Documentation
https://docs.nvidia.com/cuda/cublas/

The baseline for performance comparison. All GEMM benchmarks report performance relative to cuBLAS.
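
For context, a cuBLAS baseline measurement commonly looks like the hedged sketch below; the function name, warm-up call, and CUDA-event timing are illustrative of common practice, not the project's actual benchmark harness. Note that cuBLAS expects column-major storage.

```cpp
// Illustrative cuBLAS SGEMM baseline timed with CUDA events (column-major).
#include <cublas_v2.h>
#include <cuda_runtime.h>

float time_cublas_sgemm(const float* dA, const float* dB, float* dC,
                        int M, int N, int K) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up so the timed run excludes one-time setup costs.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, dA, M, dB, K, &beta, dC, M);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, dA, M, dB, K, &beta, dC, M);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    return ms;
}
```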

Tensor Core Programming

NVIDIA. Tensor Core Programming Guide
CUDA C++ Programming Guide

Essential reading for understanding WMMA (Warp Matrix Multiply-Accumulate) operations.
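
For a concrete starting point, the single-warp sketch below multiplies one 16x16x16 FP16 tile with the WMMA API (sm_70 or newer). Production kernels distribute many such fragments across warps and blocks; the kernel name here is illustrative.

```cuda
// One 16x16x16 half-precision MMA via the WMMA API (launch with one warp).
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_single_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```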


Attention Mechanisms

FlashAttention

Tri Dao, Daniel Y. Fu, et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
NeurIPS 2022
arXiv:2205.14135 | GitHub

The foundational paper on memory-efficient attention. TensorCraft-HPC implements the tiling strategy described in this paper.
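
The key mechanism is an online softmax: K and V are streamed in blocks while a running row maximum, a running normalizer, and an unnormalized output accumulator are maintained, so the full attention matrix is never materialized. Below is a host-side C++ sketch for a single query row; the function name, signature, and per-element update (the actual kernel rescales once per block) are illustrative.

```cpp
// Online-softmax attention for one query row, streaming K/V in blocks.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> attention_row(const std::vector<float>& q,               // [d]
                                 const std::vector<std::vector<float>>& K,  // [n][d]
                                 const std::vector<std::vector<float>>& V,  // [n][d]
                                 int block = 64) {
    const int n = static_cast<int>(K.size());
    const int d = static_cast<int>(q.size());
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));

    std::vector<float> out(d, 0.0f);   // unnormalized output accumulator
    float m = -INFINITY;               // running max of the scores
    float l = 0.0f;                    // running softmax denominator

    for (int start = 0; start < n; start += block) {
        const int end = std::min(start + block, n);
        for (int j = start; j < end; ++j) {
            float s = 0.0f;
            for (int k = 0; k < d; ++k) s += q[k] * K[j][k];
            s *= scale;

            const float m_new = std::max(m, s);
            const float corr = std::exp(m - m_new);   // rescale old state
            const float p = std::exp(s - m_new);

            for (int k = 0; k < d; ++k)
                out[k] = out[k] * corr + p * V[j][k];
            l = l * corr + p;
            m = m_new;
        }
    }
    for (int k = 0; k < d; ++k) out[k] /= l;  // final normalization
    return out;
}
```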

Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
ICLR 2024
arXiv:2307.08691

Improves on FlashAttention with additional parallelism over the sequence-length dimension and better work partitioning between warps.

RoPE (Rotary Position Embedding)

Jianlin Su, et al. RoFormer: Enhanced Transformer with Rotary Position Embedding
arXiv:2104.09864
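
RoPE encodes absolute position by rotating consecutive pairs of each query/key vector, so relative position emerges naturally in the attention dot product. A minimal sketch of the rotation (the function name and in-place layout are illustrative; 10000 is the paper's default base):

```cpp
// Apply RoPE in place to one query or key vector (even head_dim).
// Pair (x[i], x[i+1]) is rotated by angle pos * base^(-i/d).
#include <cmath>
#include <vector>

void apply_rope(std::vector<float>& x, int pos, float base = 10000.0f) {
    const int d = static_cast<int>(x.size());
    for (int i = 0; i < d; i += 2) {
        const float theta = std::pow(base, -static_cast<float>(i) / d);
        const float angle = pos * theta;
        const float c = std::cos(angle), s = std::sin(angle);
        const float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;   // standard 2-D rotation
        x[i + 1] = x0 * s + x1 * c;
    }
}
```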

Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. Layer Normalization
arXiv:1607.06450

Biao Zhang, Rico Sennrich. Root Mean Square Layer Normalization
NeurIPS 2019
arXiv:1910.07467

RMSNorm is the normalization layer used in LLaMA and many modern LLMs.
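
RMSNorm drops LayerNorm's mean subtraction and bias, normalizing by the root mean square alone: y_i = g_i * x_i / sqrt(mean(x^2) + eps). A reference sketch for one vector (the function name and eps value are common conventions, not fixed by the paper):

```cpp
// In-place RMSNorm of one vector with a learned per-element gain.
#include <cmath>
#include <vector>

void rmsnorm(std::vector<float>& x, const std::vector<float>& gain,
             float eps = 1e-6f) {
    float ss = 0.0f;
    for (float v : x) ss += v * v;                       // sum of squares
    const float inv_rms = 1.0f / std::sqrt(ss / x.size() + eps);
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = x[i] * inv_rms * gain[i];                 // normalize and scale
}
```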


Quantization

NVIDIA, Arm, and Intel. FP8 Formats for Deep Learning
arXiv:2209.05433

The paper defining the E4M3 and E5M2 FP8 formats used in NVIDIA's Hopper architecture.
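
CUDA exposes both formats through <cuda_fp8.h> (CUDA 11.8 and later). The small round-trip below, with illustrative values, makes the reduced precision visible: E4M3 keeps more mantissa bits for forward activations, while E5M2 trades mantissa for range and is typically used for gradients.

```cpp
// Round-trip a float through the E4M3 and E5M2 FP8 formats.
#include <cuda_fp8.h>
#include <cstdio>

int main() {
    const float x = 0.3333f;
    __nv_fp8_e4m3 q(x);  // quantize: 4 exponent bits, 3 mantissa bits
    __nv_fp8_e5m2 g(x);  // quantize: 5 exponent bits, 2 mantissa bits
    std::printf("e4m3: %f -> %f\n", x, float(q));  // a nearby, coarser value
    std::printf("e5m2: %f -> %f\n", x, float(g));  // coarser still
    return 0;
}
```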

NVIDIA. FP8 Training with NVIDIA Hopper
Transformer Engine Documentation

Sparse Operations

NVIDIA. cuSPARSE Documentation
https://docs.nvidia.com/cuda/cusparse/

NVIDIA. 2:4 Structured Sparsity
CUDA Programming Guide

The Ampere and later architectures support 2:4 structured sparsity, which can deliver up to a 2x math-throughput improvement on sparse Tensor Cores.
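
Concretely, the 2:4 constraint means every group of four consecutive weights holds at most two nonzeros; the hardware then stores only the nonzeros plus 2-bit position metadata. A sketch of the pattern check (the helper name is illustrative):

```cpp
// Verify the 2:4 structured-sparsity constraint on a weight array.
#include <cstddef>

bool is_2to4_sparse(const float* w, std::size_t n) {
    for (std::size_t g = 0; g + 4 <= n; g += 4) {
        int nnz = 0;
        for (int i = 0; i < 4; ++i)
            if (w[g + i] != 0.0f) ++nnz;
        if (nnz > 2) return false;  // more than 2 nonzeros in a group of 4
    }
    return true;
}
```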


Related Projects

| Project        | Description                       | License     |
|----------------|-----------------------------------|-------------|
| CUTLASS        | CUDA Templates for Linear Algebra | BSD-3       |
| FlashAttention | Memory-efficient attention        | BSD-3       |
| xFormers       | Meta's attention kernels          | BSD-3       |
| Triton         | OpenAI's GPU programming language | MIT         |
| cuDNN          | NVIDIA deep learning library      | Proprietary |

Citing TensorCraft-HPC

If you use TensorCraft-HPC in your research or teaching materials, please cite:

```bibtex
@software{tensorcraft-hpc,
  title = {TensorCraft-HPC: Demystifying High-Performance AI Kernels
           with Modern C++ and CUDA},
  author = {LessUp},
  year = {2024},
  url = {https://github.com/LessUp/modern-ai-kernels},
  note = {Header-only C++/CUDA kernel library for learning}
}
```

Released under the Apache 2.0 License.