Papers & Citations

This page lists the academic papers and open-source projects that inform the design and implementation of TensorCraft-HPC. We encourage readers to consult the original papers for a deeper understanding.

GEMM Optimization

Foundational Papers

CUTLASS Team (NVIDIA). CUTLASS: CUDA Templates for Linear Algebra Subroutines
https://github.com/NVIDIA/cutlass

The primary reference for Tensor Core programming patterns. TensorCraft-HPC's GEMM implementation follows CUTLASS's tiling and pipelining strategies.
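
As a rough illustration of the tiling idea, the sketch below stages square tiles of A and B through shared memory before accumulating. The kernel name, 16x16 tile size, and absence of software pipelining are simplifications for clarity, not the actual TensorCraft-HPC kernels.

```cuda
// Minimal shared-memory tiled SGEMM: C = A * B, row-major.
// A simplified sketch; CUTLASS-style kernels add register tiling,
// Tensor Core MMA instructions, and multi-stage pipelining on top.
constexpr int TILE = 16;

__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Cooperatively stage one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Accumulate the partial product contributed by this tile.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```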

NVIDIA. cuBLAS Documentation
https://docs.nvidia.com/cuda/cublas/

The baseline for performance comparison. All GEMM benchmarks report performance relative to cuBLAS.
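
For context, a cuBLAS baseline measurement commonly looks like the hedged sketch below; the function name, warm-up call, and CUDA-event timing are illustrative of common practice, not the project's actual benchmark harness. Note that cuBLAS expects column-major storage.

```cpp
// Illustrative cuBLAS SGEMM baseline timed with CUDA events (column-major).
#include <cublas_v2.h>
#include <cuda_runtime.h>

float time_cublas_sgemm(const float* dA, const float* dB, float* dC,
                        int M, int N, int K) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up so the timed run excludes one-time setup costs.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, dA, M, dB, K, &beta, dC, M);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, dA, M, dB, K, &beta, dC, M);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    return ms;
}
```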

Tensor Core Programming

NVIDIA. Tensor Core Programming Guide
CUDA C++ Programming Guide

Essential reading for understanding WMMA (Warp Matrix Multiply-Accumulate) operations.
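
For a concrete starting point, the single-warp sketch below multiplies one 16x16x16 FP16 tile with the WMMA API (sm_70 or newer). Production kernels distribute many such fragments across warps and blocks; the kernel name here is illustrative.

```cuda
// One 16x16x16 half-precision MMA via the WMMA API (launch with one warp).
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_single_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```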


Attention Mechanisms

FlashAttention

Tri Dao, Daniel Y. Fu, et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
NeurIPS 2022
arXiv:2205.14135 | GitHub

The foundational paper on memory-efficient attention. TensorCraft-HPC implements the tiling strategy described in this paper.
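
The key mechanism is an online softmax: K and V are streamed in blocks while a running row maximum, a running normalizer, and an unnormalized output accumulator are maintained, so the full attention matrix is never materialized. Below is a host-side C++ sketch for a single query row; the function name, signature, and per-element update (the actual kernel rescales once per block) are illustrative.

```cpp
// Online-softmax attention for one query row, streaming K/V in blocks.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> attention_row(const std::vector<float>& q,               // [d]
                                 const std::vector<std::vector<float>>& K,  // [n][d]
                                 const std::vector<std::vector<float>>& V,  // [n][d]
                                 int block = 64) {
    const int n = static_cast<int>(K.size());
    const int d = static_cast<int>(q.size());
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));

    std::vector<float> out(d, 0.0f);   // unnormalized output accumulator
    float m = -INFINITY;               // running max of the scores
    float l = 0.0f;                    // running softmax denominator

    for (int start = 0; start < n; start += block) {
        const int end = std::min(start + block, n);
        for (int j = start; j < end; ++j) {
            float s = 0.0f;
            for (int k = 0; k < d; ++k) s += q[k] * K[j][k];
            s *= scale;

            const float m_new = std::max(m, s);
            const float corr = std::exp(m - m_new);   // rescale old state
            const float p = std::exp(s - m_new);

            for (int k = 0; k < d; ++k)
                out[k] = out[k] * corr + p * V[j][k];
            l = l * corr + p;
            m = m_new;
        }
    }
    for (int k = 0; k < d; ++k) out[k] /= l;  // final normalization
    return out;
}
```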

Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
ICLR 2024
arXiv:2307.08691

Improves on FlashAttention with additional parallelism over the sequence-length dimension and better work partitioning between warps.

RoPE (Rotary Position Embedding)

Jianlin Su, et al. RoFormer: Enhanced Transformer with Rotary Position Embedding
arXiv:2104.09864
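
RoPE encodes absolute position by rotating consecutive pairs of each query/key vector, so relative position emerges naturally in the attention dot product. A minimal sketch of the rotation (the function name and in-place layout are illustrative; 10000 is the paper's default base):

```cpp
// Apply RoPE in place to one query or key vector (even head_dim).
// Pair (x[i], x[i+1]) is rotated by angle pos * base^(-i/d).
#include <cmath>
#include <vector>

void apply_rope(std::vector<float>& x, int pos, float base = 10000.0f) {
    const int d = static_cast<int>(x.size());
    for (int i = 0; i < d; i += 2) {
        const float theta = std::pow(base, -static_cast<float>(i) / d);
        const float angle = pos * theta;
        const float c = std::cos(angle), s = std::sin(angle);
        const float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;   // standard 2-D rotation
        x[i + 1] = x0 * s + x1 * c;
    }
}
```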

Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. Layer Normalization
arXiv:1607.06450

Biao Zhang, Rico Sennrich. Root Mean Square Layer Normalization
NeurIPS 2019
arXiv:1910.07467

RMSNorm is the normalization layer used in LLaMA and many modern LLMs.
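
RMSNorm drops LayerNorm's mean subtraction and bias, normalizing by the root mean square alone: y_i = g_i * x_i / sqrt(mean(x^2) + eps). A reference sketch for one vector (the function name and eps value are common conventions, not fixed by the paper):

```cpp
// In-place RMSNorm of one vector with a learned per-element gain.
#include <cmath>
#include <vector>

void rmsnorm(std::vector<float>& x, const std::vector<float>& gain,
             float eps = 1e-6f) {
    float ss = 0.0f;
    for (float v : x) ss += v * v;                       // sum of squares
    const float inv_rms = 1.0f / std::sqrt(ss / x.size() + eps);
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = x[i] * inv_rms * gain[i];                 // normalize and scale
}
```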


Quantization

NVIDIA, Arm, and Intel. FP8 Formats for Deep Learning
arXiv:2209.05433

The paper defining the E4M3 and E5M2 FP8 formats used in NVIDIA's Hopper architecture.
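
CUDA exposes both formats through <cuda_fp8.h> (CUDA 11.8 and later). The small round-trip below, with illustrative values, makes the reduced precision visible: E4M3 keeps more mantissa bits for forward activations, while E5M2 trades mantissa for range and is typically used for gradients.

```cpp
// Round-trip a float through the E4M3 and E5M2 FP8 formats.
#include <cuda_fp8.h>
#include <cstdio>

int main() {
    const float x = 0.3333f;
    __nv_fp8_e4m3 q(x);  // quantize: 4 exponent bits, 3 mantissa bits
    __nv_fp8_e5m2 g(x);  // quantize: 5 exponent bits, 2 mantissa bits
    std::printf("e4m3: %f -> %f\n", x, float(q));  // a nearby, coarser value
    std::printf("e5m2: %f -> %f\n", x, float(g));  // coarser still
    return 0;
}
```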

NVIDIA. FP8 Training with NVIDIA Hopper
Transformer Engine Documentation

Sparse Operations

NVIDIA. cuSPARSE Documentation
https://docs.nvidia.com/cuda/cusparse/

NVIDIA. 2:4 Structured Sparsity
CUDA Programming Guide

The Ampere and later architectures support 2:4 structured sparsity, which can deliver up to a 2x math-throughput improvement on sparse Tensor Cores.
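
Concretely, the 2:4 constraint means every group of four consecutive weights holds at most two nonzeros; the hardware then stores only the nonzeros plus 2-bit position metadata. A sketch of the pattern check (the helper name is illustrative):

```cpp
// Verify the 2:4 structured-sparsity constraint on a weight array.
#include <cstddef>

bool is_2to4_sparse(const float* w, std::size_t n) {
    for (std::size_t g = 0; g + 4 <= n; g += 4) {
        int nnz = 0;
        for (int i = 0; i < 4; ++i)
            if (w[g + i] != 0.0f) ++nnz;
        if (nnz > 2) return false;  // more than 2 nonzeros in a group of 4
    }
    return true;
}
```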


Related Projects

| Project        | Description                       | License     |
|----------------|-----------------------------------|-------------|
| CUTLASS        | CUDA Templates for Linear Algebra | BSD-3       |
| FlashAttention | Memory-efficient attention        | BSD-3       |
| xFormers       | Meta's attention kernels          | BSD-3       |
| Triton         | OpenAI's GPU programming language | MIT         |
| cuDNN          | NVIDIA deep learning library      | Proprietary |

Citing TensorCraft-HPC

If you use TensorCraft-HPC in your research or teaching materials, please cite:

```bibtex
@software{tensorcraft-hpc,
  title = {TensorCraft-HPC: Demystifying High-Performance AI Kernels
           with Modern C++ and CUDA},
  author = {LessUp},
  year = {2024},
  url = {https://github.com/LessUp/modern-ai-kernels},
  note = {Header-only C++/CUDA kernel library for learning}
}
```

Released under the Apache 2.0 License.