Papers & Citations
This page lists the academic papers and open-source projects that inform the design and implementation of TensorCraft-HPC. We encourage users to read the original papers for deeper understanding.
GEMM Optimization
Foundational Papers
The primary reference for Tensor Core programming patterns. TensorCraft-HPC's GEMM implementation follows CUTLASS's tiling and pipelining strategies.
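As a rough illustration of the tiling idea, each thread block stages tiles of A and B in shared memory and accumulates partial products. This is a minimal sketch under simplifying assumptions (FP32, no pipelining or Tensor Cores, dimensions divisible by the tile size); the kernel and tile names are made up for this example and are not TensorCraft-HPC's actual implementation:

```cuda
// Minimal shared-memory tiled GEMM sketch (FP32, TILE x TILE tiles).
// Illustrative only: no double buffering, no Tensor Cores,
// and it assumes M, N, K are multiples of TILE.
#define TILE 32

__global__ void tiled_gemm(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // row of C this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;   // column of C this thread owns
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Stage one tile of A and one tile of B in shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * K + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();

        // Accumulate the partial product for this K-slice.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

CUTLASS generalizes this pattern with multi-stage software pipelining so that global-memory loads for the next tile overlap with the math on the current one.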
The baseline for performance comparison. All GEMM benchmarks report performance relative to cuBLAS.
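For context, an FP16 GEMM baseline with FP32 accumulation is typically obtained through `cublasGemmEx`. This is a hedged sketch of the API shape only, not the exact call used in TensorCraft-HPC's benchmarks; the wrapper function name is made up:

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: C = alpha * A * B + beta * C with half inputs and float accumulation.
// Assumes d_A, d_B, d_C are device pointers of suitable sizes; error checks omitted.
void cublas_baseline(cublasHandle_t handle, const __half* d_A, const __half* d_B,
                     __half* d_C, int M, int N, int K) {
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS is column-major; a real benchmark with row-major data usually
    // swaps operands accordingly. This call only shows the parameters involved.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 M, N, K,
                 &alpha,
                 d_A, CUDA_R_16F, M,
                 d_B, CUDA_R_16F, K,
                 &beta,
                 d_C, CUDA_R_16F, M,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```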
Tensor Core Programming
Essential reading for understanding WMMA (Warp Matrix Multiply-Accumulate) operations.
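A minimal WMMA example using the `nvcuda::wmma` API (requires sm_70 or newer): a single warp multiplies one 16x16x16 tile. The kernel name is made up for this sketch; real kernels tile this over many warps and K-slices.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes C = A * B for a single 16x16x16 tile
// (row-major A and C, column-major B). Illustration only.
__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core MMA
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```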
Attention Mechanisms
FlashAttention
NeurIPS 2022
arXiv:2205.14135 | GitHub
The foundational paper on memory-efficient attention. TensorCraft-HPC implements the tiling strategy described in this paper.
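The heart of the tiling strategy is the online-softmax rescaling: key/value blocks are streamed through on-chip memory while a running row maximum $m$, normalizer $\ell$, and output accumulator $O$ are updated, so the full attention matrix is never materialized. One common formulation (the unnormalized-accumulator variant), for a single query row with scores $S_j = q K_j^\top / \sqrt{d}$ against block $j$ and with $m^{(0)} = -\infty$, $\ell^{(0)} = 0$, $O^{(0)} = 0$:

$$
\begin{aligned}
m^{(j)} &= \max\bigl(m^{(j-1)},\ \max(S_j)\bigr) \\
\ell^{(j)} &= e^{\,m^{(j-1)} - m^{(j)}}\,\ell^{(j-1)} + \textstyle\sum_k e^{\,S_{j,k} - m^{(j)}} \\
O^{(j)} &= e^{\,m^{(j-1)} - m^{(j)}}\,O^{(j-1)} + \textstyle\sum_k e^{\,S_{j,k} - m^{(j)}}\, V_{j,k}
\end{aligned}
$$

After the last block, the output is $O / \ell$, which equals the exact softmax-weighted sum.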
FlashAttention-2
ICLR 2024
arXiv:2307.08691
Improved parallelism strategies for attention computation.
RoPE (Rotary Position Embedding)
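For reference, rotary embeddings rotate each consecutive pair of query/key features by a position-dependent angle. A sketch of the standard formulation (base 10000, head dimension $d$, token position $m$):

$$
\theta_i = 10000^{-2i/d},\qquad
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}
$$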
Normalization
RMSNorm is the normalization layer used in LLaMA and many modern LLMs.
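For reference, RMSNorm scales by the root mean square of the activations instead of subtracting a mean ($\gamma$ is the learned gain, $\epsilon$ a small constant, $d$ the hidden size):

$$
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^2 + \epsilon}}\;\gamma_i
$$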
Quantization
The paper defining the E4M3 and E5M2 FP8 formats used by the Hopper architecture.
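Both formats are 8 bits wide: E4M3 spends 4 bits on the exponent and 3 on the mantissa (finer precision, maximum normal value 448), while E5M2 spends 5 on the exponent and 2 on the mantissa (wider range, maximum 57344). A small host-side sketch, assuming the `cuda_fp8.h` types introduced in CUDA 11.8 are available:

```cpp
#include <cstdio>
#include <cuda_fp8.h>   // CUDA 11.8+ FP8 types (assumed available)

int main() {
    float x = 3.14159f;
    __nv_fp8_e4m3 a(x);   // 1 sign + 4 exponent + 3 mantissa bits
    __nv_fp8_e5m2 b(x);   // 1 sign + 5 exponent + 2 mantissa bits
    // Round-trip back to float to see each format's quantization error.
    printf("e4m3: %f  e5m2: %f\n", float(a), float(b));
    return 0;
}
```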
Sparse Operations
The Ampere architecture supports 2:4 structured sparsity for up to a 2x throughput improvement.
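The 2:4 pattern means that in every group of four consecutive elements, at most two are nonzero. A minimal sketch of magnitude-based 2:4 pruning (illustration only, not the cuSPARSELt workflow; the function name is made up):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Prune a row to the 2:4 pattern: in each group of 4 values,
// zero out the two with the smallest magnitude. len must be a multiple of 4.
void prune_2_4(float* row, std::size_t len) {
    for (std::size_t g = 0; g + 4 <= len; g += 4) {
        std::size_t idx[4] = {g, g + 1, g + 2, g + 3};
        // Order the group's indices by descending magnitude.
        std::sort(idx, idx + 4, [&](std::size_t a, std::size_t b) {
            return std::fabs(row[a]) > std::fabs(row[b]);
        });
        row[idx[2]] = 0.0f;   // drop the two smallest
        row[idx[3]] = 0.0f;
    }
}
```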
Related Projects
| Project | Description | License |
|---|---|---|
| CUTLASS | CUDA Templates for Linear Algebra Subroutines | BSD-3 |
| FlashAttention | Memory-efficient attention | BSD-3 |
| xFormers | Meta's memory-efficient attention and Transformer building blocks | BSD-3 |
| Triton | OpenAI's GPU programming language | MIT |
| cuDNN | NVIDIA's deep learning primitives library | Proprietary |
Citing TensorCraft-HPC
If you use TensorCraft-HPC in your research or teaching materials, please cite:
@software{tensorcraft-hpc,
  title  = {TensorCraft-HPC: Demystifying High-Performance AI Kernels
            with Modern C++ and CUDA},
  author = {LessUp},
  year   = {2024},
  url    = {https://github.com/LessUp/modern-ai-kernels},
  note   = {Header-only C++/CUDA kernel library for learning}
}