- 🎓 **Educational Design**: Progressive optimization paths from naive kernels to Tensor Cores, with clear annotations at every step.
- ⚡ **92% of cuBLAS Performance**: FP16 GEMM reaches 92% of cuBLAS throughput on A100 with full Tensor Core utilization.
- 🔧 **Header-Only Architecture**: Zero build complexity; just include the headers. Optional Python bindings are available via `pip install`.
- 🖥️ **Multi-Architecture Support**: Compile-time feature detection for SM70–SM100, covering Volta through Blackwell.