- 🎓 **Educational Design**: Progressive optimization paths from naive kernels to Tensor Cores, with clear annotations at every step.
- ⚡ **92% of cuBLAS Performance**: FP16 GEMM reaches 92% of cuBLAS throughput on A100 with full Tensor Core utilization.
- 🔧 **Header-Only Architecture**: Zero build complexity; just include the headers. Optional Python bindings are available via `pip install`.
- 🖥️ **Multi-Architecture Support**: Compile-time feature detection for SM70–SM100, covering Volta through Blackwell.