Technical Whitepaper
::: abstract Abstract
TensorCraft-HPC is a header-only C++/CUDA library designed for learning high-performance AI kernel implementation. This whitepaper presents the architectural decisions, optimization strategies, and performance analysis that guide the project. Our goal is to demystify GPU kernel development by providing clear, progressive optimization paths from naive implementations to production-grade performance.
Key Results
- 92% of cuBLAS performance on FP16 GEMM with Tensor Cores
- 85% of cuDNN performance on FlashAttention
- Support for NVIDIA SM70-SM100 architectures
- Zero build complexity via header-only design
:::
Executive Summary
Modern AI systems depend critically on high-performance GPU kernels for operations like matrix multiplication, attention, and normalization. However, the path from understanding the math to achieving production-grade performance is often obscured by complexity.
TensorCraft-HPC addresses this gap by:
- Explicit Progression: Each kernel evolves through well-defined optimization stages
- Educational Clarity: Code is optimized for readability, not just performance
- OpenSpec Governance: Specifications drive implementation, ensuring correctness
Project Philosophy
Why This Project Exists
The CUDA ecosystem has excellent production libraries (cuBLAS, cuDNN, CUTLASS), but they are optimized for deployment, not learning. When a developer asks "How do I write an efficient GEMM kernel?", the answer often points to thousands of lines of template metaprogramming.
TensorCraft-HPC provides an alternative: kernels that start simple and evolve, with each optimization step justified and explained.
Design Principles
| Principle | Implication |
|---|---|
| Readability First | Code comments explain why, not just what |
| Progressive Complexity | Each stage is a complete, working kernel |
| Specification-Driven | OpenSpec files define contracts before implementation |
| Zero Build Friction | Header-only for C++, optional pip for Python |
Core Contributions
1. Progressive Optimization Framework
Every kernel follows a documented optimization path:
```
Naive → Tiled → Double Buffer → Tensor Core → Production Parity
```

Each stage (the first two are sketched after this list):
- Is a complete, testable implementation
- Has clear performance characteristics
- Demonstrates specific optimization techniques
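To make the progression concrete, here is a minimal sketch of what the first two stages can look like for a row-major FP32 GEMM (C = A × B). The kernel names, tile size, and bounds handling are illustrative and not taken from the library's actual sources.

```cpp
#include <cuda_runtime.h>

// Stage 1 — Naive: one thread per output element, every operand read
// straight from global memory.
__global__ void gemm_naive(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Stage 2 — Tiled: each block stages TILE x TILE sub-matrices of A and B
// in shared memory, so every global load is reused TILE times.
// Launch as: dim3 block(TILE, TILE); dim3 grid((N+TILE-1)/TILE, (M+TILE-1)/TILE).
constexpr int TILE = 32;

__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Cooperative load of one tile of A and one tile of B (zero-pad out of range).
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

The tiled variant reuses each global load TILE times from shared memory, which is the main lever behind the jump in arithmetic intensity between the first two stages; double buffering and Tensor Cores build on the same blocking structure.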
2. Multi-Architecture Support
Compile-time feature detection selects the appropriate code path for each target architecture:

```cpp
#if TENSORCRAFT_HAS_WMMA
// Tensor Core path (SM70+)
#elif TENSORCRAFT_HAS_FP8
// FP8 path (SM90+)
#else
// Fallback path
#endif
```
3. OpenSpec Workflow

Specifications in `openspec/specs/` define:
- Requirements: What the component must do
- Contracts: API guarantees and invariants
- Acceptance Criteria: How to verify compliance (an illustrative check is sketched below)
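As an illustration (not the project's actual spec), an acceptance criterion for the GEMM kernel could reduce to a numeric check against a straightforward CPU reference. The function name, row-major layout, and tolerance below are assumptions.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical acceptance check: verify a GEMM result against a naive CPU
// reference, C_ref = A * B, within a tolerance that leaves headroom for
// FP16/FP32 accumulation differences. Row-major layout assumed.
bool check_gemm(const float* A, const float* B, const float* C,
                std::size_t M, std::size_t N, std::size_t K,
                float tol = 1e-2f) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                ref += A[i * K + k] * B[k * N + j];
            // Mixed absolute/relative tolerance check.
            if (std::fabs(C[i * N + j] - ref) > tol * (std::fabs(ref) + 1.0f))
                return false;
        }
    }
    return true;
}
```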
Target Audience
This whitepaper is intended for:
- GPU Kernel Developers seeking to understand optimization techniques
- ML Infrastructure Engineers evaluating kernel implementations
- Researchers studying high-performance computing patterns
- Students learning CUDA programming
Document Structure
| Section | Content |
|---|---|
| Architecture | System design, layering, and extension points |
| Performance | Benchmarking methodology and analysis |
| Methodology | OpenSpec workflow and contribution guidelines |
Quick Start
```bash
git clone https://github.com/LessUp/modern-ai-kernels.git
cd modern-ai-kernels
```

C++ (header-only):

```cpp
#include "tensorcraft/kernels/gemm.hpp"

tensorcraft::FloatTensor A({4096, 4096});
tensorcraft::FloatTensor B({4096, 4096});
tensorcraft::FloatTensor C({4096, 4096});
tensorcraft::kernels::gemm(A.data(), B.data(), C.data(), 4096, 4096, 4096);
```

Python bindings:

```python
import tensorcraft_ops as tc
import numpy as np

A = np.random.randn(4096, 4096).astype(np.float16)
B = np.random.randn(4096, 4096).astype(np.float16)
C = tc.gemm(A, B)  # GPU-accelerated
```
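For a quick sanity check on throughput (the full benchmarking methodology is covered in the Performance section), the C++ example above can be timed with CUDA events. The sketch below is illustrative and reuses the FloatTensor/gemm calls from the snippet; it is not the project's benchmark harness.

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include "tensorcraft/kernels/gemm.hpp"

int main() {
    const int N = 4096;
    tensorcraft::FloatTensor A({N, N}), B({N, N}), C({N, N});

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the timed run excludes one-time initialization costs.
    tensorcraft::kernels::gemm(A.data(), B.data(), C.data(), N, N, N);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    tensorcraft::kernels::gemm(A.data(), B.data(), C.data(), N, N, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // A GEMM of this shape performs 2 * N^3 floating-point operations.
    double tflops = 2.0 * N * N * double(N) / (ms * 1e-3) / 1e12;
    std::printf("GEMM: %.3f ms, %.2f TFLOP/s\n", ms, tflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```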
Citation

If you reference TensorCraft-HPC in academic work:
```bibtex
@software{tensorcraft-hpc,
  title  = {TensorCraft-HPC: Demystifying High-Performance AI Kernels
            with Modern C++ and CUDA},
  author = {LessUp},
  year   = {2024},
  url    = {https://github.com/LessUp/modern-ai-kernels}
}
```