Technical Whitepaper

::: abstract Abstract

TensorCraft-HPC is a header-only C++/CUDA library designed for learning high-performance AI kernel implementation. This whitepaper presents the architectural decisions, optimization strategies, and performance analysis that guide the project. Our goal is to demystify GPU kernel development by providing clear, progressive optimization paths from naive implementations to production-grade performance.

Key Results

  • 92% of cuBLAS performance on FP16 GEMM with Tensor Cores
  • 85% of cuDNN performance on FlashAttention
  • Support for NVIDIA SM70-SM100 architectures
  • Zero build complexity via header-only design

:::

Executive Summary

Modern AI systems depend critically on high-performance GPU kernels for operations like matrix multiplication, attention, and normalization. However, the path from understanding the math to achieving production-grade performance is often obscured by complexity.

TensorCraft-HPC addresses this gap by:

  1. Explicit Progression: Each kernel evolves through well-defined optimization stages
  2. Educational Clarity: Code is optimized for readability, not just performance
  3. OpenSpec Governance: Specifications drive implementation, ensuring correctness

Project Philosophy

Why This Project Exists

The CUDA ecosystem has excellent production libraries (cuBLAS, cuDNN, CUTLASS), but they are optimized for deployment, not learning. When a developer asks "How do I write an efficient GEMM kernel?", the answer often points to thousands of lines of template metaprogramming.

TensorCraft-HPC provides an alternative: kernels that start simple and evolve, with each optimization step justified and explained.

Design Principles

| Principle | Implication |
| --- | --- |
| Readability First | Code comments explain why, not just what |
| Progressive Complexity | Each stage is a complete, working kernel |
| Specification-Driven | OpenSpec files define contracts before implementation |
| Zero Build Friction | Header-only for C++, optional pip for Python |

Core Contributions

1. Progressive Optimization Framework

Every kernel follows a documented optimization path:

Naive → Tiled → Double Buffer → Tensor Core → Production Parity

Each stage:

  • Is a complete, testable implementation
  • Has clear performance characteristics
  • Demonstrates specific optimization techniques
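The jump from Naive to Tiled is the canonical first step on this path. As a rough illustration (a CPU cache analogue under assumed names and tile size, not the library's actual kernel), blocking the loops lets each tile of A and B be reused many times from fast memory, which is the same idea shared-memory tiling exploits on the GPU:

```cpp
// Illustrative CPU analogue of the "Tiled" stage. TILE and gemm_tiled are
// hypothetical names for this sketch, not TensorCraft-HPC's API. Each
// TILE x TILE block of A and B stays cache-resident (on a GPU: shared
// memory) while it is reused across the inner loops.
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 32;

void gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t N) {
    for (std::size_t i0 = 0; i0 < N; i0 += TILE)
        for (std::size_t k0 = 0; k0 < N; k0 += TILE)
            for (std::size_t j0 = 0; j0 < N; j0 += TILE)
                // Multiply one tile pair; operands are reused TILE times each.
                for (std::size_t i = i0; i < i0 + TILE && i < N; ++i)
                    for (std::size_t k = k0; k < k0 + TILE && k < N; ++k)
                        for (std::size_t j = j0; j < j0 + TILE && j < N; ++j)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```

The later stages (double buffering, Tensor Cores) keep this blocked structure and change how tiles are staged and multiplied.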

2. Multi-Architecture Support

Compile-time feature detection enables:

```cpp
#if TENSORCRAFT_HAS_WMMA
    // Tensor Core path (SM70+)
#elif TENSORCRAFT_HAS_FP8
    // FP8 path (SM90+)
#else
    // Fallback path
#endif
```
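Feature macros like these would typically be derived from `__CUDA_ARCH__` at compile time. The sketch below shows one hypothetical way to define them (the library's real definitions may differ). Compiled for the host, where `__CUDA_ARCH__` is undefined, both macros evaluate to 0 and the fallback path is selected:

```cpp
// Hypothetical derivation of the feature macros from __CUDA_ARCH__;
// TensorCraft-HPC's actual definitions may differ. SM70 maps to
// __CUDA_ARCH__ >= 700, SM90 to >= 900.
#include <string>

#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
#  define TENSORCRAFT_HAS_WMMA 1
#else
#  define TENSORCRAFT_HAS_WMMA 0
#endif
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
#  define TENSORCRAFT_HAS_FP8 1
#else
#  define TENSORCRAFT_HAS_FP8 0
#endif

std::string selected_path() {
#if TENSORCRAFT_HAS_WMMA
    return "wmma";      // Tensor Core path (SM70+)
#elif TENSORCRAFT_HAS_FP8
    return "fp8";       // FP8 path (SM90+)
#else
    return "fallback";  // Portable path for older architectures
#endif
}
```

Because the branch is resolved by the preprocessor, unused paths cost nothing at runtime and never need to compile for architectures that lack the required instructions.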

3. OpenSpec Workflow

Specifications in openspec/specs/ define:

  • Requirements: What the component must do
  • Contracts: API guarantees and invariants
  • Acceptance Criteria: How to verify compliance
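As a hypothetical sketch only (the project's actual OpenSpec file format and field names may differ), a spec covering these three parts might look like:

```markdown
# Spec: gemm  (illustrative sketch, not the project's actual file)

## Requirements
- Compute C = A x B for row-major FP16/FP32 matrices.

## Contracts
- Inputs are device pointers with shapes (M, K), (K, N), (M, N).
- No hidden synchronization; the caller owns the stream.

## Acceptance Criteria
- Maximum relative error vs. a cuBLAS reference within FP16 tolerance.
- All stages of the kernel pass the same test suite.
```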

Target Audience

This whitepaper is intended for:

  • GPU Kernel Developers seeking to understand optimization techniques
  • ML Infrastructure Engineers evaluating kernel implementations
  • Researchers studying high-performance computing patterns
  • Students learning CUDA programming

Document Structure

| Section | Content |
| --- | --- |
| Architecture | System design, layering, and extension points |
| Performance | Benchmarking methodology and analysis |
| Methodology | OpenSpec workflow and contribution guidelines |

Quick Start

```bash
git clone https://github.com/LessUp/modern-ai-kernels.git
cd modern-ai-kernels
```

C++ (header-only):

```cpp
#include "tensorcraft/kernels/gemm.hpp"

tensorcraft::FloatTensor A({4096, 4096});
tensorcraft::FloatTensor B({4096, 4096});
tensorcraft::FloatTensor C({4096, 4096});

tensorcraft::kernels::gemm(A.data(), B.data(), C.data(), 4096, 4096, 4096);
```

Python:

```python
import tensorcraft_ops as tc
import numpy as np

A = np.random.randn(4096, 4096).astype(np.float16)
B = np.random.randn(4096, 4096).astype(np.float16)
C = tc.gemm(A, B)  # GPU-accelerated
```

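A GEMM binding like the one above is usually validated against a NumPy reference. The helper below is a generic check pattern using only NumPy (`check_gemm` is a name invented for this sketch; with the library installed, one would pass `tc.gemm` as `gemm_fn`):

```python
import numpy as np

def check_gemm(gemm_fn, n=256, rtol=1e-2, atol=1e-2):
    """Validate a GEMM implementation against a float32 NumPy reference.

    FP16 inputs are upcast for the reference so that the (loose) tolerances
    only have to absorb the kernel's own accumulation error.
    """
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n)).astype(np.float16)
    B = rng.standard_normal((n, n)).astype(np.float16)
    ref = np.matmul(A.astype(np.float32), B.astype(np.float32))
    out = np.asarray(gemm_fn(A, B), dtype=np.float32)
    return np.allclose(out, ref, rtol=rtol, atol=atol)

# With the library installed: check_gemm(tc.gemm)
```

The tolerances here are illustrative; acceptable error bounds for FP16 accumulation depend on problem size and the kernel's accumulator precision.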
Citation

If you reference TensorCraft-HPC in academic work:

```bibtex
@software{tensorcraft-hpc,
  title = {TensorCraft-HPC: Demystifying High-Performance AI Kernels
           with Modern C++ and CUDA},
  author = {LessUp},
  year = {2024},
  url = {https://github.com/LessUp/modern-ai-kernels}
}
```

Released under the Apache 2.0 License.