Technical Whitepaper
::: abstract Abstract
TensorCraft-HPC is a header-only C++/CUDA library designed for learning high-performance AI kernel implementation. This whitepaper presents the architectural decisions, optimization strategies, and performance analysis that guide the project. Our goal is to demystify GPU kernel development by providing clear, progressive optimization paths from naive implementations to production-grade performance.
Key Results
- 92% of cuBLAS performance on FP16 GEMM with Tensor Cores
- 85% of cuDNN performance on FlashAttention
- Support for NVIDIA SM70-SM100 architectures
- Zero build complexity via header-only design
:::
Executive Summary
Modern AI systems depend critically on high-performance GPU kernels for operations like matrix multiplication, attention, and normalization. However, the path from understanding the math to achieving production-grade performance is often obscured by complexity.
TensorCraft-HPC addresses this gap by:
- Explicit Progression: Each kernel evolves through well-defined optimization stages
- Educational Clarity: Code is optimized for readability, not just performance
- OpenSpec Governance: Specifications drive implementation, ensuring correctness
Project Philosophy
Why This Project Exists
The CUDA ecosystem has excellent production libraries (cuBLAS, cuDNN, CUTLASS), but they are optimized for deployment, not learning. When a developer asks "How do I write an efficient GEMM kernel?", the answer often points to thousands of lines of template metaprogramming.
TensorCraft-HPC provides an alternative: kernels that start simple and evolve, with each optimization step justified and explained.
Design Principles
| Principle | Implication |
|---|---|
| Readability First | Code comments explain why, not just what |
| Progressive Complexity | Each stage is a complete, working kernel |
| Specification-Driven | OpenSpec files define contracts before implementation |
| Zero Build Friction | Header-only for C++, optional pip for Python |
Core Contributions
1. Progressive Optimization Framework
Every kernel follows a documented optimization path:
```
Naive → Tiled → Double Buffer → Tensor Core → Production Parity
```

Each stage (the first two are sketched after this list):
- Is a complete, testable implementation
- Has clear performance characteristics
- Demonstrates specific optimization techniques
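To make the progression concrete, here is a minimal sketch of what the first two stages can look like for a row-major FP32 GEMM (C = A × B). The kernel names, tile size, and bounds handling are illustrative and not taken from the library's actual sources.

```cpp
#include <cuda_runtime.h>

// Stage 1 — Naive: one thread per output element, every operand read
// straight from global memory.
__global__ void gemm_naive(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Stage 2 — Tiled: each block stages TILE x TILE sub-matrices of A and B
// in shared memory, so every global load is reused TILE times.
// Launch as: dim3 block(TILE, TILE); dim3 grid((N+TILE-1)/TILE, (M+TILE-1)/TILE).
constexpr int TILE = 32;

__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Cooperative load of one tile of A and one tile of B (zero-pad out of range).
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

The tiled variant reuses each global load TILE times from shared memory, which is the main lever behind the jump in arithmetic intensity between the first two stages; double buffering and Tensor Cores build on the same blocking structure.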
2. Multi-Architecture Support
Compile-time feature detection selects the appropriate code path for each target architecture:

```cpp
#if TENSORCRAFT_HAS_WMMA
// Tensor Core path (SM70+)
#elif TENSORCRAFT_HAS_FP8
// FP8 path (SM90+)
#else
// Fallback path
#endif
```
3. OpenSpec Workflow

Specifications in `openspec/specs/` define:
- Requirements: What the component must do
- Contracts: API guarantees and invariants
- Acceptance Criteria: How to verify compliance (an illustrative check is sketched below)
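As an illustration (not the project's actual spec), an acceptance criterion for the GEMM kernel could reduce to a numeric check against a straightforward CPU reference. The function name, row-major layout, and tolerance below are assumptions.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical acceptance check: verify a GEMM result against a naive CPU
// reference, C_ref = A * B, within a tolerance that leaves headroom for
// FP16/FP32 accumulation differences. Row-major layout assumed.
bool check_gemm(const float* A, const float* B, const float* C,
                std::size_t M, std::size_t N, std::size_t K,
                float tol = 1e-2f) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                ref += A[i * K + k] * B[k * N + j];
            // Mixed absolute/relative tolerance check.
            if (std::fabs(C[i * N + j] - ref) > tol * (std::fabs(ref) + 1.0f))
                return false;
        }
    }
    return true;
}
```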
Target Audience
This whitepaper is intended for:
- GPU Kernel Developers seeking to understand optimization techniques
- ML Infrastructure Engineers evaluating kernel implementations
- Researchers studying high-performance computing patterns
- Students learning CUDA programming
Document Structure
| Section | Content |
|---|---|
| Architecture | System design, layering, and extension points |
| Performance | Benchmarking methodology and analysis |
| Methodology | OpenSpec workflow and contribution guidelines |
Quick Start
```bash
git clone https://github.com/LessUp/modern-ai-kernels.git
cd modern-ai-kernels
```

C++ (header-only):

```cpp
#include "tensorcraft/kernels/gemm.hpp"

tensorcraft::FloatTensor A({4096, 4096});
tensorcraft::FloatTensor B({4096, 4096});
tensorcraft::FloatTensor C({4096, 4096});
tensorcraft::kernels::gemm(A.data(), B.data(), C.data(), 4096, 4096, 4096);
```

Python bindings:

```python
import tensorcraft_ops as tc
import numpy as np

A = np.random.randn(4096, 4096).astype(np.float16)
B = np.random.randn(4096, 4096).astype(np.float16)
C = tc.gemm(A, B)  # GPU-accelerated
```
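For a quick sanity check on throughput (the full benchmarking methodology is covered in the Performance section), the C++ example above can be timed with CUDA events. The sketch below is illustrative and reuses the FloatTensor/gemm calls from the snippet; it is not the project's benchmark harness.

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include "tensorcraft/kernels/gemm.hpp"

int main() {
    const int N = 4096;
    tensorcraft::FloatTensor A({N, N}), B({N, N}), C({N, N});

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the timed run excludes one-time initialization costs.
    tensorcraft::kernels::gemm(A.data(), B.data(), C.data(), N, N, N);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    tensorcraft::kernels::gemm(A.data(), B.data(), C.data(), N, N, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // A GEMM of this shape performs 2 * N^3 floating-point operations.
    double tflops = 2.0 * N * N * double(N) / (ms * 1e-3) / 1e12;
    std::printf("GEMM: %.3f ms, %.2f TFLOP/s\n", ms, tflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```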
Citation

If you reference TensorCraft-HPC in academic work:
```bibtex
@software{tensorcraft-hpc,
  title  = {TensorCraft-HPC: Demystifying High-Performance AI Kernels
            with Modern C++ and CUDA},
  author = {LessUp},
  year   = {2024},
  url    = {https://github.com/LessUp/modern-ai-kernels}
}
```