# Architecture Overview
This document describes the high-level architecture of TensorCraft-HPC.
## Design Philosophy
TensorCraft-HPC follows three core principles:
- Readability First — Code is meant to be read. Each kernel shows the optimization progression.
- Header-Only — Zero build complexity for C++ users. Just include and go.
- OpenSpec-Driven — Specifications in `openspec/specs/` are the source of truth.
## System Architecture
### Directory Structure

```
modern-ai-kernels/
├── include/tensorcraft/       # Header-only library
│   ├── core/                  # Utilities (error handling, type traits)
│   │   ├── cuda_check.hpp     # CUDA error checking macros
│   │   ├── features.hpp       # Compile-time GPU feature detection
│   │   ├── type_traits.hpp    # Type manipulation utilities
│   │   └── warp_utils.hpp     # Warp-level primitives
│   ├── memory/                # Memory management
│   │   ├── tensor.hpp         # RAII GPU tensor wrapper
│   │   ├── memory_pool.hpp    # Optional memory pooling
│   │   └── aligned_vector.hpp # Cache-aligned vectors
│   └── kernels/               # All compute kernels
│       ├── gemm.hpp           # Matrix multiplication
│       ├── attention.hpp      # Attention mechanisms
│       ├── normalization.hpp  # LayerNorm, RMSNorm, etc.
│       ├── softmax.hpp        # Softmax variants
│       ├── conv2d.hpp         # 2D convolution
│       ├── sparse.hpp         # Sparse operations
│       ├── fusion.hpp         # Fused kernels
│       ├── elementwise.hpp    # ReLU, GeLU, etc.
│       ├── memory_ops.hpp     # Copy, transpose
│       └── quantization.hpp   # INT8/FP8 quantization
├── src/python_ops/            # Python bindings (pybind11)
├── tests/                     # Unit tests (GoogleTest)
├── benchmarks/                # Performance benchmarks
├── examples/                  # Usage examples
├── docs/                      # VitePress documentation
└── openspec/                  # Specification workflow
    ├── specs/                 # Accepted specifications
    ├── changes/               # Active change proposals
    └── archive/               # Completed changes
```

## GEMM Optimization Path
The GEMM kernel demonstrates the progressive optimization approach, moving from a naive implementation through tiling and double buffering to Tensor Cores.
### Performance Characteristics
| Stage | Memory Traffic | Compute Efficiency (of peak) | Speedup vs. Naive |
|---|---|---|---|
| Naive | O(N³) global | ~1% | 1x |
| Tiled | O(N²) global | ~10% | 10x |
| Double Buffer | O(N²) global | ~30% | 30x |
| Tensor Core | O(N²) global | ~80% | 80x |
## FlashAttention Implementation

### Key Innovations
- Tiling — Process attention in tiles that fit in SRAM
- Online Softmax — Update softmax statistics incrementally
- Recomputation — Recompute attention weights instead of storing
## Memory Management

### RAII Pattern
```cpp
// Automatic memory management
{
    tensorcraft::FloatTensor A({4096, 4096});
    // Use A...
}  // Automatically freed when scope exits
```

### Memory Pool (Optional)
## Compile-Time Feature Detection

The `features.hpp` header provides compile-time GPU capability detection.
```cpp
// Automatically detected at compile time
#if TENSORCRAFT_HAS_WMMA
// Use Tensor Cores (SM70+)
#endif

#if TENSORCRAFT_HAS_FP8
// Use FP8 types (SM90+)
#endif

#if TENSORCRAFT_HAS_TMA
// Use Tensor Memory Accelerator (SM90+)
#endif
```

## OpenSpec Workflow
### Specification Structure
Each spec in `openspec/specs/` contains:
- Requirements — What the component must do
- Contracts — API guarantees and invariants
- Acceptance Criteria — How to verify compliance
## Testing Strategy
| Level | Tool | Purpose |
|---|---|---|
| Unit | GoogleTest | Per-kernel correctness |
| Integration | pytest | Python bindings |
| Benchmark | Google Benchmark | Performance regression |
| Validation | Custom | Numerical accuracy |
### Running Tests
```bash
# All tests
ctest --preset dev --output-on-failure

# Specific kernel
ctest --preset dev -R gemm

# Benchmarks
./build/benchmarks/gemm_benchmark
```