Architecture Overview

This document describes the high-level architecture of TensorCraft-HPC.

Design Philosophy

TensorCraft-HPC follows three core principles:

  1. Readability First — Code is meant to be read. Each kernel shows the optimization progression.
  2. Header-Only — Zero build complexity for C++ users. Just include and go.
  3. OpenSpec-Driven — Specifications in openspec/specs/ are the source of truth.

System Architecture


Directory Structure

modern-ai-kernels/
├── include/tensorcraft/       # Header-only library
│   ├── core/                  # Utilities (error handling, type traits)
│   │   ├── cuda_check.hpp     # CUDA error checking macros
│   │   ├── features.hpp       # Compile-time GPU feature detection
│   │   ├── type_traits.hpp    # Type manipulation utilities
│   │   └── warp_utils.hpp     # Warp-level primitives
│   ├── memory/                # Memory management
│   │   ├── tensor.hpp         # RAII GPU tensor wrapper
│   │   ├── memory_pool.hpp    # Optional memory pooling
│   │   └── aligned_vector.hpp # Cache-aligned vectors
│   └── kernels/               # All compute kernels
│       ├── gemm.hpp           # Matrix multiplication
│       ├── attention.hpp      # Attention mechanisms
│       ├── normalization.hpp  # LayerNorm, RMSNorm, etc.
│       ├── softmax.hpp        # Softmax variants
│       ├── conv2d.hpp         # 2D convolution
│       ├── sparse.hpp         # Sparse operations
│       ├── fusion.hpp         # Fused kernels
│       ├── elementwise.hpp    # ReLU, GeLU, etc.
│       ├── memory_ops.hpp     # Copy, transpose
│       └── quantization.hpp   # INT8/FP8 quantization
├── src/python_ops/            # Python bindings (pybind11)
├── tests/                     # Unit tests (GoogleTest)
├── benchmarks/                # Performance benchmarks
├── examples/                  # Usage examples
├── docs/                      # VitePress documentation
└── openspec/                  # Specification workflow
    ├── specs/                 # Accepted specifications
    ├── changes/               # Active change proposals
    └── archive/               # Completed changes

GEMM Optimization Path

The GEMM kernel demonstrates the progressive optimization approach: each stage keeps the same interface while reducing global-memory traffic and increasing compute efficiency.

Performance Characteristics

| Stage         | Memory Traffic | Compute Efficiency | Relative Speed |
| ------------- | -------------- | ------------------ | -------------- |
| Naive         | O(N³) global   | ~1%                | 1x             |
| Tiled         | O(N²) global   | ~10%               | 10x            |
| Double Buffer | O(N²) global   | ~30%               | 30x            |
| Tensor Core   | O(N²) global   | ~80%               | 80x            |

FlashAttention Implementation

Key Innovations

  1. Tiling — Process attention in tiles that fit in SRAM
  2. Online Softmax — Update softmax statistics incrementally
  3. Recomputation — Recompute attention weights instead of storing

Memory Management

RAII Pattern

```cpp
// Automatic memory management
{
    tensorcraft::FloatTensor A({4096, 4096});
    // Use A...
} // Automatically freed when scope exits
```

Memory Pool (Optional)


Compile-Time Feature Detection

The features.hpp header provides compile-time GPU capability detection:

```cpp
// Automatically detected at compile time
#if TENSORCRAFT_HAS_WMMA
    // Use Tensor Cores (SM70+)
#endif

#if TENSORCRAFT_HAS_FP8
    // Use FP8 types (SM90+)
#endif

#if TENSORCRAFT_HAS_TMA
    // Use Tensor Memory Accelerator (SM90+)
#endif
```

OpenSpec Workflow

Specification Structure

Each spec in openspec/specs/ contains:

  • Requirements — What the component must do
  • Contracts — API guarantees and invariants
  • Acceptance Criteria — How to verify compliance

Testing Strategy

| Level       | Tool             | Purpose                |
| ----------- | ---------------- | ---------------------- |
| Unit        | GoogleTest       | Per-kernel correctness |
| Integration | pytest           | Python bindings        |
| Benchmark   | Google Benchmark | Performance regression |
| Validation  | Custom           | Numerical accuracy     |

Running Tests

```bash
# All tests
ctest --preset dev --output-on-failure

# Specific kernel
ctest --preset dev -R gemm

# Benchmarks
./build/benchmarks/gemm_benchmark
```

Released under the Apache 2.0 License.