# Architecture Overview
This document describes the high-level architecture of TensorCraft-HPC.
## Design Philosophy
TensorCraft-HPC follows three core principles:
- Readability First — Code is meant to be read. Each kernel shows the optimization progression.
- Header-Only — Zero build complexity for C++ users. Just include and go.
- OpenSpec-Driven — Specifications in `openspec/specs/` are the source of truth.
## System Architecture
### Directory Structure

```
modern-ai-kernels/
├── include/tensorcraft/       # Header-only library
│   ├── core/                  # Utilities (error handling, type traits)
│   │   ├── cuda_check.hpp     # CUDA error checking macros
│   │   ├── features.hpp       # Compile-time GPU feature detection
│   │   ├── type_traits.hpp    # Type manipulation utilities
│   │   └── warp_utils.hpp     # Warp-level primitives
│   ├── memory/                # Memory management
│   │   ├── tensor.hpp         # RAII GPU tensor wrapper
│   │   ├── memory_pool.hpp    # Optional memory pooling
│   │   └── aligned_vector.hpp # Cache-aligned vectors
│   └── kernels/               # All compute kernels
│       ├── gemm.hpp           # Matrix multiplication
│       ├── attention.hpp      # Attention mechanisms
│       ├── normalization.hpp  # LayerNorm, RMSNorm, etc.
│       ├── softmax.hpp        # Softmax variants
│       ├── conv2d.hpp         # 2D convolution
│       ├── sparse.hpp         # Sparse operations
│       ├── fusion.hpp         # Fused kernels
│       ├── elementwise.hpp    # ReLU, GeLU, etc.
│       ├── memory_ops.hpp     # Copy, transpose
│       └── quantization.hpp   # INT8/FP8 quantization
├── src/python_ops/            # Python bindings (pybind11)
├── tests/                     # Unit tests (GoogleTest)
├── benchmarks/                # Performance benchmarks
├── examples/                  # Usage examples
├── docs/                      # VitePress documentation
└── openspec/                  # Specification workflow
    ├── specs/                 # Accepted specifications
    ├── changes/               # Active change proposals
    └── archive/               # Completed changes
```

## GEMM Optimization Path
The GEMM kernel demonstrates the progressive optimization approach, moving from a naive implementation through tiling and double buffering to Tensor Cores.
### Performance Characteristics
| Stage | Memory Traffic | Compute Efficiency (of peak) | Speedup vs. Naive |
|---|---|---|---|
| Naive | O(N³) global | ~1% | 1x |
| Tiled | O(N²) global | ~10% | 10x |
| Double Buffer | O(N²) global | ~30% | 30x |
| Tensor Core | O(N²) global | ~80% | 80x |
## FlashAttention Implementation

### Key Innovations
- Tiling — Process attention in tiles that fit in SRAM
- Online Softmax — Update softmax statistics incrementally
- Recomputation — Recompute attention weights instead of storing
## Memory Management

### RAII Pattern
```cpp
// Automatic memory management
{
    tensorcraft::FloatTensor A({4096, 4096});
    // Use A...
}  // Automatically freed when scope exits
```

### Memory Pool (Optional)
## Compile-Time Feature Detection

The `features.hpp` header provides compile-time GPU capability detection.
```cpp
// Automatically detected at compile time
#if TENSORCRAFT_HAS_WMMA
// Use Tensor Cores (SM70+)
#endif

#if TENSORCRAFT_HAS_FP8
// Use FP8 types (SM90+)
#endif

#if TENSORCRAFT_HAS_TMA
// Use Tensor Memory Accelerator (SM90+)
#endif
```

## OpenSpec Workflow
### Specification Structure
Each spec in `openspec/specs/` contains:
- Requirements — What the component must do
- Contracts — API guarantees and invariants
- Acceptance Criteria — How to verify compliance
## Testing Strategy
| Level | Tool | Purpose |
|---|---|---|
| Unit | GoogleTest | Per-kernel correctness |
| Integration | pytest | Python bindings |
| Benchmark | Google Benchmark | Performance regression |
| Validation | Custom | Numerical accuracy |
### Running Tests
```bash
# All tests
ctest --preset dev --output-on-failure

# Specific kernel
ctest --preset dev -R gemm

# Benchmarks
./build/benchmarks/gemm_benchmark
```