TensorCraft-HPC Architecture Design
This document describes the overall architecture and design decisions of TensorCraft-HPC.
Design Principles
1. Header-Only Design
The core library adopts a pure header-only design with the following benefits:
- Zero-configuration integration: Simply
#includeto use - Compile-time optimization: Template code can be fully inlined
- Cross-platform compatibility: No pre-compiled libraries needed
// Usage example
#include "tensorcraft/kernels/gemm.hpp"
tensorcraft::kernels::gemm(A, B, C, M, N, K);2. Progressive Optimization
Each operator provides multiple optimization levels for learning and comparison:
Naive → Tiled → Double Buffer → Tensor Core
↓ ↓ ↓ ↓
Basic Shared Latency Hardware
Memory Hiding Acceleration3. Modern C++ First
Fully leverages C++17/20/23 features:
- Concepts (C++20): Type constraints
- constexpr if (C++17): Compile-time branching
- Structured Bindings (C++17): Multiple return values
- RAII: Automatic resource management
4. Backward Compatibility
Supports CUDA 11.0-13.1 with conditional compilation for new features:
#if TC_CUDA_VERSION >= 12000
// FP8 support
#endif
#if TC_CUDA_VERSION >= 11080
// WGMMA support
#endifModule Architecture
tensorcraft/
├── core/ # Core utilities layer
├── memory/ # Memory management layer
└── kernels/ # Operator implementation layerCore Layer
Provides fundamental infrastructure:
core/
├── cuda_check.hpp # Error checking
├── features.hpp # Feature detection
└── type_traits.hpp # Type systemcuda_check.hpp: CUDA error checking macros
#define TC_CUDA_CHECK(err) do { \
cudaError_t e = (err); \
if (e != cudaSuccess) { \
throw std::runtime_error(cudaGetErrorString(e)); \
} \
} while(0)features.hpp: Compile-time feature detection
// C++ version detection
#if __cplusplus >= 202002L
#define TC_CPP20 1
#endif
// CUDA feature detection
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
#define TC_HAS_WMMA 1
#endiftype_traits.hpp: Type traits and Concepts
template<typename T>
concept Numeric = std::is_arithmetic_v<T> || is_half_v<T>;Memory Layer
Provides memory abstractions:
memory/
├── aligned_vector.hpp # Vectorized loading
├── tensor.hpp # Tensor wrapper
└── memory_pool.hpp # Memory poolAlignedVector: Supports vectorized memory access
template<typename T, int N>
struct alignas(sizeof(T) * N) AlignedVector {
T val[N];
};
// 128-bit load
using Vec4 = AlignedVector<float, 4>;
Vec4 data = *reinterpret_cast<const Vec4*>(&input[idx]);Tensor: RAII-style GPU tensor
template<typename T>
class Tensor {
T* data_ = nullptr;
std::vector<size_t> shape_;
public:
Tensor(const std::vector<size_t>& shape);
~Tensor() { if (data_) cudaFree(data_); }
// Move-only
Tensor(Tensor&&) noexcept;
Tensor(const Tensor&) = delete;
};MemoryPool: Reduces allocation overhead
class MemoryPool {
std::map<size_t, std::vector<void*>> free_blocks_;
public:
void* allocate(size_t bytes);
void deallocate(void* ptr);
};Kernels Layer
Operator implementations:
kernels/
├── elementwise.hpp # Element-wise operations
├── softmax.hpp # Softmax
├── normalization.hpp # Normalization
├── gemm.hpp # Matrix multiplication
├── attention.hpp # Attention mechanism
├── conv2d.hpp # Convolution
├── sparse.hpp # Sparse matrices
└── fusion.hpp # Fusion and quantizationKernel Design Patterns
1. Functor Pattern
Using Functors for composable operations:
struct ReLU {
template<typename T>
__device__ __forceinline__ T operator()(T x) const {
return x > T(0) ? x : T(0);
}
};
template<typename T, typename Func>
__global__ void elementwise_kernel(const T* in, T* out, size_t n, Func func) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) out[idx] = func(in[idx]);
}2. Version Selection Pattern
Selecting optimization level via enumeration:
enum class GemmVersion { Naive, Tiled, DoubleBuffer };
template<typename T>
void launch_gemm(const T* A, const T* B, T* C, int M, int N, int K,
GemmVersion version) {
switch (version) {
case GemmVersion::Naive:
gemm_naive<<<grid, block>>>(A, B, C, M, N, K);
break;
case GemmVersion::Tiled:
gemm_tiled<<<grid, block, smem>>>(A, B, C, M, N, K);
break;
case GemmVersion::DoubleBuffer:
gemm_double_buffer<<<grid, block, smem>>>(A, B, C, M, N, K);
break;
}
}Tensor Core users should call launch_gemm_wmma() directly instead of selecting a GemmVersion variant.
3. Epilogue Pattern
Extensible design for GEMM post-processing:
struct EpilogueBiasReLU {
const float* bias;
__device__ __forceinline__ float operator()(float acc, int col) const {
float result = acc + bias[col];
return result > 0.0f ? result : 0.0f;
}
};
template<typename T, typename Epilogue>
__global__ void gemm_with_epilogue(/* ... */, Epilogue epilogue) {
// ... GEMM computation ...
C[idx] = epilogue(acc, col);
}4. Compile-time Configuration
Using template parameters for compile-time configuration:
template<typename T, int TILE_M = 128, int TILE_N = 128, int TILE_K = 32>
__global__ void gemm_optimized(/* ... */) {
__shared__ T As[TILE_M][TILE_K];
__shared__ T Bs[TILE_K][TILE_N];
// ...
}Memory Access Optimization
Vectorized Loading
// Scalar load: 4 memory transactions
float a = input[idx];
float b = input[idx+1];
float c = input[idx+2];
float d = input[idx+3];
// Vectorized load: 1 memory transaction
float4 vec = *reinterpret_cast<const float4*>(&input[idx]);Coalesced Access
// Good: Adjacent threads access adjacent memory
output[threadIdx.x] = input[threadIdx.x];
// Bad: Strided access
output[threadIdx.x * stride] = input[threadIdx.x * stride];Bank Conflict Avoidance
// With bank conflict
__shared__ float tile[32][32];
// Without bank conflict (add padding)
__shared__ float tile[32][33];Numerical Stability
Softmax Stability
// Unstable: exp may overflow
for (int i = 0; i < n; ++i)
output[i] = exp(input[i]) / sum;
// Stable: subtract maximum value
float max_val = *max_element(input, input + n);
for (int i = 0; i < n; ++i)
output[i] = exp(input[i] - max_val) / sum;LayerNorm Stability
// Welford's algorithm: Single-pass mean and variance computation
float mean = 0.0f, M2 = 0.0f;
for (int i = 0; i < n; ++i) {
float delta = x[i] - mean;
mean += delta / (i + 1);
M2 += delta * (x[i] - mean);
}
float var = M2 / n;Extension Guide
Adding New Operators
- Create header file in
include/tensorcraft/kernels/ - Implement kernel function and launcher
- Add tests to
tests/ - Add benchmarks to
benchmarks/ - Update documentation
Adding New Optimization Levels
- Add new kernel implementation in existing header
- Update version enumeration
- Add branch in launcher
- Add comparative tests
Supporting New Data Types
- Add type detection in
type_traits.hpp - Add feature detection in
features.hpp - Specialize related kernel templates
- Add type conversion utilities
Performance Considerations
Occupancy vs Registers
// High occupancy: More parallelism
__launch_bounds__(256, 4) // 256 threads, 4 blocks/SM
// More registers: Less spilling
__launch_bounds__(128, 2) // 128 threads, 2 blocks/SMShared Memory vs L1 Cache
// Prefer shared memory
cudaFuncSetAttribute(kernel,
cudaFuncAttributePreferredSharedMemoryCarveout, 100);
// Prefer L1 Cache
cudaFuncSetAttribute(kernel,
cudaFuncAttributePreferredSharedMemoryCarveout, 0);Asynchronous Execution
// Use multiple streams for overlapped execution
cudaStream_t streams[4];
for (int i = 0; i < 4; ++i) {
cudaStreamCreate(&streams[i]);
kernel<<<grid, block, 0, streams[i]>>>(data[i]);
}