Frequently Asked Questions
General
What is TensorCraft-HPC?
TensorCraft-HPC is a header-only C++/CUDA library for learning high-performance AI kernel implementation. It provides progressive optimization paths from naive to production-grade performance, with clear annotations at every step.
Who is this project for?
- GPU Kernel Developers seeking to understand optimization techniques
- ML Infrastructure Engineers evaluating kernel implementations
- Researchers studying high-performance computing patterns
- Students learning CUDA programming
How does this differ from CUTLASS or cuBLAS?
| Aspect | TensorCraft-HPC | CUTLASS | cuBLAS |
|---|---|---|---|
| Purpose | Learning | Production | Production |
| Code Style | Readable | Template-heavy | Closed-source |
| Performance | 92% cuBLAS | ~100% | 100% (baseline) |
| Build | Header-only | CMake build | Pre-built (ships with CUDA) |
TensorCraft-HPC prioritizes educational clarity over maximum performance. CUTLASS is the reference for production deployment.
Technical
What GPU architectures are supported?
| Architecture | Compute Capability | Status |
|---|---|---|
| Volta | SM70 (7.0) | Supported |
| Turing | SM75 (7.5) | Supported |
| Ampere | SM80/SM86 | Supported |
| Hopper | SM90 (9.0) | Supported |
| Blackwell | SM100 (10.0) | Supported |
What CUDA versions are supported?
CUDA 11.0 through 13.1. CUDA 12.0+ is recommended for FP8 and Transformer Engine features.
Why header-only?
- Zero build friction: Just `#include` and go
- Easy integration: Copy headers into any project
- Transparency: All code is visible and auditable
- No ABI issues: Everything compiles with your flags
Can I use this in production?
TensorCraft-HPC is primarily designed for learning. While the kernels achieve good performance (up to 95% of NVIDIA libraries for some operations), for production deployments we recommend:
- GEMM: Use cuBLAS or CUTLASS
- Attention: Use the official FlashAttention implementation
- Normalization: Use cuDNN
What precision types are supported?
| Type | Size | Architecture |
|---|---|---|
| FP32 | 32-bit | All |
| FP16 (half) | 16-bit | SM70+ |
| BF16 | 16-bit | SM80+ |
| TF32 | 19-bit | SM80+ |
| FP8 (E4M3/E5M2) | 8-bit | SM90+ |
| INT8 | 8-bit | SM75+ |
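As an illustration of the reduced-precision formats, TF32's 19-bit layout (1 sign, 8 exponent, 10 mantissa bits) can be emulated on the host by truncating the low 13 mantissa bits of an FP32 value. This is a sketch for building intuition, not part of the library's API:

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 truncation: keep FP32's 8 exponent bits but only
    10 of its 23 mantissa bits (the 13 low bits are zeroed)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # drop the 13 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(to_tf32(1.0 + 2**-10))  # 1.0009765625 -- representable in TF32
print(to_tf32(1.0 + 2**-12))  # 1.0          -- truncated away
```

Note that real Tensor Core hardware rounds rather than truncates, so treat this only as an approximation of the precision loss.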
Building & Integration
How do I add TensorCraft-HPC to my CMake project?
```cmake
# Simple: just add the include directory
target_include_directories(your_target PRIVATE
    path/to/modern-ai-kernels/include
)

# Link against the CUDA runtime (CUDAToolkit provides the CUDA::cudart target;
# the older FindCUDA module is deprecated and does not define it)
find_package(CUDAToolkit REQUIRED)
target_link_libraries(your_target PRIVATE CUDA::cudart)
```
How do I use Python bindings?
```bash
# Install from source
cd modern-ai-kernels
pip install -e .
```

```python
import tensorcraft_ops as tc
import numpy as np

# Use NumPy-compatible API
A = np.random.randn(1024, 1024).astype(np.float32)
B = np.random.randn(1024, 1024).astype(np.float32)
C = tc.gemm(A, B)
```
How do I run tests?
```bash
# Build with development preset
cmake --preset dev
cmake --build --preset dev

# Run all tests
ctest --preset dev --output-on-failure

# Run a specific test
ctest --preset dev -R gemm
```
How do I run benchmarks?
```bash
# Build benchmarks
cmake --preset dev
cmake --build --preset dev

# Run GEMM benchmark
./build/benchmarks/gemm_benchmark

# Run with a specific filter
./build/benchmarks/gemm_benchmark --benchmark_filter="FP16"
```
Troubleshooting
Build fails with "CUDA not found"
Ensure the CUDA toolkit is installed and `CUDA_PATH` is set:
```bash
export CUDA_PATH=/usr/local/cuda
export PATH=$CUDA_PATH/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH
```
"Unsupported GPU architecture" error
Check your GPU's compute capability:
```bash
nvidia-smi --query-gpu=compute_cap --format=csv
```
TensorCraft-HPC requires SM70 (Volta) or later.
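If you want to gate on the reported capability programmatically, a minimal sketch might look like the following (the helper name is hypothetical, not part of the library):

```python
def is_supported(compute_cap: str) -> bool:
    """Return True for SM70 (Volta) or later, given the 'major.minor'
    string reported by nvidia-smi's compute_cap query."""
    major, minor = (int(p) for p in compute_cap.split("."))
    return (major, minor) >= (7, 0)

print(is_supported("6.1"))   # False -- Pascal is too old
print(is_supported("9.0"))   # True  -- Hopper
print(is_supported("10.0"))  # True  -- Blackwell
```

Comparing `(major, minor)` tuples rather than the string avoids the trap of `"10.0" < "7.0"` under lexicographic comparison.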
Numerical results differ from cuBLAS
Small numerical differences are expected due to:
- Floating-point order: Different reduction orders produce different rounding
- Mixed precision: Tensor Core uses mixed-precision accumulation
- Fused operations: Fused kernels may use different intermediate precision
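The reduction-order effect is easy to reproduce on the host with NumPy: a sequential left-to-right float32 sum and NumPy's pairwise summation visit the same data in different orders and round differently, while staying within a small relative tolerance of each other. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100_000).astype(np.float32)  # positive values in [0, 1)

# Sequential left-to-right accumulation in float32.
seq = np.float32(0.0)
for v in x:
    seq = np.float32(seq + v)

# np.sum uses pairwise summation -- a different reduction order.
pairwise = x.sum(dtype=np.float32)

rel = abs(float(seq) - float(pairwise)) / abs(float(pairwise))
print(rel < 1e-2)  # True: same data, slightly different rounding
```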
For numerical validation, use relative error with an appropriate tolerance:
```cpp
float rel_error = std::abs(your_result - reference) / std::abs(reference);
EXPECT_LT(rel_error, 1e-3);  // 0.1% tolerance
```
Performance is lower than expected
Check these common issues:
- Clock speeds: GPU may be power-limited. Check with `nvidia-smi -q -d CLOCK`
- Memory bandwidth: Ensure data fits in GPU memory
- Warp divergence: Check for branch divergence in your kernel
- Shared memory bank conflicts: Use padding to avoid conflicts
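The bank-conflict point can be modeled in a few lines: shared memory is divided into 32 four-byte banks, and a column access down a 32-word-wide row-major tile maps every thread to the same bank, while padding each row to 33 words spreads the accesses across all banks. This is a host-side model for intuition, not CUDA code:

```python
def banks_touched(row_width_words: int, bank_count: int = 32) -> int:
    """Number of distinct shared-memory banks hit when 32 threads each
    read column 0 of consecutive rows of a row-major tile of 4-byte words."""
    return len({(row * row_width_words) % bank_count for row in range(32)})

print(banks_touched(32))  # 1  -> all threads hit bank 0: 32-way conflict
print(banks_touched(33))  # 32 -> padded rows: conflict-free
```

This is why tile declarations like `__shared__ float tile[32][33]` appear in optimized kernels: the extra column costs a little shared memory but serializes no accesses.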
Contributing
How do I contribute a new kernel?
- Create a specification in `openspec/changes/`
- Implement the kernel header in `include/tensorcraft/kernels/`
- Add GoogleTest tests in `tests/kernels/`
- Add benchmarks in `benchmarks/`
- Submit a pull request
See the Methodology section for detailed guidelines.
How do I report a bug?
Please file an issue at GitHub Issues with:
- GPU model and driver version
- CUDA version
- Minimal reproduction steps
- Expected vs. actual behavior