- CUDA Toolkit 12.x
- CMake 3.18+ (3.20+ is needed for the `cmake --preset` and `cmake --build --preset` commands)
- C++17-compatible compiler
- NVIDIA GPU (Compute Capability 7.0+)
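
Before building, it can help to confirm the local toolchain meets the minimums listed above. The script below is a minimal sketch and is not part of the repository: the helper names `version_ge` and `check` are hypothetical, it assumes a `sort` that supports `-V` (GNU coreutils), and it assumes the installed NVIDIA driver accepts the `compute_cap` query field in `nvidia-smi`.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not shipped with the project).
# Minimum versions are taken from the requirements list above.

# version_ge FOUND MINIMUM: succeeds when FOUND >= MINIMUM.
# Relies on `sort -V` (version sort), assumed to be available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check NAME FOUND MINIMUM: print a one-line status for a tool.
check() {
    if [ -z "$2" ]; then
        echo "$1: not found (need >= $3)"
    elif version_ge "$2" "$3"; then
        echo "$1: $2 (ok, need >= $3)"
    else
        echo "$1: $2 (too old, need >= $3)"
    fi
}

# Extract version numbers; `|| true` keeps going when a tool is missing.
cmake_ver=$(cmake --version 2>/dev/null | sed -n 's/^cmake version \([0-9.]*\).*/\1/p') || true
nvcc_ver=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9.]*\).*/\1/p') || true
gpu_cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1) || true

check "CMake" "$cmake_ver" "3.18"
check "CUDA Toolkit" "$nvcc_ver" "12.0"
check "GPU compute capability" "$gpu_cc" "7.0"
```

The script only reports and never aborts, so it can be run on a machine without a GPU to see which pieces are missing. The C++17 compiler requirement is not checked here, since the compiler in use depends on how CMake is configured.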
Progressive Optimization
Naive → Tiled → Coalesced → Double Buffer → Register Blocked → Fused → Vectorized
A 7-level progressive optimization path, from a naive implementation to roughly 85% of cuBLAS performance.
git clone https://github.com/LessUp/mini-inference-engine.git
cd mini-inference-engine
cmake --preset release
cmake --build --preset release
./build-release/benchmark