
CUDA GEMM Optimization Tutorial & Mini Inference Engine

A 7-level progressive optimization path from a naive kernel to roughly 85% of cuBLAS performance

Performance Overview

Performance vs cuBLAS (RTX 3080, 1024×1024; cuBLAS = 100% baseline)
  • L1 Naive: 10%
  • L2 Tiled: 20%
  • L3 Coalesced: 25%
  • L4 Double Buffer: 40%
  • L5 Register Blocked: 85%
  • L6 Fused: 80%
  • L7 Vectorized: 85%
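
The page does not say how these percentages are measured. A common convention, assumed here, is to time both kernels on the same problem size and report the ratio of their throughputs. The sketch below shows that arithmetic in host-side C++; the function names `gemm_flops`, `gflops`, and `percent_of_baseline` are hypothetical and are not taken from the project's benchmark.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations for C = A * B with A (M x K) and B (K x N):
// each of the M*N outputs needs K multiplies and K adds.
double gemm_flops(std::int64_t M, std::int64_t N, std::int64_t K) {
    return 2.0 * static_cast<double>(M) * static_cast<double>(N) *
           static_cast<double>(K);
}

// Throughput in GFLOP/s given a measured kernel time in milliseconds.
double gflops(std::int64_t M, std::int64_t N, std::int64_t K, double ms) {
    assert(ms > 0.0);
    return gemm_flops(M, N, K) / (ms * 1e-3) / 1e9;
}

// Relative performance as a percentage of a baseline (assumed: cuBLAS),
// for two timings taken on the same problem size. Because the FLOP count
// is identical, the throughput ratio reduces to a ratio of times.
double percent_of_baseline(double kernel_ms, double baseline_ms) {
    assert(kernel_ms > 0.0 && baseline_ms > 0.0);
    return 100.0 * baseline_ms / kernel_ms;
}
```

For example, a kernel that takes twice as long as the baseline on the same 1024×1024 problem would be reported as 50% under this convention.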

Quick Start

Requirements
  • CUDA Toolkit 12.x
  • CMake 3.18+
  • C++17 compatible compiler
  • NVIDIA GPU (Compute Capability 7.0+)
```bash
git clone https://github.com/LessUp/mini-inference-engine.git
cd mini-inference-engine
cmake --preset release
cmake --build --preset release
./build-release/benchmark
```
💡 Design Philosophy
This project takes a progressive teaching approach: each optimization level builds on the previous one, so you can see both the performance gain and the underlying principle of each technique in isolation. Starting from a naive implementation, the levels introduce shared-memory tiling, memory coalescing, double buffering, register blocking, operator fusion, and vectorized loads, in that order, to reach near-cuBLAS performance.
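
To make the first step of that path concrete, here is a minimal CPU-side sketch of the idea behind tiling: the same product is computed in small blocks so that a block of A and B is reused many times while it is still in fast memory. This is a plain C++ analogue written for illustration, not the project's CUDA kernels; on the GPU the "fast memory" would be shared memory and the loops over tiles would map to thread blocks. The names `gemm_naive` and `gemm_tiled` and the default tile size of 32 are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: C = A * B, row-major, A is M x K, B is K x N, C is M x N.
void gemm_naive(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C,
                std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}

// Tiled variant: iterate over tile x tile blocks so each block of A and B
// is reused while it is hot in cache. Edge tiles are clamped with std::min,
// so M, N, and K do not need to be multiples of the tile size.
void gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C,
                std::size_t M, std::size_t N, std::size_t K,
                std::size_t tile = 32) {
    assert(tile > 0);
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += tile) {
        const std::size_t i_end = std::min(i0 + tile, M);
        for (std::size_t j0 = 0; j0 < N; j0 += tile) {
            const std::size_t j_end = std::min(j0 + tile, N);
            for (std::size_t k0 = 0; k0 < K; k0 += tile) {
                const std::size_t k_end = std::min(k0 + tile, K);
                // Accumulate the partial product of one A tile and one B tile.
                for (std::size_t i = i0; i < i_end; ++i) {
                    for (std::size_t k = k0; k < k_end; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = j0; j < j_end; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```

Both functions produce the same result; only the order of memory accesses differs, which is the point each later level refines further.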

MIT License | CUDA GEMM optimization tutorial