Mini-Inference Engine

CUDA GEMM Optimization Tutorial & Mini Inference Engine

A seven-level progressive optimization path from a naive kernel to ~85% of cuBLAS performance

📊

Progressive Optimization

Naive → Tiled → Coalesced → Double Buffer → Register Blocked → Fused → Vectorized

🏗️

Complete Engineering Skeleton

Tensor, InferenceEngine, MemoryPool, StreamManager, AutoTuner, Profiler

📈

Performance Analysis Tools

Built-in AutoTuner, Profiler, and Benchmark utilities

🔬

Deep Technical Insights

Memory hierarchy, warp synchronization, bank conflicts, vectorized loads
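
As a small illustration of the bank-conflict topic listed above, the following host-side C++ sketch (not taken from the project) counts how many threads of a warp land in the same shared-memory bank when each thread reads one element of a tile column. It assumes 32 banks of 4-byte words and a 32-thread warp, which matches current NVIDIA GPUs. The name `column_conflict_degree` and the `pitch` parameter are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr std::size_t kBanks = 32;     // assumed: 32 shared-memory banks, 4-byte words
constexpr std::size_t kWarpSize = 32;  // threads per warp

// Thread t of a warp reads tile[t][col] from a float tile whose rows are
// `pitch` floats apart. Returns the largest number of threads that map to
// the same bank, i.e. the conflict degree (1 means conflict-free).
std::size_t column_conflict_degree(std::size_t pitch, std::size_t col = 0) {
    std::array<std::size_t, kBanks> hits{};  // per-bank access counters, zeroed
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t word = t * pitch + col;  // 4-byte word index in the tile
        ++hits[word % kBanks];                     // consecutive words rotate through banks
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With a row pitch of 32 floats every thread maps to the same bank, so the column read is serialized 32 ways. Padding the pitch to 33 floats spreads the warp across all 32 banks, which is why shared-memory tiles are often declared with one extra column.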

Performance Overview

Performance vs cuBLAS (RTX 3080, 1024×1024)

Level                  % of cuBLAS
L1 Naive               10%
L2 Tiled               20%
L3 Coalesced           25%
L4 Double Buffer       40%
L5 Register Blocked    85%
L6 Fused               80%
L7 Vectorized          85%
cuBLAS                 100% (baseline)

Quick Start

Requirements

CUDA Toolkit 12.x
CMake 3.18+
C++17 compatible compiler
NVIDIA GPU (Compute Capability 7.0+)

Build and run (bash):

git clone https://github.com/LessUp/mini-inference-engine.git
cd mini-inference-engine
cmake --preset release
cmake --build --preset release
./build-release/benchmark

💡 Design Philosophy

This project takes a progressive teaching approach: each optimization level builds on the previous one, so you can see both the performance gain and the principle behind each technique. Starting from a naive implementation, it introduces shared-memory tiling, memory coalescing, double buffering, register blocking, operator fusion, and vectorized loads in turn, ending at roughly 85% of cuBLAS performance.
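
To make the first step of that progression concrete, here is a minimal host-side C++ sketch of the loop restructuring behind tiling. It is a CPU analogue written for illustration, not the project's CUDA kernels: on the GPU the tile would be staged in shared memory by a thread block, whereas here it simply bounds the working set of the inner loops. The names `gemm_naive` and `gemm_tiled`, the row-major layout, and the default tile size of 32 are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Row-major C (m x n) = A (m x k) * B (k x n), one output element at a time.
std::vector<float> gemm_naive(const std::vector<float>& a,
                              const std::vector<float>& b,
                              std::size_t m, std::size_t n, std::size_t k) {
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}

// Same product, computed tile by tile so that each block of A and B is
// reused many times while it is still close to the compute units.
std::vector<float> gemm_tiled(const std::vector<float>& a,
                              const std::vector<float>& b,
                              std::size_t m, std::size_t n, std::size_t k,
                              std::size_t tile = 32) {
    tile = std::max<std::size_t>(tile, 1);  // guard against a zero tile size
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i0 = 0; i0 < m; i0 += tile) {
        const std::size_t i1 = std::min(i0 + tile, m);
        for (std::size_t j0 = 0; j0 < n; j0 += tile) {
            const std::size_t j1 = std::min(j0 + tile, n);
            for (std::size_t p0 = 0; p0 < k; p0 += tile) {
                const std::size_t p1 = std::min(p0 + tile, k);
                // Accumulate the contribution of one A tile and one B tile
                // into the matching C tile; edge tiles are clipped above.
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t p = p0; p < p1; ++p) {
                        const float a_ip = a[i * k + p];
                        for (std::size_t j = j0; j < j1; ++j) {
                            c[i * n + j] += a_ip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

Both functions compute the same product, so the naive version doubles as a correctness reference for the tiled one. The later levels keep this structure and change where the tiles live and how they are loaded.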