
Industry Project Comparison Analysis

This document compares Mini-Inference Engine with mainstream GEMM and inference-engine projects.


Overview Comparison

| Project | Purpose | Language | Optimization Level | This Project's Difference |
| --- | --- | --- | --- | --- |
| cuBLAS | NVIDIA official BLAS | CUDA/C | Production | This project is educational, showing optimization step by step |
| CUTLASS | CUDA template library | C++/CUDA | Production | This project is simpler, suitable for beginners |
| llama.cpp | LLM inference framework | C++ | Production | This project focuses on teaching GEMM optimization |
| vLLM | LLM serving framework | Python/C++ | Production | This project teaches low-level kernels |
| TensorRT-LLM | NVIDIA inference optimization | C++/CUDA | Production | This project doesn't depend on TensorRT |

vs cuBLAS

cuBLAS Characteristics

cuBLAS is NVIDIA's official BLAS library and the usual performance reference for matrix operations on NVIDIA GPUs:

Advantages:

  • Highly optimized kernels for all NVIDIA GPU architectures
  • Automatic algorithm selection
  • Supports FP32/FP16/TF32/INT8 and more precision types
  • Tensor Core acceleration

Limitations:

  • Closed source, so its optimization techniques cannot be studied from the code
  • No support for custom kernel fusion
  • Requires learning an additional API

This Project's Position

This project uses cuBLAS as its performance baseline and has three goals:

  1. Understand optimization principles: show the optimization path from a naive kernel to ~85% of cuBLAS performance
  2. Readable and learnable: every optimization level has detailed comments and explanations
  3. Modifiable and extensible: easy to experiment with and customize

Performance Comparison

| Kernel | Performance vs cuBLAS | Notes |
| --- | --- | --- |
| L1 Naive | ~10% | Establish verifiable baseline |
| L2 Tiled | ~20% | Shared memory tiling |
| L3 Coalesced | ~25% | Coalesced access optimization |
| L4 Double Buffer | ~40% | Latency hiding |
| L5 Register Blocked | ~85% | Near cuBLAS |
| L6 Fused | ~80% | Operator fusion (additional benefit) |
| L7 Vectorized | ~89% | Vectorized loading |
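
The percentages above can be read as the ratio of a kernel's throughput to cuBLAS throughput on the same problem size. The following is a minimal sketch of that arithmetic, assuming the elapsed times were measured elsewhere (for example with CUDA events). The function names are hypothetical and are not taken from this project's code.

```cpp
#include <cassert>
#include <cstdint>

// FLOP count of C = A * B with A (M x K) and B (K x N):
// each of the M*N outputs needs K multiplies and K adds.
double gemm_flops(std::int64_t M, std::int64_t N, std::int64_t K) {
    return 2.0 * static_cast<double>(M) * static_cast<double>(N) *
           static_cast<double>(K);
}

// Achieved throughput in GFLOP/s for one run that took elapsed_ms milliseconds.
double gemm_gflops(std::int64_t M, std::int64_t N, std::int64_t K,
                   double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return gemm_flops(M, N, K) / (elapsed_ms * 1e-3) / 1e9;
}

// Kernel throughput as a percentage of the baseline (e.g. cuBLAS) throughput.
// For a fixed problem size the FLOP count cancels, leaving a ratio of times.
double percent_of_baseline(double kernel_ms, double baseline_ms) {
    assert(kernel_ms > 0.0 && baseline_ms > 0.0);
    return 100.0 * baseline_ms / kernel_ms;
}
```

With made-up timings of 2.0 ms for a custom kernel and 1.7 ms for cuBLAS, `percent_of_baseline(2.0, 1.7)` evaluates to about 85.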

vs CUTLASS

CUTLASS Characteristics

CUTLASS (CUDA Templates for Linear Algebra Subroutines) is NVIDIA's open-source GEMM template library:

Advantages:

  • Template-based design, highly configurable
  • Tensor Core support
  • Extremely high code quality, excellent learning material
  • Continuous updates, following latest architectures

Limitations:

  • Steep learning curve
  • Large codebase that is hard to get started with quickly
  • Requires deep knowledge of template metaprogramming

Relationship with CUTLASS

This project can serve as prerequisite learning material for CUTLASS:

Mini-Inference Engine → CUTLASS → cuBLAS
      (Beginner)       (Advanced)  (Production)

Recommended learning path:

  1. Understand GEMM optimization basics through this project
  2. Read CUTLASS source code for advanced techniques
  3. Use cuBLAS for production development

vs llama.cpp

llama.cpp Characteristics

llama.cpp is an LLM inference framework by Georgi Gerganov:

Advantages:

  • Pure C/C++ implementation, no external dependencies
  • Multiple quantization formats, stored in the GGUF file format
  • CPU and GPU backends
  • Extensive community support

GEMM Related:

  • Uses custom matrix multiplication kernels
  • Optimized for quantized matrices (Q4/Q5/Q8)
  • Supports Apple Metal, CUDA, ROCm backends

Relationship with llama.cpp

This project focuses on general GEMM optimization, while llama.cpp specializes in quantized matrix multiplication:

| Aspect | This Project | llama.cpp |
| --- | --- | --- |
| Matrix Type | FP32/FP16 | Quantized (Q4/Q5/Q8) |
| Optimization Target | General GEMM | Inference scenarios |
| Backend Support | CUDA | CPU/CUDA/Metal/ROCm |
| Learning Value | GEMM principles | Inference system design |
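
To make the "Matrix Type" row concrete, here is a minimal sketch of a block-wise 8-bit quantized dot product. It is a simplified illustration of the general idea (a group of values sharing one scale factor), not llama.cpp's actual data layout or code. The block size of 32, the `QBlock` struct, and both function names are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlock = 32;  // values per quantization block (assumed)

// Hypothetical block layout: one scale shared by kBlock 8-bit values.
struct QBlock {
    float scale;            // real value is approximately scale * q[i]
    std::int8_t q[kBlock];  // quantized values in [-127, 127]
};

// Quantize a float vector block by block using the per-block absolute maximum.
std::vector<QBlock> quantize(const std::vector<float>& x) {
    assert(x.size() % kBlock == 0);
    std::vector<QBlock> out(x.size() / kBlock);
    for (std::size_t b = 0; b < out.size(); ++b) {
        float amax = 0.0f;
        for (std::size_t i = 0; i < kBlock; ++i)
            amax = std::fmax(amax, std::fabs(x[b * kBlock + i]));
        const float scale = amax / 127.0f;
        out[b].scale = scale;
        for (std::size_t i = 0; i < kBlock; ++i) {
            const float v = (scale > 0.0f) ? x[b * kBlock + i] / scale : 0.0f;
            out[b].q[i] = static_cast<std::int8_t>(std::lround(v));
        }
    }
    return out;
}

// Dot product on quantized data: integer multiply-accumulate inside a block,
// then one floating-point rescale per block.
float dot_q8(const std::vector<QBlock>& a, const std::vector<QBlock>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t blk = 0; blk < a.size(); ++blk) {
        std::int32_t acc = 0;
        for (std::size_t i = 0; i < kBlock; ++i)
            acc += static_cast<std::int32_t>(a[blk].q[i]) * b[blk].q[i];
        sum += a[blk].scale * b[blk].scale * static_cast<float>(acc);
    }
    return sum;
}
```

The inner loop works on integers and touches a quarter of the bytes an FP32 dot product would, which is why quantized kernels are tuned differently from the general FP32/FP16 GEMM covered by this project.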

vs vLLM

vLLM Characteristics

vLLM is a high-performance LLM serving framework:

Core Technologies:

  • PagedAttention: Efficient KV Cache management
  • Continuous batching: Improved throughput
  • CUDA Graph: Reduced kernel launch overhead
  • Tensor Parallel: Multi-GPU parallelism

Performance:

  • The PagedAttention paper reports 2-4× higher throughput than earlier serving systems such as FasterTransformer and Orca at comparable latency
  • Supports multiple model architectures

Relationship with vLLM

This project teaches the low-level GEMM principles behind the kernels that a serving framework such as vLLM relies on:

This Project (GEMM Kernel) → FlashAttention → vLLM (Service Framework)

After understanding this project's optimization techniques, you can better understand:

  • FlashAttention's tiling strategy
  • PagedAttention's memory management
  • Tensor Parallel's communication optimization
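
The tiling idea that carries over from GEMM to FlashAttention can be sketched on the CPU. The example below is an illustration under stated assumptions, not this project's kernel: the small `tileA`/`tileB` arrays stand in for GPU shared memory, `kTile` is an arbitrary tile size, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 4;  // tile edge (assumed; GPU kernels often use 16 or 32)

// Reference: C (M x N) = A (M x K) * B (K x N), all row-major.
std::vector<float> gemm_naive(const std::vector<float>& A,
                              const std::vector<float>& B,
                              std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < K; ++k)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
    return C;
}

// Tiled: copy one tile of A and one tile of B into small local buffers,
// then reuse each loaded element kTile times before moving on.
std::vector<float> gemm_tiled(const std::vector<float>& A,
                              const std::vector<float>& B,
                              std::size_t M, std::size_t N, std::size_t K) {
    assert(M % kTile == 0 && N % kTile == 0 && K % kTile == 0);
    std::vector<float> C(M * N, 0.0f);
    float tileA[kTile][kTile];
    float tileB[kTile][kTile];
    for (std::size_t i0 = 0; i0 < M; i0 += kTile)
        for (std::size_t j0 = 0; j0 < N; j0 += kTile)
            for (std::size_t k0 = 0; k0 < K; k0 += kTile) {
                // "Load" phase: stage both tiles in the local buffers.
                for (std::size_t i = 0; i < kTile; ++i)
                    for (std::size_t k = 0; k < kTile; ++k)
                        tileA[i][k] = A[(i0 + i) * K + (k0 + k)];
                for (std::size_t k = 0; k < kTile; ++k)
                    for (std::size_t j = 0; j < kTile; ++j)
                        tileB[k][j] = B[(k0 + k) * N + (j0 + j)];
                // "Compute" phase: accumulate the partial products of this tile pair.
                for (std::size_t i = 0; i < kTile; ++i)
                    for (std::size_t j = 0; j < kTile; ++j)
                        for (std::size_t k = 0; k < kTile; ++k)
                            C[(i0 + i) * N + (j0 + j)] += tileA[i][k] * tileB[k][j];
            }
    return C;
}
```

Each element staged in a tile is read `kTile` times from the fast buffer instead of from main memory. FlashAttention applies the same load-a-block, compute, move-on pattern to the attention matrices.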

vs TensorRT-LLM

TensorRT-LLM Characteristics

TensorRT-LLM is NVIDIA's official LLM inference optimization library:

Core Technologies:

  • Rich kernel library
  • Automatic graph optimization
  • Multi-GPU parallelism
  • Tensor Core acceleration

Performance:

  • Llama2-13B near 12,000 tok/s on H200
  • 40,000+ tok/s on B200 GPUs

Relationship with TensorRT-LLM

This project doesn't depend on TensorRT, but the techniques it teaches are transferable:

| This Project's Technique | TensorRT-LLM Application |
| --- | --- |
| Shared memory tiling | All GEMM kernels |
| Double buffering | FlashAttention |
| Operator fusion | Graph optimizer |
| Vectorized loading | Tensor Core |
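
Operator fusion, one of the techniques in the table, can be illustrated with a CPU-side sketch that compares a GEMM followed by a separate bias-plus-ReLU pass against a version that applies both in the GEMM epilogue. The choice of bias and ReLU as the fused operators and both function names are assumptions made for this example, not a description of this project's L6 kernel or of TensorRT-LLM's optimizer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: the GEMM writes C, then a second pass re-reads and rewrites C.
// On a GPU this would be two kernel launches and an extra round trip to memory.
std::vector<float> gemm_then_bias_relu(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       const std::vector<float>& bias,
                                       std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N && bias.size() == N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            const float v = C[i * N + j] + bias[j];
            C[i * N + j] = v > 0.0f ? v : 0.0f;
        }
    return C;
}

// Fused: bias and ReLU are applied while the accumulator is still in a local
// variable (a register on the GPU), so C is written exactly once.
std::vector<float> gemm_bias_relu_fused(const std::vector<float>& A,
                                        const std::vector<float>& B,
                                        const std::vector<float>& bias,
                                        std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N && bias.size() == N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            const float v = acc + bias[j];
            C[i * N + j] = v > 0.0f ? v : 0.0f;
        }
    return C;
}
```

Both functions return the same values; the fused version only removes the intermediate write and re-read of `C`.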

Academic Paper Citations

This project's optimization techniques draw on the following academic papers and talks:

Classic Papers

  1. Volkov, Vasily. "Better Performance at Lower Occupancy." GTC 2010.
    • Register blocking and warp-level optimization

  2. Hong, Sunpyo, and Hyesoon Kim. "An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness." ISCA 2009.
    • GPU performance analysis model

  3. Baghsorkhi, Sara S., et al. "An analytical model for GPU memory accesses." ISPASS 2012.
    • Memory access model

Recent Papers

  1. Dao, Tri, et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022.
    • Attention mechanism optimization

  2. Kwon, Woosuk, et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023.
    • vLLM's core paper

Summary

This Project's Unique Value

  1. Progressive learning: from a naive kernel to ~85% of cuBLAS performance, with every step verifiable
  2. Complete engineering: not isolated kernels, but a complete inference engine skeleton
  3. Bilingual documentation: full Chinese and English docs, suitable for Chinese-speaking learners
  4. Focused repository: Fewer workflow layers, easier to build, read, and maintain

Suggested learning schedule:

Week 1-2: This Project (GEMM Basics)
Week 3-4: CUTLASS Source Code Reading
Week 5-6: FlashAttention Paper + Implementation
Week 7+:  vLLM / TensorRT-LLM Architecture Research

MIT License | CUDA GEMM optimization tutorial