Learning Path
This document provides a 4-week learning plan for CUDA GEMM optimization, with daily tasks and exercises.
Learning Objectives
After completing this learning path, you will be able to:
- Understand GPU architecture and CUDA programming model
- Master seven levels of GEMM optimization techniques
- Analyze and optimize CUDA kernel performance
- Read production-grade code such as CUTLASS, and use cuBLAS (closed source) as a performance baseline
Week 1: CUDA Fundamentals
Day 1-2: Environment Setup & CUDA Programming Model
Topics:
- CUDA Toolkit installation and environment configuration
- CUDA programming model: host, device, kernel
- Thread hierarchy: grid, block, thread
- Memory hierarchy: global, shared, register, local
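As a minimal sketch of how the grid, block, and thread hierarchy maps onto data, the kernel below adds two vectors with one thread per element. The names vec_add and run_vec_add are illustrative and not part of this project, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Each thread handles one element; its global index combines block and thread IDs.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the last block may be partially full
}

// Hypothetical host helper: copies inputs to the device, launches, copies back.
std::vector<float> run_vec_add(const std::vector<float>& a, const std::vector<float>& b) {
    assert(!a.empty() && a.size() == b.size());
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    vec_add<<<blocks, threads>>>(da, db, dc, n);
    std::vector<float> c(n);
    cudaMemcpy(c.data(), dc, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return c;
}
```

The host side decides the grid shape; the kernel only sees its own coordinates through blockIdx, blockDim, and threadIdx.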
Practice:
```bash
# Verify CUDA environment
nvcc --version
nvidia-smi
# Build project
cmake --preset default
cmake --build --preset default
```
Code Reading:
- src/naive_matmul.cu - Basic CUDA kernel
- include/common.h - CUDA error handling macros
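The contents of include/common.h are not shown in this document. As a rough idea of what a CUDA error handling macro usually looks like, here is a sketch; the names CUDA_CHECK_SKETCH and alloc_roundtrip are hypothetical and may differ from what the project defines.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical macro: wraps a CUDA runtime call and aborts with file/line on failure.
#define CUDA_CHECK_SKETCH(call)                                           \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Example use: allocate and free a device buffer, checking each call.
inline bool alloc_roundtrip(size_t bytes) {
    assert(bytes > 0);
    void* p = nullptr;
    CUDA_CHECK_SKETCH(cudaMalloc(&p, bytes));
    CUDA_CHECK_SKETCH(cudaFree(p));
    return true;
}
```

Kernel launches do not return an error code directly, so a common pattern is to wrap cudaGetLastError() in the same macro right after the launch.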
Exercises:
- Modify naive_matmul.cu to have each thread compute 2 output elements, and observe how performance changes
- Use cudaGetDeviceProperties to query your GPU's thread block and warp parameters
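For the second exercise, a starting point might look like the sketch below. The function name print_device_limits is made up for illustration; the fields are standard members of cudaDeviceProp.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>

// Prints a few launch-related limits for one device; returns false if the query fails.
bool print_device_limits(int device) {
    assert(device >= 0);
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    std::printf("name:                %s\n", prop.name);
    std::printf("warpSize:            %d\n", prop.warpSize);
    std::printf("maxThreadsPerBlock:  %d\n", prop.maxThreadsPerBlock);
    std::printf("maxThreadsDim:       %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("sharedMemPerBlock:   %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("regsPerBlock:        %d\n", prop.regsPerBlock);
    std::printf("multiProcessorCount: %d\n", prop.multiProcessorCount);
    return true;
}
```

These limits bound the block sizes and shared memory tiles you can pick in the later exercises.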
Day 3-4: Memory Hierarchy
Topics:
- Global memory latency and throughput
- Shared memory usage and bank conflicts
- Register usage and spilling
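To make bank conflicts concrete, the sketch below transposes a matrix through shared memory and pads each tile row by one element. It is a standalone illustration rather than code from this project, and it assumes a square matrix whose side is a multiple of 32.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int TILE = 32;

// Transposes an n x n matrix through a shared-memory tile.
// The extra padding column shifts each row by one bank, so the column-wise
// tile reads in the second phase fall into 32 different banks.
__global__ void transpose_padded(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // row-wise global read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;              // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // row-wise global write
}

// Hypothetical host helper for trying the kernel.
std::vector<float> run_transpose(const std::vector<float>& in, int n) {
    assert(n > 0 && n % TILE == 0 && in.size() == static_cast<size_t>(n) * n);
    size_t bytes = in.size() * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes); cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    transpose_padded<<<grid, block>>>(d_in, d_out, n);
    std::vector<float> out(in.size());
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    return out;
}
```

Changing the tile declaration to tile[TILE][TILE] keeps the result correct but makes every thread of a warp read the same bank in the second phase, which is the effect to look for in a profiler.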
Code Reading:
- src/tiled_gemm.cu - Shared memory tiling
- src/coalesced_gemm.cu - Coalesced memory access
Exercises:
- Modify BLOCK_SIZE in tiled_gemm.cu, and observe how performance changes
- Use Nsight Compute to analyze bank conflicts
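If you want a small kernel to experiment with before touching the project file, the sketch below shows the general shape of shared memory tiling with the tile size as a template parameter. It is not the contents of tiled_gemm.cu; the names tiled_gemm_sketch and run_tiled_gemm are illustrative, and matrices are assumed row-major.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// C (M x N) = A (M x K) * B (K x N); one thread per output element.
template <int TILE>
__global__ void tiled_gemm_sketch(const float* A, const float* B, float* C,
                                  int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    float acc = 0.0f;
    for (int t = 0; t < K; t += TILE) {
        // Stage one tile of A and B; out-of-range elements are zero-filled.
        As[ty][tx] = (row < M && t + tx < K) ? A[row * K + t + tx] : 0.0f;
        Bs[ty][tx] = (t + ty < K && col < N) ? B[(t + ty) * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k) acc += As[ty][k] * Bs[k][tx];
        __syncthreads();  // finish reading before the next tile overwrites
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host helper: runs the kernel for a given tile size.
template <int TILE>
std::vector<float> run_tiled_gemm(const std::vector<float>& A, const std::vector<float>& B,
                                  int M, int N, int K) {
    assert(A.size() == static_cast<size_t>(M) * K && B.size() == static_cast<size_t>(K) * N);
    std::vector<float> C(static_cast<size_t>(M) * N);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE), grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    tiled_gemm_sketch<TILE><<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return C;
}
```

Calling run_tiled_gemm<8>, run_tiled_gemm<16>, and run_tiled_gemm<32> on the same inputs is one way to see how the tile size trades shared memory use against data reuse.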
Day 5-7: Level 1-2 Optimization Practice
Topics:
- Naive GEMM performance bottleneck analysis
- Tiled GEMM optimization principles
- Performance measurement methods
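One common way to measure kernel time is with CUDA events, sketched below. The kernel busy_kernel and the helpers time_kernel_ms and gemm_gflops are placeholders for illustration; the project's benchmark may measure differently.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Placeholder workload: one fused multiply-add per element.
__global__ void busy_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 1.0f;
}

// Average milliseconds per launch over `iters` launches, after one warm-up launch.
float time_kernel_ms(int n, int iters) {
    assert(n > 0 && iters > 0);
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    int threads = 256, blocks = (n + threads - 1) / threads;
    busy_kernel<<<blocks, threads>>>(d, n);   // warm-up, not timed
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) busy_kernel<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);               // wait until the timed work is done
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d);
    return ms / iters;
}

// A GEMM performs 2*M*N*K floating-point operations (one multiply, one add per term).
double gemm_gflops(int M, int N, int K, double ms) {
    return 2.0 * M * N * K / (ms * 1e-3) / 1e9;
}
```

Averaging over several launches and discarding the first one reduces the influence of one-time startup costs on the reported number.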
Practice:
```bash
# Run benchmark
./build-release/benchmark --kernel=naive --kernel=tiled
```
Exercises:
- Calculate theoretical memory throughput for Naive GEMM, compare with measured values
- Analyze how much global memory access is reduced in Tiled GEMM vs Naive
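For the second exercise, a back-of-the-envelope model can be written as host code. The sketch below counts global memory loads while ignoring caches and assuming dimensions are multiples of the tile size T; the function names are illustrative.

```cuda
#include <cassert>

// Naive kernel: every output element reads K values of A and K values of B.
double naive_global_loads(double M, double N, double K) {
    return 2.0 * M * N * K;
}

// Tiled kernel: each value staged in shared memory is reused by T threads,
// so global loads drop by a factor of T.
double tiled_global_loads(double M, double N, double K, double T) {
    assert(T > 0);
    return 2.0 * M * N * K / T;
}

// Convert a float count to bytes (4 bytes per float).
double loads_to_bytes(double loads) { return 4.0 * loads; }
```

Under these assumptions, a 1024 x 1024 x 1024 problem with T = 32 issues 1/32 of the naive kernel's global loads; measured ratios will differ because real hardware caches part of the naive kernel's traffic.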
Week 2: Intermediate Optimization
Day 1-2: Coalesced Access Optimization
Topics:
- Coalesced access principles
- Memory transactions and alignment
- Bank conflict avoidance
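The difference between coalesced and uncoalesced access comes down to which index threadIdx.x drives. The two copy kernels below produce the same result but touch memory in different orders; they are a standalone sketch, not code from coalesced_gemm.cu.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Coalesced: consecutive threads in a warp access consecutive addresses in a row.
__global__ void copy_coalesced(const float* in, float* out, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) out[row * cols + col] = in[row * cols + col];
}

// Strided: threadIdx.x walks down a column, so neighbouring threads are `cols` floats apart.
__global__ void copy_strided(const float* in, float* out, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) out[row * cols + col] = in[row * cols + col];
}

// Hypothetical host helper: runs one of the two variants on a row-major matrix.
std::vector<float> run_copy(const std::vector<float>& in, int rows, int cols, bool coalesced) {
    assert(rows > 0 && cols > 0 && in.size() == static_cast<size_t>(rows) * cols);
    size_t bytes = in.size() * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes); cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(32, 8);
    if (coalesced) {
        dim3 grid((cols + 31) / 32, (rows + 7) / 8);
        copy_coalesced<<<grid, block>>>(d_in, d_out, rows, cols);
    } else {
        dim3 grid((rows + 31) / 32, (cols + 7) / 8);
        copy_strided<<<grid, block>>>(d_in, d_out, rows, cols);
    }
    std::vector<float> out(in.size());
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    return out;
}
```

Profiling both variants on a large matrix should show the strided version issuing many more memory transactions for the same number of bytes copied.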
Code Reading:
src/coalesced_gemm.cu
Exercises:
- Use Nsight Compute to analyze the memory throughput of coalesced_gemm.cu
- Try modifying thread block dimensions, and observe the impact on coalesced access
Day 3-4: Double Buffering Optimization
Topics:
- Compute and memory access overlap
- Double buffering technique
- Pipeline parallelism
Code Reading:
src/double_buffer_gemm.cu
Exercises:
- Analyze how double buffering hides memory latency
- Measure benefits of double buffering at different matrix sizes
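As a reference point for the first exercise, the sketch below shows one common form of double buffering: the next tile is fetched into registers while the current tile is being consumed, then stored into the other shared memory buffer. It is not the contents of double_buffer_gemm.cu, and it assumes square N x N matrices with N a multiple of the tile size.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

template <int TILE>
__global__ void double_buffer_gemm_sketch(const float* A, const float* B, float* C, int N) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];
    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    int num_tiles = N / TILE;

    // Prologue: stage tile 0 in buffer 0.
    As[0][ty][tx] = A[row * N + tx];
    Bs[0][ty][tx] = B[ty * N + col];
    __syncthreads();

    float acc = 0.0f;
    for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        bool has_next = (t + 1 < num_tiles);   // uniform across the block
        float a_next = 0.0f, b_next = 0.0f;
        if (has_next) {                        // issue global loads for tile t+1 early
            a_next = A[row * N + (t + 1) * TILE + tx];
            b_next = B[((t + 1) * TILE + ty) * N + col];
        }
        // Compute on the current buffer while those loads are in flight.
        for (int k = 0; k < TILE; ++k) acc += As[cur][ty][k] * Bs[cur][k][tx];
        if (has_next) {                        // fill the other buffer
            As[nxt][ty][tx] = a_next;
            Bs[nxt][ty][tx] = b_next;
        }
        __syncthreads();  // one barrier per tile: nxt is written, cur is fully read
    }
    C[row * N + col] = acc;
}

// Hypothetical host helper.
template <int TILE>
std::vector<float> run_double_buffer_gemm(const std::vector<float>& A,
                                          const std::vector<float>& B, int N) {
    assert(N > 0 && N % TILE == 0);
    assert(A.size() == static_cast<size_t>(N) * N && B.size() == A.size());
    size_t bytes = A.size() * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    double_buffer_gemm_sketch<TILE><<<grid, block>>>(dA, dB, dC, N);
    std::vector<float> C(A.size());
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return C;
}
```

Compared with single-buffered tiling, this version uses twice the shared memory but needs only one barrier per tile, and the global loads no longer sit directly in front of the inner product loop.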
Day 5-7: Level 3-4 Optimization Practice
Practice:
```bash
# Compare performance
./build-release/benchmark --kernel=coalesced --kernel=double_buffer
```
Week 3: Advanced Optimization
Day 1-3: Register Blocking
Topics:
- Register blocking principles
- Arithmetic intensity and Roofline model
- Warp-level optimization
Code Reading:
src/optimized_gemm.cu - Register Blocked implementation
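Before reading the project file, it may help to see register blocking in its smallest form. In the sketch below each thread accumulates a 2 x 2 micro-tile in registers; the tile sizes and the names register_blocked_gemm_sketch and run_register_blocked_gemm are assumptions chosen for illustration and are likely smaller than what optimized_gemm.cu uses. M and N must be multiples of 32 and K a multiple of 16.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int BM = 32, BN = 32, BK = 16;  // block tile sizes (illustrative)
constexpr int TM = 2, TN = 2;             // per-thread micro-tile
static_assert(BM == BN, "the shared load loop below assumes equal tile element counts");

__global__ void register_blocked_gemm_sketch(const float* A, const float* B, float* C,
                                             int M, int N, int K) {
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];
    int tx = threadIdx.x, ty = threadIdx.y;   // launched with 16 x 16 threads
    int tid = ty * blockDim.x + tx;
    int nthreads = blockDim.x * blockDim.y;
    int block_row = blockIdx.y * BM, block_col = blockIdx.x * BN;
    float acc[TM][TN] = {};

    for (int t = 0; t < K; t += BK) {
        // Threads cooperatively load both tiles (2 elements of each per thread).
        for (int i = tid; i < BM * BK; i += nthreads) {
            As[i / BK][i % BK] = A[(block_row + i / BK) * K + t + i % BK];
            Bs[i / BN][i % BN] = B[(t + i / BN) * N + block_col + i % BN];
        }
        __syncthreads();
        for (int k = 0; k < BK; ++k) {
            float a_reg[TM], b_reg[TN];       // TM + TN shared loads ...
            for (int m = 0; m < TM; ++m) a_reg[m] = As[ty * TM + m][k];
            for (int n = 0; n < TN; ++n) b_reg[n] = Bs[k][tx * TN + n];
            for (int m = 0; m < TM; ++m)      // ... feed TM * TN multiply-adds
                for (int n = 0; n < TN; ++n) acc[m][n] += a_reg[m] * b_reg[n];
        }
        __syncthreads();
    }
    for (int m = 0; m < TM; ++m)
        for (int n = 0; n < TN; ++n)
            C[(block_row + ty * TM + m) * N + block_col + tx * TN + n] = acc[m][n];
}

// Hypothetical host helper.
std::vector<float> run_register_blocked_gemm(const std::vector<float>& A,
                                             const std::vector<float>& B,
                                             int M, int N, int K) {
    assert(M % BM == 0 && N % BN == 0 && K % BK == 0);
    assert(A.size() == static_cast<size_t>(M) * K && B.size() == static_cast<size_t>(K) * N);
    std::vector<float> C(static_cast<size_t>(M) * N);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(BN / TN, BM / TM), grid(N / BN, M / BM);
    register_blocked_gemm_sketch<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return C;
}
```

The point of the micro-tile is reuse: each value read from shared memory into a register takes part in several multiply-adds instead of one.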
Exercises:
- Calculate arithmetic intensity of Register Blocked GEMM
- Use Roofline model to analyze performance upper bound
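Both exercises reduce to a few formulas, sketched here as host code. The function names are illustrative, and the model assumes 4-byte floats and, for the ideal case, that A, B, and C each cross the memory bus exactly once.

```cuda
#include <algorithm>
#include <cassert>

// Roofline bound: attainable GFLOP/s = min(peak compute, bandwidth * intensity).
// peak_gflops in GFLOP/s, bandwidth_gbs in GB/s, intensity in FLOP per byte.
double roofline_gflops(double peak_gflops, double bandwidth_gbs, double intensity) {
    assert(intensity >= 0.0);
    return std::min(peak_gflops, bandwidth_gbs * intensity);
}

// Best-case GEMM intensity: 2*M*N*K FLOPs over 4*(M*K + K*N + M*N) bytes.
double gemm_ideal_intensity(double M, double N, double K) {
    return 2.0 * M * N * K / (4.0 * (M * K + K * N + M * N));
}

// Inner-loop intensity of a TM x TN register micro-tile: per k step,
// TM + TN floats are loaded and 2*TM*TN FLOPs are performed.
double register_tile_intensity(int TM, int TN) {
    assert(TM > 0 && TN > 0);
    return 2.0 * TM * TN / (4.0 * (TM + TN));
}
```

For example, with a hypothetical peak of 10000 GFLOP/s and 500 GB/s of bandwidth, a kernel at 4 FLOP/byte is bounded at 2000 GFLOP/s, so it is memory bound; substitute the figures for your own GPU.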
Day 4-5: Operator Fusion
Topics:
- Benefits of operator fusion
- GEMM + Bias + ReLU fusion
- Reducing kernel launch overhead
Code Reading:
src/fused_gemm.cu
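The idea of fusion can be shown on an unoptimized kernel: the bias add and ReLU are applied to the accumulator before the single write to C, instead of in two extra kernels that would each re-read and re-write C. This sketch is not the contents of fused_gemm.cu, and it assumes one bias value per output column.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// C = ReLU(A * B + bias), with A (M x K), B (K x N), bias (N), all row-major.
__global__ void gemm_bias_relu_sketch(const float* A, const float* B, const float* bias,
                                      float* C, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) acc += A[row * K + k] * B[k * N + col];
    acc += bias[col];                   // fused bias add (assumed per-column)
    C[row * N + col] = fmaxf(acc, 0.0f);  // fused ReLU, one global write in total
}

// Hypothetical host helper.
std::vector<float> run_gemm_bias_relu(const std::vector<float>& A, const std::vector<float>& B,
                                      const std::vector<float>& bias, int M, int N, int K) {
    assert(A.size() == static_cast<size_t>(M) * K && B.size() == static_cast<size_t>(K) * N);
    assert(bias.size() == static_cast<size_t>(N));
    std::vector<float> C(static_cast<size_t>(M) * N);
    float *dA, *dB, *dBias, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dBias, bias.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dBias, bias.data(), bias.size() * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
    gemm_bias_relu_sketch<<<grid, block>>>(dA, dB, dBias, dC, M, N, K);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dBias); cudaFree(dC);
    return C;
}
```

The same epilogue can be attached to any of the faster GEMM variants, since it only touches the accumulator just before the store.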
Day 6-7: Vectorized Loading
Topics:
- float4 vectorized loading
- Memory alignment requirements
- Vectorization and performance
Code Reading:
src/vectorized_gemm.cu
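To isolate the float4 idea from the GEMM logic, the sketch below scales an array using 16-byte loads and stores, with a scalar path for the leftover elements. It is a standalone illustration rather than code from vectorized_gemm.cu, and it relies on cudaMalloc returning suitably aligned pointers.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// out[i] = in[i] * s. Each thread moves one float4 (four floats) per access.
// Requires `in` and `out` to be 16-byte aligned.
__global__ void scale_vec4(const float* in, float* out, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index in float4 units
    int n4 = n / 4;                                 // number of whole float4 groups
    if (i < n4) {
        float4 v = reinterpret_cast<const float4*>(in)[i];
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        reinterpret_cast<float4*>(out)[i] = v;
    }
    int rem = n - n4 * 4;                           // 0..3 leftover elements
    if (i < rem) {                                  // first few threads handle the tail
        int j = n4 * 4 + i;
        out[j] = in[j] * s;
    }
}

// Hypothetical host helper.
std::vector<float> run_scale_vec4(const std::vector<float>& in, float s) {
    assert(!in.empty());
    int n = static_cast<int>(in.size());
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes); cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (n / 4) / threads + 1;  // at least one block so tail threads exist
    scale_vec4<<<blocks, threads>>>(d_in, d_out, n, s);
    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    return out;
}
```

In a GEMM the same cast only works when the row start addresses are 16-byte aligned, which is why the leading dimensions usually need to be multiples of 4 floats.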
Week 4: Engineering Practice
Day 1-2: Performance Analysis Tools
Topics:
- Nsight Compute usage
- Nsight Systems usage
- nvprof command-line tool (legacy; deprecated in favor of Nsight Systems and Nsight Compute)
Practice:
```bash
# Use Nsight Compute
ncu ./build-release/benchmark --kernel=vectorized
# Use Nsight Systems
nsys profile ./build-release/benchmark
```
Day 3-4: AutoTuner and Profiler
Topics:
- Auto-tuning principles
- Profiler design
- Parameter search strategies
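The simplest parameter search strategy is exhaustive: time every candidate and keep the fastest. The sketch below tunes only the threads-per-block value of a toy kernel; it does not reflect the interface of include/autotuner.h, and the names saxpy, time_saxpy_ms, and autotune_block_size are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <limits>
#include <vector>

// Toy kernel to tune: y = a * x + y.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Average milliseconds per launch for one block size, measured with CUDA events.
float time_saxpy_ms(int threads, const float* x, float* y, int n, int iters) {
    int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    saxpy<<<blocks, threads>>>(2.0f, x, y, n);   // warm-up, not timed
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) saxpy<<<blocks, threads>>>(2.0f, x, y, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms / iters;
}

// Exhaustive search: returns the candidate block size with the lowest measured time.
int autotune_block_size(const std::vector<int>& candidates, int n) {
    assert(!candidates.empty() && n > 0);
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float)); cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float)); cudaMemset(y, 0, n * sizeof(float));
    int best = candidates[0];
    float best_ms = std::numeric_limits<float>::max();
    for (int threads : candidates) {
        float ms = time_saxpy_ms(threads, x, y, n, 20);
        if (ms < best_ms) { best_ms = ms; best = threads; }
    }
    cudaFree(x); cudaFree(y);
    return best;
}
```

A call such as autotune_block_size({64, 128, 256, 512}, 1 << 22) returns whichever candidate measured fastest on the current GPU; the winner can change between devices and problem sizes, which is the reason to tune at all.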
Code Reading:
- include/autotuner.h
- include/profiler.h
Day 5-7: Complete Project Practice
Practice:
- Implement a new kernel variant
- Use AutoTuner to find optimal parameters
- Add unit tests
- Submit PR
Further Reading
Papers
Volkov, Vasily. "Better performance at lower occupancy." GTC 2010.
- Classic paper on register blocking and warp-level optimization
Hong, Sunpyo, and Hyesoon Kim. "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness." ISCA 2009.
- GPU performance analysis model
Li, Shengen, et al. "The roofline model and performance analysis of GPU computing." IPDPS 2019.
- Roofline model applied to GPUs
Projects
- CUTLASS - NVIDIA's CUDA template library
- cuBLAS - NVIDIA's BLAS library
- FlashAttention - Attention mechanism optimization
Books
- Programming Massively Parallel Processors - David Kirk, Wen-mei Hwu
- CUDA C Best Practices Guide - NVIDIA official guide