Learning Path

This document provides a 4-week learning plan for CUDA GEMM optimization, with daily tasks and exercises.

Learning Objectives

After completing this learning path, you will be able to:

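  • Understand GPU architecture and the CUDA programming model
  • Apply the seven levels of GEMM optimization covered here: naive, tiled, coalesced, double-buffered, register-blocked, fused, and vectorized
  • Analyze and optimize CUDA kernel performance
  • Read production-grade code such as CUTLASS, and reason about how closed-source libraries such as cuBLAS reach their performance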

Week 1: CUDA Fundamentals

Day 1-2: Environment Setup & CUDA Programming Model

Topics:

  • CUDA Toolkit installation and environment configuration
  • CUDA programming model: host, device, kernel
  • Thread hierarchy: grid, block, thread
  • Memory hierarchy: global, shared, register, local
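
The host/device/kernel split and the grid/block/thread hierarchy listed above can be sketched with a vector add. This is a minimal illustration, not code from this project: the names vec_add and run_vec_add are hypothetical, and error checking is mostly omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel: compiled for the device, launched from the host.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    // Global index = block offset within the grid + thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the last block may be partially full
}

// Hypothetical host helper: returns true if the device result matches the CPU sum.
bool run_vec_add(int n) {
    std::vector<float> ha(n), hb(n), hc(n, -1.0f);
    for (int i = 0; i < n; ++i) { ha[i] = float(i); hb[i] = 2.0f * float(i); }
    size_t bytes = size_t(n) * sizeof(float);

    // Global memory lives on the device and is managed from the host.
    float *da = nullptr, *db = nullptr, *dc = nullptr;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);

    int block = 256;                     // threads per block
    int grid = (n + block - 1) / block;  // blocks per grid, rounded up
    vec_add<<<grid, block>>>(da, db, dc, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < n; ++i)
        if (hc[i] != ha[i] + hb[i]) return false;
    return true;
}
```

Calling run_vec_add(1000) from main launches 4 blocks of 256 threads; the 24 surplus threads in the last block are masked by the bounds check.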

Practice:

bash
# Verify CUDA environment
nvcc --version
nvidia-smi

# Build project
cmake --preset default
cmake --build --preset default

Code Reading:

  • src/naive_matmul.cu - Basic CUDA kernel
  • include/common.h - CUDA error handling macros
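
Before opening the two files, it may help to see what such code typically looks like. The sketch below is an assumption about the general shape, not the contents of src/naive_matmul.cu or include/common.h: the macro name CUDA_CHECK, the kernel signature, and the harness naive_matmul_max_error are all hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical error-handling macro: abort with file and line on any CUDA error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// One thread per output element; row-major A (MxK), B (KxN), C (MxN).
__global__ void naive_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Hypothetical harness: runs the kernel and returns the max abs error vs. a CPU loop.
float naive_matmul_max_error(int M, int N, int K) {
    std::vector<float> A(size_t(M) * K), B(size_t(K) * N), C(size_t(M) * N, -1.0f);
    for (size_t i = 0; i < A.size(); ++i) A[i] = float(i % 3);  // small integers keep
    for (size_t i = 0; i < B.size(); ++i) B[i] = float(i % 5);  // the float sums exact
    float *dA, *dB, *dC;
    CUDA_CHECK(cudaMalloc(&dA, A.size() * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&dB, B.size() * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&dC, C.size() * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice));
    dim3 block(16, 16);
    dim3 grid((N + 15) / 16, (M + 15) / 16);
    naive_matmul<<<grid, block>>>(dA, dB, dC, M, N, K);
    CUDA_CHECK(cudaGetLastError());  // catches invalid launch configurations
    CUDA_CHECK(cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dA)); CUDA_CHECK(cudaFree(dB)); CUDA_CHECK(cudaFree(dC));
    float max_err = 0.0f;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += A[i * K + k] * B[k * N + j];
            max_err = std::fmax(max_err, std::fabs(ref - C[i * N + j]));
        }
    return max_err;
}
```

Wrapping every runtime call in a macro like this turns silent failures into immediate, located errors, which matters because kernel launches themselves return nothing.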

Exercises:

  1. Modify naive_matmul.cu so that each thread computes 2 output elements, and observe how performance changes
  2. Use cudaGetDeviceProperties to query your GPU's thread block and warp limits
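
As a starting point for the second exercise, the following sketch prints a few cudaDeviceProp fields for device 0. The function name print_device_limits is hypothetical, and the choice of fields is only a suggestion.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Prints block and warp limits of device 0.
// Returns the warp size, or -1 if the query fails (for example, no GPU present).
int print_device_limits() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    std::printf("name:                %s\n", prop.name);
    std::printf("compute capability:  %d.%d\n", prop.major, prop.minor);
    std::printf("warpSize:            %d\n", prop.warpSize);
    std::printf("maxThreadsPerBlock:  %d\n", prop.maxThreadsPerBlock);
    std::printf("maxThreadsDim:       %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("sharedMemPerBlock:   %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("regsPerBlock:        %d\n", prop.regsPerBlock);
    std::printf("multiProcessorCount: %d\n", prop.multiProcessorCount);
    return prop.warpSize;
}
```

The shared memory and register limits reported here bound the tile sizes you can choose in the later GEMM levels.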

Day 3-4: Memory Hierarchy

Topics:

  • Global memory latency and throughput
  • Shared memory usage and bank conflicts
  • Register usage and spilling
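
Bank conflicts are easiest to see in a shared-memory transpose. In the sketch below (hypothetical names, square matrices only, 32x32 thread blocks assumed), the tile is read column-wise; without the extra padding column, all 32 threads of a warp would hit the same bank.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;

// Transposes an n x n row-major matrix through a padded shared-memory tile.
__global__ void transpose_padded(const float* in, float* out, int n) {
    // TILE + 1 columns: consecutive rows start in different banks, so the
    // column-wise read below touches 32 distinct banks instead of one.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Swap the block coordinates so that global writes stay coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n) out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host helper: returns true if out equals the transpose of in.
bool run_transpose(int n) {
    std::vector<float> h_in(size_t(n) * n), h_out(size_t(n) * n, -1.0f);
    for (size_t i = 0; i < h_in.size(); ++i) h_in[i] = float(i);
    size_t bytes = h_in.size() * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes); cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    transpose_padded<<<grid, block>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            if (h_out[size_t(r) * n + c] != h_in[size_t(c) * n + r]) return false;
    return true;
}
```

Changing the declaration to tile[TILE][TILE] keeps the result correct but introduces the conflicts, which makes it a convenient before/after pair for a profiler.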

Code Reading:

  • src/tiled_gemm.cu - Shared memory tiling
  • src/coalesced_gemm.cu - Coalesced memory access
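
For orientation while reading tiled_gemm.cu, here is a minimal sketch of shared-memory tiling. It is not the project's implementation: only the name BLOCK_SIZE is taken from the exercises in this section, and the kernel signature and harness are assumptions. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int BLOCK_SIZE = 16;

// Each block computes a BLOCK_SIZE x BLOCK_SIZE tile of C, staging tiles of A and B
// in shared memory so each global element is loaded once per block, not once per thread.
__global__ void tiled_gemm(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < K; t += BLOCK_SIZE) {
        int a_col = t + threadIdx.x;
        int b_row = t + threadIdx.y;
        // Out-of-range elements are zero-filled so edge tiles need no special case.
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it
        for (int k = 0; k < BLOCK_SIZE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load overwrites
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical harness: max abs error of the kernel vs. a CPU triple loop.
float tiled_gemm_max_error(int M, int N, int K) {
    std::vector<float> A(size_t(M) * K), B(size_t(K) * N), C(size_t(M) * N, -1.0f);
    for (size_t i = 0; i < A.size(); ++i) A[i] = float(i % 3);
    for (size_t i = 0; i < B.size(); ++i) B[i] = float(i % 5);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid((N + BLOCK_SIZE - 1) / BLOCK_SIZE, (M + BLOCK_SIZE - 1) / BLOCK_SIZE);
    tiled_gemm<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    float max_err = 0.0f;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += A[i * K + k] * B[k * N + j];
            max_err = std::fmax(max_err, std::fabs(ref - C[i * N + j]));
        }
    return max_err;
}
```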

Exercises:

  1. Modify BLOCK_SIZE in tiled_gemm.cu and observe how performance changes
  2. Use Nsight Compute to analyze bank conflicts

Day 5-7: Level 1-2 Optimization Practice

Topics:

  • Naive GEMM performance bottleneck analysis
  • Tiled GEMM optimization principles
  • Performance measurement methods
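
One common way to measure kernel time is with CUDA events, after a few warm-up launches. The sketch below is an assumption about how such a measurement could look, not the project's benchmark code; time_ms, gemm_gflops, scale, and demo_time_scale_ms are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel to time; any kernel launch can be passed to time_ms instead.
__global__ void scale(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Average milliseconds per launch, measured on the device with CUDA events.
// Returns a negative value if the measurement fails.
template <typename Launch>
float time_ms(Launch launch, int warmup = 3, int iters = 20) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int i = 0; i < warmup; ++i) launch();  // exclude one-time startup costs
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                 // wait until all launches finished
    float ms = -1.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err == cudaSuccess ? ms / iters : -1.0f;
}

// GEMM performs 2*M*N*K floating-point operations (one multiply and one add per term).
double gemm_gflops(int M, int N, int K, float ms) {
    return 2.0 * M * N * K / (ms * 1e6);
}

// Hypothetical usage: time the scale kernel on n floats.
float demo_time_scale_ms(int n) {
    float* d = nullptr;
    if (cudaMalloc(&d, size_t(n) * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, size_t(n) * sizeof(float));
    int block = 256, grid = (n + block - 1) / block;
    float ms = time_ms([&] { scale<<<grid, block>>>(d, 1.0f, n); });
    cudaFree(d);
    return ms;
}
```

Host-side wall-clock timers are a poor substitute here because kernel launches are asynchronous; events are recorded in the same stream as the kernels they bracket.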

Practice:

bash
# Run benchmark
./build-release/benchmark --kernel=naive --kernel=tiled

Exercises:

  1. Calculate the theoretical memory throughput for Naive GEMM and compare it with measured values
  2. Estimate how much global memory traffic Tiled GEMM saves relative to Naive GEMM
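
A first-order count can anchor both exercises. The sketch below assumes an idealized model: no cache hits, every thread computes one output element, and M, N, and K are multiples of the tile size T. The function names are hypothetical.

```cuda
// Host-only helpers; they compile as part of a .cu file or as plain C++.

// Naive GEMM: each of the M*N outputs reads K elements of A and K elements of B.
long long naive_global_loads(long long M, long long N, long long K) {
    return 2 * M * N * K;
}

// Tiled GEMM with T x T tiles: each block loads 2*T*T elements per K-step,
// with (M/T)*(N/T) blocks and K/T steps, giving 2*M*N*K / T loads in total.
long long tiled_global_loads(long long M, long long N, long long K, long long T) {
    return 2 * M * N * K / T;
}

// Bytes moved for float data, useful for converting a measured time into GB/s.
long long loads_to_bytes(long long loads) {
    return loads * static_cast<long long>(sizeof(float));
}
```

Under this model a 16 x 16 tile cuts global loads by a factor of 16. Measured savings are usually smaller, because the naive kernel already benefits from the L1 and L2 caches.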

Week 2: Intermediate Optimization

Day 1-2: Coalesced Access Optimization

Topics:

  • Coalesced access principles
  • Memory transactions and alignment
  • Bank conflict avoidance
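
The difference between coalesced and strided access comes down to which index threadIdx.x drives. The two hypothetical kernels below compute the same result on a row-major matrix; only the thread-to-element mapping differs.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced: adjacent threads in a warp touch adjacent columns of one row,
// so a warp's 32 loads fall into a few contiguous memory transactions.
__global__ void scale_coalesced(const float* in, float* out, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) out[row * cols + col] = 2.0f * in[row * cols + col];
}

// Strided: adjacent threads touch adjacent rows, cols floats apart,
// so each thread in a warp may need its own memory transaction.
__global__ void scale_strided(const float* in, float* out, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) out[row * cols + col] = 2.0f * in[row * cols + col];
}

// Hypothetical host helper: returns true if both kernels produce 2 * in.
bool run_scale_layouts(int rows, int cols) {
    size_t n = size_t(rows) * cols, bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_a(n, -1.0f), h_b(n, -1.0f);
    for (size_t i = 0; i < n; ++i) h_in[i] = float(i % 7) + 1.0f;
    float *d_in, *d_a, *d_b;
    cudaMalloc(&d_in, bytes); cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(32, 8);
    dim3 grid_c((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    dim3 grid_s((rows + block.x - 1) / block.x, (cols + block.y - 1) / block.y);
    scale_coalesced<<<grid_c, block>>>(d_in, d_a, rows, cols);
    scale_strided<<<grid_s, block>>>(d_in, d_b, rows, cols);
    cudaError_t e1 = cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;
    for (size_t i = 0; i < n; ++i)
        if (h_a[i] != 2.0f * h_in[i] || h_b[i] != 2.0f * h_in[i]) return false;
    return true;
}
```

Timing or profiling the two kernels on a large matrix shows the cost of the strided mapping even though the arithmetic is identical.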

Code Reading:

  • src/coalesced_gemm.cu

Exercises:

  1. Use Nsight Compute to analyze memory throughput of coalesced_gemm.cu
  2. Modify the thread block dimensions and observe the impact on coalesced access

Day 3-4: Double Buffering Optimization

Topics:

  • Compute and memory access overlap
  • Double buffering technique
  • Pipeline parallelism
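
The idea can be sketched with two shared-memory buffers: while the block computes on one, it writes the next tile into the other, which also removes one barrier per tile. This is a simplified illustration, not the contents of double_buffer_gemm.cu; all names are hypothetical. With plain loads like these, the actual overlap depends on how the compiler and hardware schedule them; production kernels often stage the prefetch through registers or asynchronous copies.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int TILE = 16;

// Loads tile number t of A and B into the given shared buffers, zero-filling edges.
__device__ void load_tile(const float* A, const float* B,
                          float (*As)[TILE], float (*Bs)[TILE],
                          int row, int col, int t, int M, int N, int K) {
    int a_col = t * TILE + threadIdx.x;
    int b_row = t * TILE + threadIdx.y;
    As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
    Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
}

__global__ void double_buffer_gemm(const float* A, const float* B, float* C,
                                   int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int num_tiles = (K + TILE - 1) / TILE;
    float acc = 0.0f;

    load_tile(A, B, As[0], Bs[0], row, col, 0, M, N, K);  // prologue: first tile
    __syncthreads();
    for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        // Prefetch the next tile into the idle buffer while computing on the current one.
        if (t + 1 < num_tiles)
            load_tile(A, B, As[nxt], Bs[nxt], row, col, t + 1, M, N, K);
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        __syncthreads();  // one barrier per tile: publishes nxt, retires cur
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical harness: max abs error of the kernel vs. a CPU triple loop.
float double_buffer_max_error(int M, int N, int K) {
    std::vector<float> A(size_t(M) * K), B(size_t(K) * N), C(size_t(M) * N, -1.0f);
    for (size_t i = 0; i < A.size(); ++i) A[i] = float(i % 3);
    for (size_t i = 0; i < B.size(); ++i) B[i] = float(i % 5);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE), grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    double_buffer_gemm<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    float max_err = 0.0f;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += A[i * K + k] * B[k * N + j];
            max_err = std::fmax(max_err, std::fabs(ref - C[i * N + j]));
        }
    return max_err;
}
```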

Code Reading:

  • src/double_buffer_gemm.cu

Exercises:

  1. Analyze how double buffering hides memory latency
  2. Measure benefits of double buffering at different matrix sizes

Day 5-7: Level 3-4 Optimization Practice

Practice:

bash
# Compare performance
./build-release/benchmark --kernel=coalesced --kernel=double_buffer

Week 3: Advanced Optimization

Day 1-3: Register Blocking

Topics:

  • Register blocking principles
  • Arithmetic intensity and Roofline model
  • Warp-level optimization

Code Reading:

  • src/optimized_gemm.cu - Register Blocked implementation
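
As a reference point while reading, the sketch below shows one possible form of register blocking: each thread keeps a TM x TN patch of C in registers and reuses every value it reads from shared memory TN or TM times. The tile sizes and all names are assumptions, not the parameters of optimized_gemm.cu.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int BM = 64, BN = 64, BK = 8;  // block tile (assumed sizes)
constexpr int TM = 4, TN = 4;            // per-thread register tile

// Launch with dim3 block(BN / TN, BM / TM), i.e. 16 x 16 threads.
__global__ void reg_blocked_gemm(const float* A, const float* B, float* C,
                                 int M, int N, int K) {
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;
    int block_row = blockIdx.y * BM, block_col = blockIdx.x * BN;
    float acc[TM][TN] = {};  // accumulators live in registers

    for (int t = 0; t < K; t += BK) {
        // Cooperative, zero-padded loads of the A and B tiles.
        for (int i = tid; i < BM * BK; i += nthreads) {
            int r = i / BK, c = i % BK, gr = block_row + r, gc = t + c;
            As[r][c] = (gr < M && gc < K) ? A[gr * K + gc] : 0.0f;
        }
        for (int i = tid; i < BK * BN; i += nthreads) {
            int r = i / BN, c = i % BN, gr = t + r, gc = block_col + c;
            Bs[r][c] = (gr < K && gc < N) ? B[gr * N + gc] : 0.0f;
        }
        __syncthreads();
        for (int k = 0; k < BK; ++k) {
            float a[TM], b[TN];
            for (int i = 0; i < TM; ++i) a[i] = As[threadIdx.y * TM + i][k];
            for (int j = 0; j < TN; ++j) b[j] = Bs[k][threadIdx.x * TN + j];
            // TM + TN shared-memory reads feed TM * TN multiply-adds.
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j) acc[i][j] += a[i] * b[j];
        }
        __syncthreads();
    }
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j) {
            int r = block_row + threadIdx.y * TM + i;
            int c = block_col + threadIdx.x * TN + j;
            if (r < M && c < N) C[r * N + c] = acc[i][j];
        }
}

// Hypothetical harness: max abs error of the kernel vs. a CPU triple loop.
float reg_blocked_max_error(int M, int N, int K) {
    std::vector<float> A(size_t(M) * K), B(size_t(K) * N), C(size_t(M) * N, -1.0f);
    for (size_t i = 0; i < A.size(); ++i) A[i] = float(i % 3);
    for (size_t i = 0; i < B.size(); ++i) B[i] = float(i % 5);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(BN / TN, BM / TM), grid((N + BN - 1) / BN, (M + BM - 1) / BM);
    reg_blocked_gemm<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    float max_err = 0.0f;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += A[i * K + k] * B[k * N + j];
            max_err = std::fmax(max_err, std::fabs(ref - C[i * N + j]));
        }
    return max_err;
}
```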

Exercises:

  1. Calculate arithmetic intensity of Register Blocked GEMM
  2. Use the Roofline model to estimate the performance upper bound
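
Both exercises reduce to two small formulas, sketched below as host helpers. The function names are hypothetical, and gemm_ideal_intensity assumes the best case in which A, B, and C each cross the memory bus exactly once; a real kernel's intensity is lower and depends on its tile sizes.

```cuda
#include <algorithm>

// Host-only helpers; they compile as part of a .cu file or as plain C++.

// Roofline: attainable GFLOP/s = min(peak compute, memory bandwidth * arithmetic intensity).
// peak_gflops in GFLOP/s, bandwidth_gbs in GB/s, intensity in FLOP/byte.
double attainable_gflops(double peak_gflops, double bandwidth_gbs, double intensity) {
    return std::min(peak_gflops, bandwidth_gbs * intensity);
}

// Ideal arithmetic intensity of single-precision GEMM in FLOP/byte:
// 2*M*N*K operations over 4 bytes for each element of A, B, and C.
double gemm_ideal_intensity(int M, int N, int K) {
    double flops = 2.0 * M * N * K;
    double bytes = 4.0 * (double(M) * K + double(K) * N + double(M) * N);
    return flops / bytes;
}
```

For example, with an assumed peak of 10000 GFLOP/s and 500 GB/s of bandwidth, a kernel at 4 FLOP/byte is memory-bound at 2000 GFLOP/s, while anything above 20 FLOP/byte is compute-bound. Substitute the figures for your own GPU.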

Day 4-5: Operator Fusion

Topics:

  • Benefits of operator fusion
  • GEMM + Bias + ReLU fusion
  • Reducing kernel launch overhead
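
Fusion here means applying bias and ReLU in the GEMM epilogue, while the accumulator is still in a register, instead of launching two more kernels that re-read C from global memory. The sketch below uses a simple tiled GEMM to show the epilogue; it is not the contents of fused_gemm.cu, and the names and the per-column bias layout are assumptions.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int TILE = 16;

// C = relu(A * B + bias), with bias holding one value per output column.
__global__ void fused_gemm_bias_relu(const float* A, const float* B, const float* bias,
                                     float* C, int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < K; t += TILE) {
        int a_col = t + threadIdx.x, b_row = t + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    // Fused epilogue: bias add and ReLU happen before the single store to C.
    if (row < M && col < N) C[row * N + col] = fmaxf(acc + bias[col], 0.0f);
}

// Hypothetical harness: max abs error vs. a CPU reference with the same epilogue.
float fused_max_error(int M, int N, int K) {
    std::vector<float> A(size_t(M) * K), B(size_t(K) * N), bias(N), C(size_t(M) * N, -1.0f);
    for (size_t i = 0; i < A.size(); ++i) A[i] = float(i % 3) - 1.0f;  // mix of signs so
    for (size_t i = 0; i < B.size(); ++i) B[i] = float(i % 5) - 2.0f;  // ReLU clamps some
    for (int j = 0; j < N; ++j) bias[j] = float(j % 4) - 2.0f;
    float *dA, *dB, *dBias, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dBias, bias.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dBias, bias.data(), bias.size() * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE), grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    fused_gemm_bias_relu<<<grid, block>>>(dA, dB, dBias, dC, M, N, K);
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dBias); cudaFree(dC);
    float max_err = 0.0f;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += A[i * K + k] * B[k * N + j];
            ref = std::fmax(ref + bias[j], 0.0f);
            max_err = std::fmax(max_err, std::fabs(ref - C[i * N + j]));
        }
    return max_err;
}
```

Compared with three separate kernels, the fused version saves two launches plus two full reads and two full writes of C.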

Code Reading:

  • src/fused_gemm.cu

Day 6-7: Vectorized Loading

Topics:

  • float4 vectorized loading
  • Memory alignment requirements
  • Vectorization and performance
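
A float4 access moves 16 bytes per instruction, provided the address is 16-byte aligned. The sketch below is a hypothetical illustration, not code from vectorized_gemm.cu; it relies on the fact that pointers returned by cudaMalloc are aligned to at least 256 bytes and handles the last n % 4 elements with scalar accesses.

```cuda
#include <cuda_runtime.h>
#include <vector>

// out = s * in, using 128-bit loads and stores for the bulk of the array.
// Both pointers must be 16-byte aligned (true for the start of a cudaMalloc buffer).
__global__ void scale_float4(const float* in, float* out, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n4 = n / 4;  // number of complete float4 groups
    if (i < n4) {
        float4 v = reinterpret_cast<const float4*>(in)[i];  // one 16-byte load
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        reinterpret_cast<float4*>(out)[i] = v;              // one 16-byte store
    }
    // Scalar tail: the first n % 4 threads handle the leftover elements.
    int tail = n4 * 4 + i;
    if (i < n - n4 * 4) out[tail] = s * in[tail];
}

// Hypothetical host helper: returns true if out == s * in for every element.
bool run_scale_float4(int n) {
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 11);
    size_t bytes = size_t(n) * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes); cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    int block = 256;
    int grid = (n / 4 + block) / block;  // at least one block, so the tail is covered
    scale_float4<<<grid, block>>>(d_in, d_out, 3.0f, n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 3.0f * h_in[i]) return false;
    return true;
}
```

In a GEMM kernel the same cast applies to tile loads, with the added requirement that row lengths keep every vector load aligned, for example by making K and N multiples of 4.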

Code Reading:

  • src/vectorized_gemm.cu

Week 4: Engineering Practice

Day 1-2: Performance Analysis Tools

Topics:

  • Nsight Compute usage
  • Nsight Systems usage
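  • nvprof command-line tool (legacy; it does not support GPUs with compute capability 8.0 or higher, where Nsight Compute and Nsight Systems replace it)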

Practice:

bash
# Use Nsight Compute
ncu ./build-release/benchmark --kernel=vectorized

# Use Nsight Systems
nsys profile ./build-release/benchmark

Day 3-4: AutoTuner and Profiler

Topics:

  • Auto-tuning principles
  • Profiler design
  • Parameter search strategies
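
The simplest parameter search strategy is exhaustive: time every candidate configuration and keep the fastest. The sketch below illustrates this for a single parameter, the block size of a stand-in kernel. It is not the interface of include/autotuner.h; tune_block_size and scale are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <cfloat>
#include <vector>

// Stand-in kernel whose block size is being tuned.
__global__ void scale(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Exhaustive search: returns the fastest candidate block size, or -1 if none ran.
int tune_block_size(int n, const std::vector<int>& candidates, int iters = 10) {
    float* d = nullptr;
    if (cudaMalloc(&d, size_t(n) * sizeof(float)) != cudaSuccess) return -1;
    cudaMemset(d, 0, size_t(n) * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int best = -1;
    float best_ms = FLT_MAX;
    for (int block : candidates) {
        int grid = (n + block - 1) / block;
        scale<<<grid, block>>>(d, 1.0f, n);             // warm-up launch
        // Invalid configurations (e.g. too many threads per block) fail here; skip them.
        if (cudaGetLastError() != cudaSuccess) continue;
        cudaEventRecord(start);
        for (int i = 0; i < iters; ++i) scale<<<grid, block>>>(d, 1.0f, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) continue;
        if (ms < best_ms) { best_ms = ms; best = block; }
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return best;
}
```

A call such as tune_block_size(1 << 22, {64, 128, 256, 512, 1024}) returns one of the candidates. For GEMM the search space is a product of several parameters (tile sizes, register tile sizes, vector width), which is where pruning or smarter search strategies start to pay off.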

Code Reading:

  • include/autotuner.h
  • include/profiler.h

Day 5-7: Complete Project Practice

Practice:

  1. Implement a new kernel variant
  2. Use AutoTuner to find optimal parameters
  3. Add unit tests
  4. Submit PR

Further Reading

Papers

  1. Volkov, Vasily. "Better performance at lower occupancy." GTC 2010.

    • Classic talk on register blocking and warp-level optimization
  2. Hong, Sunpyo, and Hyesoon Kim. "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness." ISCA 2009.

    • GPU performance analysis model
  3. Li, Shengen, et al. "The roofline model and performance analysis of GPU computing." IPDPS 2019.

    • Roofline model applied to GPUs

Projects

Books

  • Programming Massively Parallel Processors - David Kirk, Wen-mei Hwu
  • CUDA C++ Best Practices Guide - NVIDIA official guide

MIT License | CUDA GEMM optimization tutorial