Learning Path

This document provides a 4-week learning plan for CUDA GEMM optimization, with daily tasks and exercises.

Learning Objectives

After completing this learning path, you will be able to:

Understand GPU architecture and the CUDA programming model
Apply the seven levels of GEMM optimization covered in this project
Analyze and optimize CUDA kernel performance
Read production-grade GEMM code such as CUTLASS, and relate it to the behavior of closed-source libraries such as cuBLAS

Week 1: CUDA Fundamentals

Day 1-2: Environment Setup & CUDA Programming Model

Topics:

CUDA Toolkit installation and environment configuration
CUDA programming model: host, device, kernel
Thread hierarchy: grid, block, thread
Memory hierarchy: global, shared, register, local
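
To make the grid/block/thread hierarchy concrete, here is a small Python model of how a 2D launch might map threads to elements of an output matrix. It mirrors the common `blockIdx * blockDim + threadIdx` idiom. The function names and the 16x16 block shape are illustrative assumptions and are not taken from `src/naive_matmul.cu`.

```python
def grid_dims(m, n, block_x=16, block_y=16):
    """Blocks needed to cover an m x n output, using ceiling division."""
    return ((n + block_x - 1) // block_x, (m + block_y - 1) // block_y)


def global_index(block_idx, thread_idx, block_dim):
    """Row/column a thread would own: blockIdx * blockDim + threadIdx."""
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    return row, col


def covered_elements(m, n, block_dim=(16, 16)):
    """Enumerate every in-bounds (row, col) touched by the launch."""
    gx, gy = grid_dims(m, n, *block_dim)
    seen = set()
    for by in range(gy):
        for bx in range(gx):
            for ty in range(block_dim[1]):
                for tx in range(block_dim[0]):
                    row, col = global_index((bx, by), (tx, ty), block_dim)
                    if row < m and col < n:  # bounds guard, as a kernel would need
                        seen.add((row, col))
    return seen
```

For a 100x70 output, `grid_dims(100, 70)` gives a 5x7 grid, and the bounds guard is what keeps the extra threads in the edge blocks from writing outside the matrix.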

Practice:

```bash
# Verify CUDA environment
nvcc --version
nvidia-smi

# Build project
cmake --preset default
cmake --build --preset default
```

Code Reading:

src/naive_matmul.cu - Basic CUDA kernel
include/common.h - CUDA error handling macros

Exercises:

Modify naive_matmul.cu so that each thread computes two output elements, and observe how performance changes
Use cudaGetDeviceProperties to query your GPU's thread block limits and warp size

Day 3-4: Memory Hierarchy

Topics:

Global memory latency and throughput
Shared memory usage and bank conflicts
Register usage and spilling
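
As a rough aid for reasoning about bank conflicts, the sketch below models shared memory as 32 banks of 4-byte words, which is an assumption that matches current NVIDIA GPUs but should be checked for your device. It reports how many distinct addresses land in the busiest bank for one warp-wide access. The helper names are hypothetical and the tile layout is not taken from `src/tiled_gemm.cu`.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumed: 32 banks, each 4 bytes wide


def conflict_degree(word_addresses):
    """Largest number of distinct words that map to one bank.

    1 means conflict-free; N means the access is replayed N times.
    Lanes reading the same word are broadcast and do not conflict.
    """
    banks = defaultdict(set)
    for addr in word_addresses:
        banks[addr % NUM_BANKS].add(addr)
    return max(len(words) for words in banks.values())


def column_access(row_pitch, column=0, warp_size=32):
    """Word addresses when lane i reads tile[i][column]."""
    return [lane * row_pitch + column for lane in range(warp_size)]


def row_access(row_pitch, row=0, warp_size=32):
    """Word addresses when lane i reads tile[row][i]."""
    return [row * row_pitch + lane for lane in range(warp_size)]
```

Under this model, reading a column of a `float tile[32][32]` gives a degree of 32, while padding the row pitch to 33 words brings it back to 1. This is the usual motivation for the `[32][33]` padding trick.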

Code Reading:

src/tiled_gemm.cu - Shared memory tiling
src/coalesced_gemm.cu - Coalesced memory access

Exercises:

Modify BLOCK_SIZE in tiled_gemm.cu and observe how performance changes
Use Nsight Compute to analyze bank conflicts

Day 5-7: Level 1-2 Optimization Practice

Topics:

Naive GEMM performance bottleneck analysis
Tiled GEMM optimization principles
Performance measurement methods
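
One common way to report GEMM performance is to convert a measured kernel time into GFLOP/s using the 2·M·N·K operation count (one multiply and one add per inner-product term). The sketch below assumes that convention and that the timings come from something like CUDA events. The function names and the warm-up count are illustrative and do not describe the project's benchmark harness.

```python
import statistics


def gemm_flops(m, n, k):
    """Floating-point operations in C = A * B (multiply + add per term)."""
    return 2 * m * n * k


def gflops(m, n, k, elapsed_ms):
    """Achieved throughput in GFLOP/s for one kernel run."""
    return gemm_flops(m, n, k) / (elapsed_ms * 1e-3) / 1e9


def summarize(timings_ms, warmup=2):
    """Median of the runs left after discarding the first `warmup` samples."""
    steady = timings_ms[warmup:]
    return statistics.median(steady)
```

Discarding the first few runs and taking the median is one way to reduce the influence of one-off costs such as context creation and clock ramp-up. Other summaries, such as the minimum or mean, are equally defensible as long as they are reported consistently.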

Practice:

```bash
# Run benchmark
./build-release/benchmark --kernel=naive --kernel=tiled
```

Exercises:

Calculate the theoretical memory throughput for Naive GEMM and compare it with the measured values
Analyze how much Tiled GEMM reduces global memory accesses compared with Naive GEMM
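
As a starting point for these exercises, the following sketch counts global memory loads under a simple model that ignores caches: a naive kernel reads K elements of A and K elements of B per output element, while a tiled kernel with a square `tile` x `tile` block loads each tile of A and B once per block per step. The function names are hypothetical, and the model assumes matrix dimensions that are multiples of the tile size.

```python
def naive_loads(m, n, k):
    """Element loads from global memory when every thread reads its own row and column."""
    return 2 * m * n * k


def tiled_loads(m, n, k, tile):
    """Element loads when each block stages tile x tile pieces of A and B in shared memory."""
    if m % tile or n % tile or k % tile:
        raise ValueError("this simple model assumes dimensions divisible by tile")
    blocks = (m // tile) * (n // tile)
    steps = k // tile
    return blocks * steps * 2 * tile * tile


def min_time_ms(loads, bandwidth_gb_s, elem_bytes=4):
    """Lower bound on runtime if the kernel were limited only by memory bandwidth."""
    return loads * elem_bytes / (bandwidth_gb_s * 1e9) * 1e3
```

In this model the tiled count simplifies to 2·M·N·K / tile, so the reduction factor equals the tile size. Real kernels will deviate because of L1/L2 caching and boundary handling.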

Week 2: Intermediate Optimization

Day 1-2: Coalesced Access Optimization

Topics:

Coalesced access principles
Memory transactions and alignment
Bank conflict avoidance
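
To illustrate why access stride and alignment matter, the sketch below counts how many 32-byte memory sectors a single warp touches for a given access pattern. The 32-byte sector size is an assumption that matches recent NVIDIA GPUs. The function names are hypothetical and the model ignores caching.

```python
SECTOR_BYTES = 32  # assumed sector granularity of global memory transactions


def sectors_touched(base_byte, stride_elems, elem_bytes=4, warp_size=32):
    """Distinct sectors accessed when lane i reads element i * stride_elems."""
    sectors = set()
    for lane in range(warp_size):
        first = base_byte + lane * stride_elems * elem_bytes
        last = first + elem_bytes - 1
        sectors.update(range(first // SECTOR_BYTES, last // SECTOR_BYTES + 1))
    return len(sectors)


def efficiency(base_byte, stride_elems, elem_bytes=4, warp_size=32):
    """Requested bytes divided by bytes actually moved."""
    moved = sectors_touched(base_byte, stride_elems, elem_bytes, warp_size) * SECTOR_BYTES
    return warp_size * elem_bytes / moved
```

With 4-byte floats, a unit-stride aligned access touches 4 sectors, a stride of 8 elements touches 32 sectors (12.5% efficiency), and shifting the base address by 4 bytes costs one extra sector.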

Code Reading:

src/coalesced_gemm.cu

Exercises:

Use Nsight Compute to analyze memory throughput of coalesced_gemm.cu
Try modifying the thread block dimensions and observe the impact on coalesced access

Day 3-4: Double Buffering Optimization

Topics:

Compute and memory access overlap
Double buffering technique
Pipeline parallelism
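
The overlap idea can be sketched with an idealized timing model: without double buffering each tile pays for its load and its compute in sequence, whereas with two buffers the load of tile i+1 runs while tile i is being computed. The sketch assumes perfect overlap and constant per-tile costs, and its function names are hypothetical rather than taken from `src/double_buffer_gemm.cu`.

```python
def serial_time(num_tiles, load, compute):
    """Every tile is loaded, then computed, with no overlap."""
    return num_tiles * (load + compute)


def double_buffered_time(num_tiles, load, compute):
    """First load is exposed; afterwards each step costs max(load, compute)."""
    if num_tiles == 0:
        return 0
    return load + (num_tiles - 1) * max(load, compute) + compute


def schedule(num_tiles):
    """Per step: (buffer computed from, buffer being prefetched into or None)."""
    steps = []
    for i in range(num_tiles):
        compute_buf = i % 2
        load_buf = (i + 1) % 2 if i + 1 < num_tiles else None
        steps.append((compute_buf, load_buf))
    return steps
```

With 8 tiles, a load cost of 4 and a compute cost of 6 (arbitrary units), the model gives 80 for the serial version and 52 for the double-buffered one. The benefit shrinks as the tile count gets small, because the first load and the last compute can never be hidden.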

Code Reading:

src/double_buffer_gemm.cu

Exercises:

Analyze how double buffering hides memory latency
Measure benefits of double buffering at different matrix sizes

Day 5-7: Level 3-4 Optimization Practice

Practice:

```bash
# Compare performance
./build-release/benchmark --kernel=coalesced --kernel=double_buffer
```

Week 3: Advanced Optimization

Day 1-3: Register Blocking

Topics:

Register blocking principles
Arithmetic intensity and Roofline model
Warp-level optimization

Code Reading:

src/optimized_gemm.cu - Register Blocked implementation

Exercises:

Calculate arithmetic intensity of Register Blocked GEMM
Use Roofline model to analyze performance upper bound
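
A minimal sketch for these two exercises follows. It assumes FP32 data and a block tile of `bm` x `bn` outputs, where each step along K loads `bm` elements of A and `bn` elements of B from global memory and performs `2 * bm * bn` floating-point operations. It ignores the traffic for writing C. The peak throughput and bandwidth figures used in the usage note are placeholders, not measurements of any particular GPU.

```python
def arithmetic_intensity(bm, bn, elem_bytes=4):
    """FLOPs per byte of global memory traffic for one step along K."""
    flops = 2 * bm * bn
    bytes_moved = (bm + bn) * elem_bytes
    return flops / bytes_moved


def roofline_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Attainable throughput: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, intensity * bandwidth_gb_s)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Intensity above which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gb_s
```

With a hypothetical device offering 10000 GFLOP/s and 500 GB/s, the ridge point is 20 FLOP/byte. A 16x16 tile (intensity 4) would be capped at 2000 GFLOP/s by bandwidth, while a 128x128 tile (intensity 32) would sit under the compute roof.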

Day 4-5: Operator Fusion

Topics:

Benefits of operator fusion
GEMM + Bias + ReLU fusion
Reducing kernel launch overhead
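
The benefit of fusing the bias add and ReLU into the GEMM epilogue can be estimated by counting how often the output matrix crosses global memory. The sketch below assumes three separate kernels in the unfused case (GEMM, bias add, ReLU) and a per-column bias vector. The function names are hypothetical, and the reference epilogue only shows the arithmetic, not how `src/fused_gemm.cu` implements it.

```python
def unfused_output_traffic(m, n):
    """Elements moved for C: GEMM write, bias read+write, ReLU read+write, plus the bias vector."""
    return m * n + 2 * m * n + n + 2 * m * n


def fused_output_traffic(m, n):
    """Elements moved when bias and ReLU are applied before the single write of C."""
    return m * n + n


def fused_epilogue(acc, bias):
    """Reference result: relu(acc[i][j] + bias[j]) applied to an accumulator tile."""
    return [[max(0.0, value + bias[j]) for j, value in enumerate(row)] for row in acc]
```

For a 1024x1024 output this model gives roughly a fivefold reduction in output-side traffic, in addition to saving two kernel launches.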

Code Reading:

src/fused_gemm.cu

Day 6-7: Vectorized Loading

Topics:

float4 vectorized loading
Memory alignment requirements
Vectorization and performance
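
A `float4` load moves 16 bytes at once and requires a 16-byte-aligned address. The sketch below checks whether every row of a row-major FP32 matrix satisfies that requirement, and splits a run of floats into vector loads plus a scalar tail. The helper names are hypothetical, and the addresses in the usage note are made-up values for illustration.

```python
FLOAT_BYTES = 4
VEC_WIDTH = 4                          # floats per float4
VEC_BYTES = FLOAT_BYTES * VEC_WIDTH    # 16-byte alignment requirement


def can_use_float4(base_address, leading_dim, col_offset=0):
    """True if every row start (plus col_offset) falls on a 16-byte boundary."""
    row_pitch_bytes = leading_dim * FLOAT_BYTES
    start = base_address + col_offset * FLOAT_BYTES
    return start % VEC_BYTES == 0 and row_pitch_bytes % VEC_BYTES == 0


def split_loads(num_floats):
    """(number of float4 loads, number of leftover scalar loads)."""
    return num_floats // VEC_WIDTH, num_floats % VEC_WIDTH
```

For example, a matrix at address `0x1000` with a leading dimension of 1024 passes the check, while a leading dimension of 1022 fails because alternate rows start 8 bytes off a 16-byte boundary. A row of 1030 floats splits into 257 vector loads and 2 scalar loads.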

Code Reading:

src/vectorized_gemm.cu

Week 4: Engineering Practice

Day 1-2: Performance Analysis Tools

Topics:

Nsight Compute usage
Nsight Systems usage
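nvprof command-line tool (legacy; deprecated in favor of Nsight Compute and Nsight Systems, and not supported on GPUs with compute capability 8.0 or higher)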

Practice:

```bash
# Use Nsight Compute
ncu ./build-release/benchmark --kernel=vectorized

# Use Nsight Systems
nsys profile ./build-release/benchmark
```

Day 3-4: AutoTuner and Profiler

Topics:

Auto-tuning principles
Profiler design
Parameter search strategies
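
The simplest parameter search strategy is an exhaustive grid search: enumerate every combination, skip the ones that violate a resource constraint, time the rest, and keep the fastest. The sketch below is a generic illustration of that idea. Its names (`grid_search`, `measure`, `is_valid`) are hypothetical and do not describe the interface in `include/autotuner.h`.

```python
import itertools


def grid_search(space, measure, is_valid=lambda cfg: True):
    """Return (best_config, best_time) over the Cartesian product of `space`.

    space:    dict mapping parameter name -> list of candidate values
    measure:  callable taking a config dict and returning a time (lower is better)
    is_valid: callable rejecting configs that break a constraint
    """
    best_cfg, best_time = None, float("inf")
    keys = sorted(space)
    for values in itertools.product(*(space[key] for key in keys)):
        cfg = dict(zip(keys, values))
        if not is_valid(cfg):
            continue
        elapsed = measure(cfg)
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time
```

In practice `measure` would launch and time a kernel, and `is_valid` would encode limits such as shared memory per block or registers per thread. Grid search scales poorly with the number of parameters, which is why random or model-guided search is often considered for larger spaces.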

Code Reading:

include/autotuner.h
include/profiler.h

Day 5-7: Complete Project Practice

Practice:

Implement a new kernel variant
Use AutoTuner to find optimal parameters
Add unit tests
Submit a pull request
