# Learning Resources
A curated list of resources for learning CUDA programming and GPU kernel optimization.
## Official NVIDIA Resources

### Documentation
- CUDA C++ Programming Guide — The authoritative reference for CUDA programming
- CUDA Best Practices Guide — Optimization strategies and common pitfalls
- CUDA Profiling Tools Interface (CUPTI) — APIs for building custom profiling and tracing tools
### Libraries
- cuBLAS — Dense linear algebra
- cuDNN — Deep learning primitives
- cuSPARSE — Sparse linear algebra
- NCCL — Multi-GPU communication
### Tools
- Nsight Compute — Kernel profiling and analysis
- Nsight Systems — System-wide profiling
- NVIDIA Visual Profiler — Legacy GUI profiler, superseded by the Nsight tools
## Open Source Projects

### Kernel Libraries
| Project | Focus | Difficulty |
|---|---|---|
| CUTLASS | GEMM, Tensor Cores | Advanced |
| FlashAttention | Attention | Advanced |
| xFormers | Attention, Memory | Intermediate |
| Triton | DSL for kernels | Intermediate |
| DeepSpeed | Training optimization | Advanced |
### Educational
| Project | Description |
|---|---|
| GPU MODE (formerly CUDA MODE) | Community lectures, tutorials, and learning resources for GPU programming |
| Awesome CUDA | Curated CUDA resources |
## Books

### GPU Programming
- Programming Massively Parallel Processors — David B. Kirk, Wen-mei W. Hwu
  - The classic textbook for GPU computing
- CUDA by Example — Jason Sanders, Edward Kandrot
  - Practical introduction to CUDA
- Professional CUDA C Programming — John Cheng, Max Grossman, Ty McKercher
  - Advanced CUDA techniques
### Computer Architecture
- Computer Architecture: A Quantitative Approach — Hennessy & Patterson
  - Understanding memory hierarchies and parallelism
## Online Courses
- NVIDIA Deep Learning Institute — Official NVIDIA courses
- CMU 15-418: Parallel Computer Architecture and Programming — Excellent course on parallelism
- MIT 6.172: Performance Engineering of Software Systems — Software performance optimization
## Key Concepts

### Memory Hierarchy

From fastest and smallest to slowest and largest:

- Registers — per-thread, lowest latency
- Shared memory / L1 cache — per-block, on-chip scratchpad
- L2 cache — shared by all SMs
- Global memory (device DRAM) — device-wide, highest on-device latency
- Host memory — reached over PCIe/NVLink, slowest from the GPU's perspective
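A minimal sketch of where data lives (the kernel and names are illustrative, and it assumes a launch of a single 32-thread block):

```cuda
#include <cuda_runtime.h>

// Assumes a launch of exactly one 32-thread block: memorySpaces<<<1, 32>>>(...).
__global__ void memorySpaces(const float *g_in, float *g_out) {
    float r = g_in[threadIdx.x];   // r lives in a per-thread register
    __shared__ float s[32];        // s lives in per-block shared memory (on-chip)
    s[threadIdx.x] = r;            // visible to every thread in the block
    __syncthreads();               // wait until the whole tile is written
    g_out[threadIdx.x] = s[31 - threadIdx.x];  // g_in/g_out point to global memory
}
```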
### Execution Model

- Threads are grouped into blocks, and blocks into a grid
- Warps of 32 threads execute in lockstep (SIMT) on a streaming multiprocessor (SM)
- Blocks are scheduled independently, so cross-block synchronization requires a kernel boundary (or cooperative groups)
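As a minimal sketch (the kernel and sizes are illustrative), each thread derives a unique global index from its block and thread coordinates:

```cuda
#include <cuda_runtime.h>

// Each thread handles one element: global index = block offset + thread offset.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)        // guard: the last block may be only partially full
        x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    int threads = 256;                        // 8 warps per block
    int blocks = (n + threads - 1) / threads; // enough blocks to cover all n elements
    scale<<<blocks, threads>>>(x, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```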
### Optimization Priority

1. Maximize parallelism — launch enough threads to hide latency
2. Coalesced memory access — adjacent threads access adjacent memory
3. Shared memory usage — stage reused data on-chip to cut global memory traffic (see the stencil sketch after this list)
4. Bank conflict avoidance — keep shared memory accesses efficient
5. Occupancy tuning — balance registers, shared memory, and threads per block
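A sketch tying items 2 and 3 together (the sizes and names are illustrative): a 1D stencil that stages a tile plus its halo in shared memory, so each global element is loaded once per block with coalesced accesses instead of up to 2\*RADIUS+1 times:

```cuda
#include <cuda_runtime.h>

#define RADIUS 3
#define BLOCK  256

__global__ void stencil1d(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index into the shared tile

    tile[l] = (g < n) ? in[g] : 0.0f;               // coalesced main load
    if (threadIdx.x < RADIUS) {                     // first RADIUS threads load the halos
        tile[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        tile[l + BLOCK]  = (g + BLOCK < n) ? in[g + BLOCK] : 0.0f;
    }
    __syncthreads();                                // tile complete before anyone reads it

    if (g < n) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += tile[l + k];                     // all reuse served from shared memory
        out[g] = acc;
    }
}
```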
## Performance Metrics
| Metric | Description | Target |
|---|---|---|
| Throughput | Operations per second | Roofline limit |
| Latency | Time per operation | As low as possible; hide with parallelism |
| Occupancy | Active warps / Max warps | 50-100% |
| Memory Bandwidth | Bytes transferred / second | ~90% peak (see the measurement sketch below) |
| Compute Efficiency | Achieved / Peak FLOPS | >80% for GEMM |
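A minimal sketch (the kernel and sizes are illustrative) of measuring effective memory bandwidth with CUDA events, the same bytes-over-time metric that is compared against the device's peak:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Effective bandwidth = (bytes read + bytes written) / elapsed time.
    double gbps = 2.0 * n * sizeof(float) / (ms * 1e-3) / 1e9;
    printf("Effective bandwidth: %.1f GB/s\n", gbps);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```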
## Common Pitfalls

### Memory Coalescing

Non-coalesced access can cut effective bandwidth by roughly 10-32x; in the worst case, each of a warp's 32 loads becomes its own memory transaction. Always ensure adjacent threads access adjacent memory addresses, as in the sketch below.
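A sketch of the difference for a row-major array (kernel names are illustrative). The first kernel's warp reads 32 consecutive floats, which the hardware merges into a few wide transactions; the second walks down a column, so every lane touches a different cache line:

```cuda
// Coalesced: consecutive threads (threadIdx.x) read consecutive addresses.
__global__ void readRows(const float *in, float *out, int width) {
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    out[row * width + col] = in[row * width + col];
}

// Strided: consecutive threads read addresses `width` floats apart, so a
// warp's 32 loads hit 32 different cache lines.
__global__ void readCols(const float *in, float *out, int width, int height) {
    int col = blockIdx.y;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    out[col * height + row] = in[row * width + col];  // coalesced write, strided read
}
```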
### Shared Memory Bank Conflicts

When multiple threads in a warp access different addresses within the same bank, the accesses are serialized (accesses to the same address are broadcast and conflict-free). Use padding or adjusted access patterns to avoid conflicts, as in the sketch below.
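A sketch of the standard padding fix in a shared-memory transpose (assumes a 32x32 thread block and matrix dimensions that are multiples of 32). Because 32 floats exactly span the 32 banks, all elements of a tile column map to one bank; one extra padding column shifts each row into a different bank:

```cuda
#define TILE 32

// Launch with dim3 block(TILE, TILE) and an (n/TILE) x (n/TILE) grid.
__global__ void transpose(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column: conflict-free column reads

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced global read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;              // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read, no bank conflicts
}
```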
### Branch Divergence

When threads in a warp take different branches, the warp executes each taken path serially with the non-participating lanes masked off. Minimize data-dependent control flow within a warp; the sketch below contrasts a divergent and a branch-free formulation.
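A sketch of the two formulations (the kernels are illustrative). A data-dependent if/else splits the warp into two serialized passes, while the select form lets the compiler predicate instead of branching:

```cuda
#include <math.h>

// Divergent: lanes with positive values run the first path while the rest
// wait, then the roles flip; the warp pays for both paths.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0.0f) x[i] = sqrtf(x[i]);
        else             x[i] = 0.0f;
    }
}

// Branch-free: every lane executes the same instructions, and the compiler
// can lower the select to predicated code.
__global__ void branchless(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = (x[i] > 0.0f) ? sqrtf(x[i]) : 0.0f;
}
```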
### Profiling First

Always profile before optimizing. Use Nsight Compute (the `ncu` command-line tool) to identify actual bottlenecks rather than guessing.