# Learning Resources
A curated list of resources for learning CUDA programming and GPU kernel optimization.
## Official NVIDIA Resources

### Documentation
- CUDA C++ Programming Guide — The authoritative reference for CUDA programming
- CUDA Best Practices Guide — Optimization strategies and common pitfalls
- CUDA Profiling Tools Interface (CUPTI) — APIs for building custom profiling and tracing tools
### Libraries
- cuBLAS — Dense linear algebra
- cuDNN — Deep learning primitives
- cuSPARSE — Sparse linear algebra
- NCCL — Multi-GPU communication
### Tools
- Nsight Compute — Kernel profiling and analysis
- Nsight Systems — System-wide profiling
- NVIDIA Visual Profiler — Legacy GUI profiler, superseded by the Nsight tools
## Open Source Projects

### Kernel Libraries
| Project | Focus | Difficulty |
|---|---|---|
| CUTLASS | GEMM, Tensor Cores | Advanced |
| FlashAttention | Attention | Advanced |
| xFormers | Attention, Memory | Intermediate |
| Triton | DSL for kernels | Intermediate |
| DeepSpeed | Training optimization | Advanced |
### Educational
| Project | Description |
|---|---|
| GPU MODE (formerly CUDA MODE) | Community lectures, tutorials, and learning resources for GPU programming |
| Awesome CUDA | Curated CUDA resources |
## Books

### GPU Programming
- Programming Massively Parallel Processors — David B. Kirk, Wen-mei W. Hwu
  - The classic textbook for GPU computing
- CUDA by Example — Jason Sanders, Edward Kandrot
  - Practical introduction to CUDA
- Professional CUDA C Programming — John Cheng, Max Grossman, Ty McKercher
  - Advanced CUDA techniques
### Computer Architecture
- Computer Architecture: A Quantitative Approach — Hennessy & Patterson
  - Understanding memory hierarchies and parallelism
## Online Courses
- NVIDIA Deep Learning Institute — Official NVIDIA courses
- CMU 15-418: Parallel Computer Architecture and Programming — Excellent course on parallelism
- MIT 6.172: Performance Engineering of Software Systems — Software performance optimization
## Key Concepts

### Memory Hierarchy

From fastest and smallest to slowest and largest:

- Registers — per-thread, lowest latency
- Shared memory / L1 cache — per-block, on-chip scratchpad
- L2 cache — shared by all SMs
- Global memory (device DRAM) — device-wide, highest on-device latency
- Host memory — reached over PCIe/NVLink, slowest from the GPU's perspective
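A minimal sketch of where data lives (the kernel and names are illustrative, and it assumes a launch of a single 32-thread block):

```cuda
#include <cuda_runtime.h>

// Assumes a launch of exactly one 32-thread block: memorySpaces<<<1, 32>>>(...).
__global__ void memorySpaces(const float *g_in, float *g_out) {
    float r = g_in[threadIdx.x];   // r lives in a per-thread register
    __shared__ float s[32];        // s lives in per-block shared memory (on-chip)
    s[threadIdx.x] = r;            // visible to every thread in the block
    __syncthreads();               // wait until the whole tile is written
    g_out[threadIdx.x] = s[31 - threadIdx.x];  // g_in/g_out point to global memory
}
```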
### Execution Model

- Threads are grouped into blocks, and blocks into a grid
- Warps of 32 threads execute in lockstep (SIMT) on a streaming multiprocessor (SM)
- Blocks are scheduled independently, so cross-block synchronization requires a kernel boundary (or cooperative groups)
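As a minimal sketch (the kernel and sizes are illustrative), each thread derives a unique global index from its block and thread coordinates:

```cuda
#include <cuda_runtime.h>

// Each thread handles one element: global index = block offset + thread offset.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)        // guard: the last block may be only partially full
        x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    int threads = 256;                        // 8 warps per block
    int blocks = (n + threads - 1) / threads; // enough blocks to cover all n elements
    scale<<<blocks, threads>>>(x, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```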
### Optimization Priority

1. Maximize parallelism — launch enough threads to hide latency
2. Coalesced memory access — adjacent threads access adjacent memory
3. Shared memory usage — stage reused data on-chip to cut global memory traffic (see the stencil sketch after this list)
4. Bank conflict avoidance — keep shared memory accesses efficient
5. Occupancy tuning — balance registers, shared memory, and threads per block
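A sketch tying items 2 and 3 together (the sizes and names are illustrative): a 1D stencil that stages a tile plus its halo in shared memory, so each global element is loaded once per block with coalesced accesses instead of up to 2\*RADIUS+1 times:

```cuda
#include <cuda_runtime.h>

#define RADIUS 3
#define BLOCK  256

__global__ void stencil1d(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index into the shared tile

    tile[l] = (g < n) ? in[g] : 0.0f;               // coalesced main load
    if (threadIdx.x < RADIUS) {                     // first RADIUS threads load the halos
        tile[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        tile[l + BLOCK]  = (g + BLOCK < n) ? in[g + BLOCK] : 0.0f;
    }
    __syncthreads();                                // tile complete before anyone reads it

    if (g < n) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += tile[l + k];                     // all reuse served from shared memory
        out[g] = acc;
    }
}
```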
## Performance Metrics
| Metric | Description | Target |
|---|---|---|
| Throughput | Operations per second | Roofline limit |
| Latency | Time per operation | As low as possible; hide with parallelism |
| Occupancy | Active warps / Max warps | 50-100% |
| Memory Bandwidth | Bytes transferred / second | ~90% peak (see the measurement sketch below) |
| Compute Efficiency | Achieved / Peak FLOPS | >80% for GEMM |
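A minimal sketch (the kernel and sizes are illustrative) of measuring effective memory bandwidth with CUDA events, the same bytes-over-time metric that is compared against the device's peak:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Effective bandwidth = (bytes read + bytes written) / elapsed time.
    double gbps = 2.0 * n * sizeof(float) / (ms * 1e-3) / 1e9;
    printf("Effective bandwidth: %.1f GB/s\n", gbps);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```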
## Common Pitfalls

### Memory Coalescing

Non-coalesced access can cut effective bandwidth by roughly 10-32x; in the worst case, each of a warp's 32 loads becomes its own memory transaction. Always ensure adjacent threads access adjacent memory addresses, as in the sketch below.
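A sketch of the difference for a row-major array (kernel names are illustrative). The first kernel's warp reads 32 consecutive floats, which the hardware merges into a few wide transactions; the second walks down a column, so every lane touches a different cache line:

```cuda
// Coalesced: consecutive threads (threadIdx.x) read consecutive addresses.
__global__ void readRows(const float *in, float *out, int width) {
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    out[row * width + col] = in[row * width + col];
}

// Strided: consecutive threads read addresses `width` floats apart, so a
// warp's 32 loads hit 32 different cache lines.
__global__ void readCols(const float *in, float *out, int width, int height) {
    int col = blockIdx.y;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    out[col * height + row] = in[row * width + col];  // coalesced write, strided read
}
```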
### Shared Memory Bank Conflicts

When multiple threads in a warp access different addresses within the same bank, the accesses are serialized (accesses to the same address are broadcast and conflict-free). Use padding or adjusted access patterns to avoid conflicts, as in the sketch below.
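A sketch of the standard padding fix in a shared-memory transpose (assumes a 32x32 thread block and matrix dimensions that are multiples of 32). Because 32 floats exactly span the 32 banks, all elements of a tile column map to one bank; one extra padding column shifts each row into a different bank:

```cuda
#define TILE 32

// Launch with dim3 block(TILE, TILE) and an (n/TILE) x (n/TILE) grid.
__global__ void transpose(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column: conflict-free column reads

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced global read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;              // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read, no bank conflicts
}
```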
### Branch Divergence

When threads in a warp take different branches, the warp executes each taken path serially with the non-participating lanes masked off. Minimize data-dependent control flow within a warp; the sketch below contrasts a divergent and a branch-free formulation.
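A sketch of the two formulations (the kernels are illustrative). A data-dependent if/else splits the warp into two serialized passes, while the select form lets the compiler predicate instead of branching:

```cuda
#include <math.h>

// Divergent: lanes with positive values run the first path while the rest
// wait, then the roles flip; the warp pays for both paths.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0.0f) x[i] = sqrtf(x[i]);
        else             x[i] = 0.0f;
    }
}

// Branch-free: every lane executes the same instructions, and the compiler
// can lower the select to predicated code.
__global__ void branchless(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = (x[i] > 0.0f) ? sqrtf(x[i]) : 0.0f;
}
```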
### Profiling First

Always profile before optimizing. Use Nsight Compute (the `ncu` command-line tool) to identify actual bottlenecks rather than guessing.