Learning Resources

A curated list of resources for learning CUDA programming and GPU kernel optimization.

Official NVIDIA Resources

Documentation

  • CUDA C++ Programming Guide — Core language and runtime reference
  • CUDA C++ Best Practices Guide — Optimization guidance for CUDA kernels
  • PTX ISA — Reference for the low-level virtual instruction set

Libraries

  • cuBLAS — Dense linear algebra (see the sketch after this list)
  • cuDNN — Deep learning primitives
  • cuSPARSE — Sparse linear algebra
  • NCCL — Multi-GPU communication
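
As a quick taste of these APIs, here is a minimal cuBLAS sketch that computes C = alpha * A * B + beta * C in single precision. It assumes device buffers d_A, d_B, d_C (hypothetical names) are already allocated and filled, and omits error handling.

```cuda
#include <cublas_v2.h>

// Single-precision GEMM: C = alpha * A * B + beta * C, where A is M x K,
// B is K x N, and C is M x N, all column-major and already on the device.
void sgemm_example(const float* d_A, const float* d_B, float* d_C,
                   int M, int N, int K) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS follows BLAS conventions: column-major storage, so the
    // leading dimensions are lda = M, ldb = K, ldc = M.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, N, K,
                &alpha, d_A, M,
                        d_B, K,
                &beta,  d_C, M);

    cublasDestroy(handle);
}
```

Compile with nvcc and link with -lcublas.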

Tools

  • Nsight Compute — Kernel-level profiler
  • Nsight Systems — System-wide timeline profiler
  • CUDA-GDB — GPU debugger
  • Compute Sanitizer — Memory and race error detection

Open Source Projects

Kernel Libraries

| Project | Focus | Difficulty |
| --- | --- | --- |
| CUTLASS | GEMM, Tensor Cores | Advanced |
| FlashAttention | Attention | Advanced |
| xFormers | Attention, Memory | Intermediate |
| Triton | DSL for kernels | Intermediate |
| DeepSpeed | Training optimization | Advanced |

Educational

| Project | Description |
| --- | --- |
| CUDA Mode | CUDA learning resources |
| GPU Mode | GPU programming tutorials |
| Awesome CUDA | Curated CUDA resources |

Books

GPU Programming

  • Programming Massively Parallel Processors — David B. Kirk, Wen-mei W. Hwu
    • The classic textbook for GPU computing
  • CUDA by Example — Jason Sanders, Edward Kandrot
    • Practical introduction to CUDA
  • Professional CUDA C Programming — John Cheng, Max Grossman, Ty McKercher
    • Advanced CUDA techniques

Computer Architecture

  • Computer Architecture: A Quantitative Approach — Hennessy & Patterson
    • Understanding memory hierarchies and parallelism

Online Courses

Key Concepts

Memory Hierarchy

  • Registers — Fastest storage, private to each thread
  • Shared memory — Low-latency on-chip memory, shared within a thread block
  • L2 cache — On-chip cache shared by all SMs
  • Global memory — Device DRAM; largest capacity, highest latency

Execution Model

  • Threads execute in warps of 32 that issue instructions in lockstep
  • Warps are grouped into thread blocks, each resident on a single SM
  • Blocks form the grid that covers an entire kernel launch
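
To make the indexing concrete, here is the canonical vector-add kernel: each thread derives one unique global index from its block and thread coordinates. A minimal sketch; allocation, copies, and error handling are left to the caller.

```cuda
// One thread per element: blockIdx, blockDim, and threadIdx together
// define each thread's unique global index.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

// Launch with enough 256-thread blocks to cover all n elements:
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```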

Optimization Priority

  1. Maximize Parallelism — Enough threads to hide latency
  2. Coalesced Memory Access — Adjacent threads access adjacent memory
  3. Shared Memory Usage — Reduce global memory traffic (see the reduction sketch after this list)
  4. Bank Conflict Avoidance — Ensure shared memory efficiency
  5. Occupancy Tuning — Balance registers, shared memory, threads
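
As a small illustration of points 1-3, the classic block-level sum reduction launches many threads, loads data with coalesced reads, and combines partial sums through shared memory instead of repeated global-memory traffic. A minimal sketch, assuming the block size is a power of two:

```cuda
// Each block reduces blockDim.x elements to one partial sum.
__global__ void block_reduce_sum(const float* in, float* out, int n) {
    extern __shared__ float sdata[];        // sized at launch time
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;    // coalesced global load
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];         // one partial sum per block
}

// Launch (256 threads, 256 floats of dynamic shared memory per block):
// block_reduce_sum<<<num_blocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```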

Performance Metrics

| Metric | Description | Target |
| --- | --- | --- |
| Throughput | Operations per second | Roofline limit |
| Latency | Time per operation | Minimal |
| Occupancy | Active warps / max warps | 50-100% |
| Memory Bandwidth | Bytes transferred per second | ~90% of peak |
| Compute Efficiency | Achieved / peak FLOPS | >80% for GEMM |
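
Theoretical occupancy can be queried directly instead of computed by hand. The sketch below follows the occupancy-calculator pattern from the CUDA runtime API, applying cudaOccupancyMaxActiveBlocksPerMultiprocessor to a hypothetical kernel my_kernel:

```cuda
#include <cstdio>

__global__ void my_kernel(float* x) {       // hypothetical kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] *= 2.0f;
}

// Prints the theoretical occupancy of my_kernel at a given block size.
void report_occupancy(int block_size) {
    int num_blocks = 0;   // max resident blocks per SM for this kernel
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &num_blocks, my_kernel, block_size, /*dynamicSMemSize=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int active_warps = num_blocks * block_size / prop.warpSize;
    int max_warps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy: %.0f%%\n", 100.0 * active_warps / max_warps);
}
```

If the kernel is limited by registers or shared memory, num_blocks drops, which is exactly the trade-off named in the Occupancy Tuning item above.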

Common Pitfalls

Memory Coalescing

Non-coalesced memory access can reduce bandwidth by 10-32x. Always ensure adjacent threads access adjacent memory addresses.
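
A minimal sketch of the difference (the kernel names are hypothetical):

```cuda
// Coalesced: thread i touches element i, so a warp's 32 accesses fall in
// consecutive addresses and combine into a few memory transactions.
__global__ void scale_coalesced(float* data, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

// Strided: consecutive threads are `stride` elements apart, so each access
// may land in a different memory segment and effective bandwidth collapses.
__global__ void scale_strided(float* data, int n, int stride, float s) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) data[i] *= s;
}
```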

Shared Memory Bank Conflicts

When multiple threads in a warp access different words that map to the same shared memory bank, the accesses are serialized. Avoid conflicts with padding or by reworking the access pattern.
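
The standard fix is to pad a shared-memory tile by one element so that column accesses no longer map to a single bank. A sketch based on the well-known matrix-transpose pattern, assuming a square matrix whose side is a multiple of 32:

```cuda
#define TILE 32

__global__ void transpose_tile(const float* in, float* out, int width) {
    // Without the +1 pad, a warp reading tile[threadIdx.x][threadIdx.y]
    // strides by 32 floats, hitting one bank 32 times (a 32-way conflict).
    // The extra column shifts each row into a different bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                  // swap block coords
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}

// Launch: dim3 block(TILE, TILE); dim3 grid(width / TILE, width / TILE);
// transpose_tile<<<grid, block>>>(d_in, d_out, width);
```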

Branch Divergence

When threads within a warp take different branch paths, the warp executes each path sequentially with inactive lanes masked off. Minimize data-dependent control flow at warp granularity.
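
A small sketch of the difference, using hypothetical kernels where the branch either depends on the lane index (divergent) or is uniform across each 32-thread warp:

```cuda
// Divergent: even and odd lanes take different paths, so every warp
// executes both branches back to back with half its lanes masked off.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.0f;
    else            x[i] = x[i] + 1.0f;
}

// Warp-aligned: the condition is constant within each 32-thread warp,
// so no warp ever takes both paths.
__global__ void warp_aligned(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] = x[i] * 2.0f;
    else                   x[i] = x[i] + 1.0f;
}
```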

Profiling First

Always profile before optimizing. Use Nsight Compute to identify actual bottlenecks rather than guessing.
