Skip to content

FAQ - Frequently Asked Questions

This document collects common questions about CUDA GEMM optimization and project usage.


Environment & Build

Q1: How to verify CUDA environment is correctly installed?

bash
# Check nvcc version
nvcc --version

# Check GPU driver
nvidia-smi

# Check CUDA path
echo $CUDA_HOME

Q2: Build cannot find CUDA?

Ensure CMAKE_CUDA_COMPILER points to the correct nvcc:

bash
cmake --preset default -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc

Q3: Tests are skipped?

GPU tests require an available NVIDIA GPU. Without GPU, tests skip automatically:

cpp
MINI_INFERENCE_REQUIRE_CUDA_DEVICE();  // Auto-skips in no-GPU environments

CUDA Programming

Q4: What is a warp?

A warp is the basic execution unit on NVIDIA GPUs, containing 32 threads. Threads within a warp execute in SIMD fashion.

Key points:

  • Threads in a warp should execute the same code path (avoid branch divergence)
  • Warp synchronization is implicit, no need for explicit __syncthreads()

Q5: What is bank conflict?

Shared memory is divided into 32 banks, each 4 bytes wide. When multiple threads in a warp access different addresses in the same bank, bank conflict occurs, causing serialized access.

Avoidance:

cuda
// May cause bank conflict
__shared__ float data[BLOCK_SIZE][BLOCK_SIZE];
float val = data[threadIdx.y][threadIdx.x];

// Add padding to avoid bank conflict
__shared__ float data[BLOCK_SIZE][BLOCK_SIZE + 1];

Q6: What is coalesced access?

When threads in a warp access consecutive global memory addresses, these accesses can be coalesced into a single memory transaction, greatly improving throughput.

Best practice:

cuda
// Coalesced: threads access consecutive addresses
int idx = threadIdx.x;
float val = data[idx];  // Coalesced into one 128-byte transaction

// Non-coalesced: threads access strided addresses
int idx = threadIdx.x * stride;  // stride > 1
float val = data[idx];  // Multiple independent transactions

GEMM Optimization

Q7: Why is Naive GEMM so slow?

Naive GEMM bottlenecks:

  1. Repeated memory access: Each output element requires reading K A elements and K B elements
  2. No coalesced access: Thread access patterns are non-consecutive
  3. Low arithmetic intensity: Compute/memory ratio is very low

Q8: Why does Tiled GEMM improve performance?

Tiled GEMM uses shared memory to cache data:

  1. Reduced global memory access: Data is reused in shared memory
  2. Increased arithmetic intensity: Each loaded data participates in multiple computations
  3. Predictable access patterns: Easier to achieve coalesced access

Q9: How does double buffering hide latency?

Double buffering uses two shared memory buffers:

cuda
// Pseudocode
while (more_tiles) {
    // Compute current tile, load next tile simultaneously
    compute(buffer[current]);
    load(buffer[next]);
    swap(current, next);
}

Compute and memory access overlap, hiding memory latency.

Q10: Why is register blocking the most important optimization?

Register blocking lets each thread compute a small tile:

  1. Maximized register usage: Registers are the fastest storage
  2. Increased arithmetic intensity: Each loaded data participates in more computations
  3. Reduced synchronization: Intra-thread computation needs no synchronization

Q11: Why not just use cuBLAS?

This project is for educational purposes. The goal is understanding optimization principles. Production should use cuBLAS.

Value of understanding principles:

  • Explain GPU performance optimization in interviews
  • Implement custom kernels when cuBLAS doesn't support your scenario
  • Understand CUTLASS and other advanced libraries

Performance Analysis

Q12: How to use Nsight Compute?

bash
# Detailed analysis of single kernel
ncu --set full ./benchmark --kernel=vectorized

# Only key metrics
ncu --metrics gpu__time_duration.sum ./benchmark

Q13: How to interpret performance data?

Key metrics:

MetricMeaningTarget
SM EfficiencySM utilization> 80%
Memory ThroughputMemory throughputNear peak
Warp Execution EfficiencyWarp execution efficiency> 90%
Achieved OccupancyActual occupancyDepends on kernel

Q14: Why is my kernel performance unstable?

Possible causes:

  1. GPU frequency fluctuation: Check nvidia-smi -q -d CLOCK
  2. Thermal throttling: Check temperature nvidia-smi -q -d TEMPERATURE
  3. System load: Close other GPU applications
  4. Memory fragmentation: Restart program or system

Project Usage

Q15: How to add a new GEMM kernel?

  1. Declare kernel in include/kernels.cuh
  2. Create implementation file in src/
  3. Add test in tests/
  4. Add benchmark in benchmarks/

Q16: How to use AutoTuner?

cpp
#include "autotuner.h"

AutoTuner tuner;
tuner.add_param("BLOCK_SIZE", {16, 32, 64, 128});
tuner.add_param("TILE_M", {4, 8, 16});
auto best = tuner.search(gemm_kernel);

Q17: How to use MemoryPool?

cpp
#include "memory_pool.h"

MemoryPool pool(1024 * 1024 * 100);  // 100 MB pool
auto* ptr = pool.allocate<float>(1024);
// Use ptr...
pool.deallocate(ptr);

Theory Questions

Q18: What is the Roofline model?

The Roofline model is a visual performance analysis framework:

Performance = min(Peak FLOPS, Peak BW * AI)

Where AI (Arithmetic Intensity) = FLOPs / Bytes

Significance:

  • Low AI kernels are memory bandwidth limited
  • High AI kernels are compute limited

Q19: How to calculate GEMM's arithmetic intensity?

For M×N×K GEMM:

FLOPs = 2 * M * N * K  (each output needs K multiplies and K-1 adds)
Bytes = 4 * (M*K + K*N + M*N)  (assuming float32)
AI = FLOPs / Bytes = 2*M*N*K / (4*(M*K + K*N + M*N))

For square matrices (M=N=K):

AI = 2*M^3 / (12*M^2) = M/6

Q20: Why does FP16 improve performance?

  1. Halved memory bandwidth: Data size reduced
  2. Tensor Core: Leverage dedicated hardware
  3. More parallelism: Same register count stores more data

More Questions?

If you have other questions:

  1. Ask at GitHub Issues
  2. Check CUDA Best Practices Guide
  3. Read Project Documentation

MIT License | CUDA GEMM optimization tutorial