FAQ - Frequently Asked Questions

This document collects common questions about CUDA GEMM optimization and project usage.

Environment & Build

Q1: How to verify CUDA environment is correctly installed?

bash

# Check nvcc version
nvcc --version

# Check GPU driver
nvidia-smi

# Check CUDA path
echo $CUDA_HOME

Q2: Build cannot find CUDA?

Ensure CMAKE_CUDA_COMPILER points to the correct nvcc:

bash

cmake --preset default -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc

Q3: Tests are skipped?

GPU tests require an available NVIDIA GPU. Without GPU, tests skip automatically:

cpp

MINI_INFERENCE_REQUIRE_CUDA_DEVICE();  // Auto-skips in no-GPU environments

CUDA Programming

Q4: What is a warp?

A warp is the basic execution unit on NVIDIA GPUs, containing 32 threads. Threads within a warp execute in SIMD fashion.

Key points:

Threads in a warp should execute the same code path (avoid branch divergence)
Warp synchronization is implicit, no need for explicit __syncthreads()

Q5: What is bank conflict?

Shared memory is divided into 32 banks, each 4 bytes wide. When multiple threads in a warp access different addresses in the same bank, bank conflict occurs, causing serialized access.

Avoidance:

cuda

// May cause bank conflict
__shared__ float data[BLOCK_SIZE][BLOCK_SIZE];
float val = data[threadIdx.y][threadIdx.x];

// Add padding to avoid bank conflict
__shared__ float data[BLOCK_SIZE][BLOCK_SIZE + 1];

Q6: What is coalesced access?

When threads in a warp access consecutive global memory addresses, these accesses can be coalesced into a single memory transaction, greatly improving throughput.

Best practice:

cuda

// Coalesced: threads access consecutive addresses
int idx = threadIdx.x;
float val = data[idx];  // Coalesced into one 128-byte transaction

// Non-coalesced: threads access strided addresses
int idx = threadIdx.x * stride;  // stride > 1
float val = data[idx];  // Multiple independent transactions

GEMM Optimization

Q7: Why is Naive GEMM so slow?

Naive GEMM bottlenecks:

Repeated memory access: Each output element requires reading K A elements and K B elements
No coalesced access: Thread access patterns are non-consecutive
Low arithmetic intensity: Compute/memory ratio is very low

Q8: Why does Tiled GEMM improve performance?

Tiled GEMM uses shared memory to cache data:

Reduced global memory access: Data is reused in shared memory
Increased arithmetic intensity: Each loaded data participates in multiple computations
Predictable access patterns: Easier to achieve coalesced access

Q9: How does double buffering hide latency?

Double buffering uses two shared memory buffers:

cuda

// Pseudocode
while (more_tiles) {
    // Compute current tile, load next tile simultaneously
    compute(buffer[current]);
    load(buffer[next]);
    swap(current, next);
}

Compute and memory access overlap, hiding memory latency.

Q10: Why is register blocking the most important optimization?

Maximized register usage: Registers are the fastest storage
Increased arithmetic intensity: Each loaded data participates in more computations
Reduced synchronization: Intra-thread computation needs no synchronization

Q11: Why not just use cuBLAS?

This project is for educational purposes. The goal is understanding optimization principles. Production should use cuBLAS.

Value of understanding principles:

Explain GPU performance optimization in interviews
Implement custom kernels when cuBLAS doesn't support your scenario
Understand CUTLASS and other advanced libraries

Performance Analysis

Q12: How to use Nsight Compute?

bash

# Detailed analysis of single kernel
ncu --set full ./benchmark --kernel=vectorized

# Only key metrics
ncu --metrics gpu__time_duration.sum ./benchmark

Q13: How to interpret performance data?

Key metrics:

Metric	Meaning	Target
SM Efficiency	SM utilization	> 80%
Memory Throughput	Memory throughput	Near peak
Warp Execution Efficiency	Warp execution efficiency	> 90%
Achieved Occupancy	Actual occupancy	Depends on kernel

Q14: Why is my kernel performance unstable?

Possible causes:

GPU frequency fluctuation: Check nvidia-smi -q -d CLOCK
Thermal throttling: Check temperature nvidia-smi -q -d TEMPERATURE
System load: Close other GPU applications
Memory fragmentation: Restart program or system

Project Usage

Q15: How to add a new GEMM kernel?

Declare kernel in include/kernels.cuh
Create implementation file in src/
Add test in tests/
Add benchmark in benchmarks/

Q16: How to use AutoTuner?

cpp

#include "autotuner.h"

AutoTuner tuner;
tuner.add_param("BLOCK_SIZE", {16, 32, 64, 128});
tuner.add_param("TILE_M", {4, 8, 16});
auto best = tuner.search(gemm_kernel);

Q17: How to use MemoryPool?

cpp

#include "memory_pool.h"

MemoryPool pool(1024 * 1024 * 100);  // 100 MB pool
auto* ptr = pool.allocate<float>(1024);
// Use ptr...
pool.deallocate(ptr);

Theory Questions

Q18: What is the Roofline model?

The Roofline model is a visual performance analysis framework:

Performance = min(Peak FLOPS, Peak BW * AI)

Where AI (Arithmetic Intensity) = FLOPs / Bytes

Significance:

Low AI kernels are memory bandwidth limited
High AI kernels are compute limited

Q19: How to calculate GEMM's arithmetic intensity?

For M×N×K GEMM:

FLOPs = 2 * M * N * K  (each output needs K multiplies and K-1 adds)
Bytes = 4 * (M*K + K*N + M*N)  (assuming float32)
AI = FLOPs / Bytes = 2*M*N*K / (4*(M*K + K*N + M*N))

For square matrices (M=N=K):

AI = 2*M^3 / (12*M^2) = M/6

Q20: Why does FP16 improve performance?

Halved memory bandwidth: Data size reduced
Tensor Core: Leverage dedicated hardware
More parallelism: Same register count stores more data

FAQ - Frequently Asked Questions ​

Environment & Build ​

Q1: How to verify CUDA environment is correctly installed? ​

Q2: Build cannot find CUDA? ​

Q3: Tests are skipped? ​

CUDA Programming ​

Q4: What is a warp? ​

Q5: What is bank conflict? ​

Q6: What is coalesced access? ​

GEMM Optimization ​

Q7: Why is Naive GEMM so slow? ​

Q8: Why does Tiled GEMM improve performance? ​

Q9: How does double buffering hide latency? ​

Q10: Why is register blocking the most important optimization? ​

Q11: Why not just use cuBLAS? ​

Performance Analysis ​

Q12: How to use Nsight Compute? ​

Q13: How to interpret performance data? ​

Q14: Why is my kernel performance unstable? ​

Project Usage ​

Q15: How to add a new GEMM kernel? ​

Q16: How to use AutoTuner? ​

Q17: How to use MemoryPool? ​

Theory Questions ​

Q18: What is the Roofline model? ​

Q19: How to calculate GEMM's arithmetic intensity? ​

Q20: Why does FP16 improve performance? ​

More Questions? ​

FAQ - Frequently Asked Questions

Environment & Build

Q1: How to verify CUDA environment is correctly installed?

Q2: Build cannot find CUDA?

Q3: Tests are skipped?

CUDA Programming

Q4: What is a warp?

Q5: What is bank conflict?

Q6: What is coalesced access?

GEMM Optimization

Q7: Why is Naive GEMM so slow?

Q8: Why does Tiled GEMM improve performance?

Q9: How does double buffering hide latency?

Q10: Why is register blocking the most important optimization?

Q11: Why not just use cuBLAS?

Performance Analysis

Q12: How to use Nsight Compute?

Q13: How to interpret performance data?

Q14: Why is my kernel performance unstable?

Project Usage

Q15: How to add a new GEMM kernel?

Q16: How to use AutoTuner?

Q17: How to use MemoryPool?

Theory Questions

Q18: What is the Roofline model?

Q19: How to calculate GEMM's arithmetic intensity?

Q20: Why does FP16 improve performance?

More Questions?