FAQ - Frequently Asked Questions
This document collects common questions about CUDA GEMM optimization and project usage.
Environment & Build
Q1: How to verify CUDA environment is correctly installed?
# Check nvcc version
nvcc --version
# Check GPU driver
nvidia-smi
# Check CUDA path
echo $CUDA_HOMEQ2: Build cannot find CUDA?
Ensure CMAKE_CUDA_COMPILER points to the correct nvcc:
cmake --preset default -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvccQ3: Tests are skipped?
GPU tests require an available NVIDIA GPU. Without GPU, tests skip automatically:
MINI_INFERENCE_REQUIRE_CUDA_DEVICE(); // Auto-skips in no-GPU environmentsCUDA Programming
Q4: What is a warp?
A warp is the basic execution unit on NVIDIA GPUs, containing 32 threads. Threads within a warp execute in SIMD fashion.
Key points:
- Threads in a warp should execute the same code path (avoid branch divergence)
- Warp synchronization is implicit, no need for explicit
__syncthreads()
Q5: What is bank conflict?
Shared memory is divided into 32 banks, each 4 bytes wide. When multiple threads in a warp access different addresses in the same bank, bank conflict occurs, causing serialized access.
Avoidance:
// May cause bank conflict
__shared__ float data[BLOCK_SIZE][BLOCK_SIZE];
float val = data[threadIdx.y][threadIdx.x];
// Add padding to avoid bank conflict
__shared__ float data[BLOCK_SIZE][BLOCK_SIZE + 1];Q6: What is coalesced access?
When threads in a warp access consecutive global memory addresses, these accesses can be coalesced into a single memory transaction, greatly improving throughput.
Best practice:
// Coalesced: threads access consecutive addresses
int idx = threadIdx.x;
float val = data[idx]; // Coalesced into one 128-byte transaction
// Non-coalesced: threads access strided addresses
int idx = threadIdx.x * stride; // stride > 1
float val = data[idx]; // Multiple independent transactionsGEMM Optimization
Q7: Why is Naive GEMM so slow?
Naive GEMM bottlenecks:
- Repeated memory access: Each output element requires reading K A elements and K B elements
- No coalesced access: Thread access patterns are non-consecutive
- Low arithmetic intensity: Compute/memory ratio is very low
Q8: Why does Tiled GEMM improve performance?
Tiled GEMM uses shared memory to cache data:
- Reduced global memory access: Data is reused in shared memory
- Increased arithmetic intensity: Each loaded data participates in multiple computations
- Predictable access patterns: Easier to achieve coalesced access
Q9: How does double buffering hide latency?
Double buffering uses two shared memory buffers:
// Pseudocode
while (more_tiles) {
// Compute current tile, load next tile simultaneously
compute(buffer[current]);
load(buffer[next]);
swap(current, next);
}Compute and memory access overlap, hiding memory latency.
Q10: Why is register blocking the most important optimization?
Register blocking lets each thread compute a small tile:
- Maximized register usage: Registers are the fastest storage
- Increased arithmetic intensity: Each loaded data participates in more computations
- Reduced synchronization: Intra-thread computation needs no synchronization
Q11: Why not just use cuBLAS?
This project is for educational purposes. The goal is understanding optimization principles. Production should use cuBLAS.
Value of understanding principles:
- Explain GPU performance optimization in interviews
- Implement custom kernels when cuBLAS doesn't support your scenario
- Understand CUTLASS and other advanced libraries
Performance Analysis
Q12: How to use Nsight Compute?
# Detailed analysis of single kernel
ncu --set full ./benchmark --kernel=vectorized
# Only key metrics
ncu --metrics gpu__time_duration.sum ./benchmarkQ13: How to interpret performance data?
Key metrics:
| Metric | Meaning | Target |
|---|---|---|
| SM Efficiency | SM utilization | > 80% |
| Memory Throughput | Memory throughput | Near peak |
| Warp Execution Efficiency | Warp execution efficiency | > 90% |
| Achieved Occupancy | Actual occupancy | Depends on kernel |
Q14: Why is my kernel performance unstable?
Possible causes:
- GPU frequency fluctuation: Check
nvidia-smi -q -d CLOCK - Thermal throttling: Check temperature
nvidia-smi -q -d TEMPERATURE - System load: Close other GPU applications
- Memory fragmentation: Restart program or system
Project Usage
Q15: How to add a new GEMM kernel?
- Declare kernel in
include/kernels.cuh - Create implementation file in
src/ - Add test in
tests/ - Add benchmark in
benchmarks/
Q16: How to use AutoTuner?
#include "autotuner.h"
AutoTuner tuner;
tuner.add_param("BLOCK_SIZE", {16, 32, 64, 128});
tuner.add_param("TILE_M", {4, 8, 16});
auto best = tuner.search(gemm_kernel);Q17: How to use MemoryPool?
#include "memory_pool.h"
MemoryPool pool(1024 * 1024 * 100); // 100 MB pool
auto* ptr = pool.allocate<float>(1024);
// Use ptr...
pool.deallocate(ptr);Theory Questions
Q18: What is the Roofline model?
The Roofline model is a visual performance analysis framework:
Performance = min(Peak FLOPS, Peak BW * AI)Where AI (Arithmetic Intensity) = FLOPs / Bytes
Significance:
- Low AI kernels are memory bandwidth limited
- High AI kernels are compute limited
Q19: How to calculate GEMM's arithmetic intensity?
For M×N×K GEMM:
FLOPs = 2 * M * N * K (each output needs K multiplies and K-1 adds)
Bytes = 4 * (M*K + K*N + M*N) (assuming float32)
AI = FLOPs / Bytes = 2*M*N*K / (4*(M*K + K*N + M*N))For square matrices (M=N=K):
AI = 2*M^3 / (12*M^2) = M/6Q20: Why does FP16 improve performance?
- Halved memory bandwidth: Data size reduced
- Tensor Core: Leverage dedicated hardware
- More parallelism: Same register count stores more data
More Questions?
If you have other questions:
- Ask at GitHub Issues
- Check CUDA Best Practices Guide
- Read Project Documentation