Performance Analysis
This document introduces CUDA performance analysis tools and optimization methods.
Nsight Compute
Nsight Compute is NVIDIA's kernel-level performance analysis tool.
Basic Usage
```bash
# Run analysis
ncu ./benchmark
# Detailed analysis
ncu --set full ./benchmark
# Specify kernel
ncu -k regex:gemm ./benchmark
```
Key Metrics
```bash
# View all available metrics
ncu --query-metrics
# Common metric combination
ncu --metrics \
gpu__time_duration.sum,\
sm__warps_active.avg.pct_of_peak_sustained_active,\
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,\
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum \
./benchmark
```
Metric Interpretation
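| Metric | Meaning | Target |
|---|---|---|
| gpu__time_duration.sum | Kernel execution time | Lower is better |
| sm__warps_active.avg.pct_of_peak_sustained_active | Achieved occupancy (active warps as a percentage of the SM maximum) | > 80% as a rule of thumb |
| gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed | Global memory (DRAM) throughput as a percentage of peak | > 80% for memory-bound kernels |
| l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum | Shared-memory load bank conflict count | Near 0 |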
Nsight Systems
Nsight Systems is NVIDIA's system-level performance analysis tool. It shows the kernel timeline and how CPU and GPU work overlap.
Basic Usage
```bash
# Generate a timeline report named report.nsys-rep
nsys profile -o report ./benchmark
# View report
nsys-ui ./report.nsys-rep
```
Analysis Content
- Kernel execution timeline
- CPU-GPU concurrency
- CUDA API calls
- Memory transfers
Performance Optimization Methods
1. Occupancy Optimization
Occupancy = Active warps per SM / Maximum warps per SM
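The ratio above can be estimated before profiling. The following is a minimal host-side C++ sketch and not part of the project: `SmLimits` and `theoretical_occupancy` are hypothetical names, the default limits are assumed values for compute capability 8.6 (the RTX 3080 class), and register and shared-memory allocation granularity is ignored, so real hardware may report slightly lower numbers.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM resource limits. The defaults are ASSUMED values for compute
// capability 8.6; look up the limits for other GPUs before relying on them.
struct SmLimits {
    int max_threads    = 1536;        // resident threads per SM
    int max_blocks     = 16;          // resident blocks per SM
    int max_registers  = 65536;       // 32-bit registers per SM
    int max_shared_mem = 100 * 1024;  // bytes of shared memory per SM
    int warp_size      = 32;
};

// Theoretical occupancy in [0, 1]: resident warps / maximum warps per SM.
// Simplified model: allocation granularity is not taken into account.
double theoretical_occupancy(int threads_per_block,
                             int registers_per_thread,
                             int shared_mem_per_block,
                             const SmLimits& sm = SmLimits()) {
    assert(threads_per_block > 0 && registers_per_thread > 0);
    assert(shared_mem_per_block >= 0);

    const int warps_per_block =
        (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    const int max_warps = sm.max_threads / sm.warp_size;

    // How many blocks fit on one SM under each resource limit.
    const int by_threads = max_warps / warps_per_block;
    const int by_regs = sm.max_registers /
        (registers_per_thread * warps_per_block * sm.warp_size);
    const int by_smem = shared_mem_per_block > 0
        ? sm.max_shared_mem / shared_mem_per_block
        : sm.max_blocks;

    // The tightest limit decides how many blocks are resident.
    const int blocks =
        std::min({by_threads, by_regs, by_smem, sm.max_blocks});
    return static_cast<double>(blocks * warps_per_block) / max_warps;
}
```

With these assumed limits, `theoretical_occupancy(256, 32, 0)` gives 1.0, while raising register use to 64 per thread drops it to about 0.67 because only four blocks fit on an SM.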
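```cuda
// Estimate occupancy limits for one kernel launch
int threads_per_block = BLOCK_SIZE * BLOCK_SIZE;
// Upper bound from the thread limit alone; registers and shared memory can lower it
int blocks_per_sm = max_threads_per_sm / threads_per_block;
int registers_per_thread = ...; // From Nsight Compute
int shared_mem_per_block = ...;
// Check hardware constraints
assert(threads_per_block <= 1024);                         // Max threads per block
assert(registers_per_thread * threads_per_block <= 65536); // 64K 32-bit registers per block
// 48 KB is the default limit; larger sizes need opt-in via cudaFuncSetAttribute (up to 163 KB per block on A100)
assert(shared_mem_per_block <= 48 * 1024);
```
2. Memory Optimization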
Check memory throughput:
```bash
ncu --metrics gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed \
./benchmark --kernel=tiled
```
If throughput is low, check:
- Are global memory accesses coalesced?
- Is shared memory used enough to avoid re-reading the same data from global memory?
- Are there shared-memory bank conflicts?
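To make the coalescing question concrete, the host-side C++ sketch below counts how many 32-byte memory sectors a single warp-wide load touches for a given stride. It is an illustrative model only: `sectors_per_warp_load` is a hypothetical helper, and it assumes 32 threads per warp, a sector-aligned base address, and naturally aligned elements no larger than one sector.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Number of distinct 32-byte sectors touched when thread t of a warp loads
// the element at index t * stride_elems. Fewer sectors for the same amount
// of useful data means better coalescing.
int sectors_per_warp_load(std::size_t stride_elems,
                          std::size_t elem_bytes = 4) {
    const std::size_t kWarpSize = 32;
    const std::size_t kSectorBytes = 32;
    assert(elem_bytes > 0 && elem_bytes <= kSectorBytes);

    std::set<std::size_t> sectors;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t byte_offset = t * stride_elems * elem_bytes;
        sectors.insert(byte_offset / kSectorBytes);
    }
    return static_cast<int>(sectors.size());
}
```

Under this model a unit-stride float load touches 4 sectors (128 bytes, all of them used), while any stride of 8 floats or more touches 32 sectors, so eight times as much data is moved for the same 128 useful bytes.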
3. Compute Optimization
Check compute throughput:
```bash
ncu --metrics sm__pipe_fma_cycles_active.avg.pct_of_peak_sustained_active \
./benchmark
```
If compute throughput is low:
- Increase computation per thread (register blocking), so each loaded value is reused in more FMAs
- Reduce synchronization overhead
- Use Tensor Cores
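The effect of register blocking can be estimated with a small calculation. In the sketch below (host-side C++, with the hypothetical helper name `fma_per_shared_load`), each thread is assumed to compute a `tile_m` × `tile_n` block of outputs and, per step of the inner loop, to load `tile_m` values of A and `tile_n` values of B from shared memory. This is a simplified model of a register-blocked GEMM, not a description of the project's kernels.

```cpp
#include <cassert>

// FMAs performed per shared-memory load in one inner-loop step of a
// register-blocked GEMM: tile_m * tile_n FMAs for tile_m + tile_n loads.
double fma_per_shared_load(int tile_m, int tile_n) {
    assert(tile_m > 0 && tile_n > 0);
    return static_cast<double>(tile_m * tile_n) / (tile_m + tile_n);
}
```

Under this model a 1×1 tile performs 0.5 FMAs per load, a 4×4 tile performs 2, and an 8×8 tile performs 4. The price is more registers per thread, which can lower occupancy.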
AutoTuner Usage
This project has a built-in AutoTuner for automatic parameter search:
```cpp
#include <iostream>
#include "autotuner.h"
// Define parameter space
AutoTuner tuner;
tuner.add_param("BLOCK_SIZE", {16, 32, 64, 128});
tuner.add_param("TILE_M", {4, 8, 16});
tuner.add_param("TILE_N", {4, 8, 16});
// Search for optimal configuration
auto best = tuner.search(
    [](const Config& cfg) {
        return benchmark_gemm(cfg);
    }
);
std::cout << "Best config: " << best << std::endl;
```
Performance Baseline
RTX 3080 Reference Performance (1024×1024)
| Kernel | Time (ms) | TFLOPS | vs cuBLAS |
|---|---|---|---|
| Naive | 15.2 | 0.14 | 10% |
| Tiled | 7.6 | 0.28 | 20% |
| Coalesced | 6.1 | 0.35 | 25% |
| Double Buffer | 3.8 | 0.56 | 40% |
| Register Blocked | 1.8 | 1.19 | 85% |
| Fused | 1.9 | 1.12 | 80% |
| Vectorized | 1.7 | 1.25 | 89% |
| cuBLAS | 1.5 | 1.40 | 100% |
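The TFLOPS column can be reproduced from the time column. The sketch below is host-side C++ with a hypothetical helper name, and it assumes a square GEMM with M = N = K = 1024 and the usual count of 2·N³ floating-point operations. Both are assumptions that match the table to within the rounding of the times.

```cpp
#include <cassert>

// Achieved TFLOPS of an n x n x n GEMM that ran in time_ms milliseconds,
// counting one multiply and one add per inner-product term (2 * n^3 FLOPs).
double gemm_tflops(long long n, double time_ms) {
    assert(n > 0 && time_ms > 0.0);
    const double flops = 2.0 * static_cast<double>(n) * n * n;
    const double seconds = time_ms * 1e-3;
    return flops / seconds / 1e12;
}
```

For example, `gemm_tflops(1024, 15.2)` is about 0.14 and `gemm_tflops(1024, 1.8)` is about 1.19, matching the Naive and Register Blocked rows.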
Performance Analysis Points
- Naive → Tiled: Shared-memory tiling reduces global memory accesses
- Tiled → Coalesced: Coalesced global accesses improve memory throughput
- Coalesced → Double Buffer: Overlapping loads with computation hides memory latency
- Double Buffer → Register Blocked: Higher arithmetic intensity per thread (biggest gain, 3.8 ms → 1.8 ms)
- Register Blocked → Vectorized: Vectorized loads reduce the number of memory instructions
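The size of each step can be checked against the reference table. This is a small host-side C++ sketch; `step_speedups` is a hypothetical helper, not part of the project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Speedup of each optimization step over the previous one:
// result[i] = times_ms[i] / times_ms[i + 1].
std::vector<double> step_speedups(const std::vector<double>& times_ms) {
    std::vector<double> speedups;
    for (std::size_t i = 0; i + 1 < times_ms.size(); ++i) {
        assert(times_ms[i + 1] > 0.0);
        speedups.push_back(times_ms[i] / times_ms[i + 1]);
    }
    return speedups;
}
```

Feeding in the table times for Naive, Tiled, Coalesced, Double Buffer, Register Blocked, and Vectorized (`{15.2, 7.6, 6.1, 3.8, 1.8, 1.7}`) gives roughly 2.0×, 1.25×, 1.61×, 2.11×, and 1.06×, so the register-blocking step is the largest single improvement, narrowly ahead of tiling.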
Common Issue Troubleshooting
Issue 1: Unstable Performance
Causes:
- GPU frequency fluctuation
- Thermal throttling
- System load
Solution:
```bash
# Check GPU status
nvidia-smi -q -d CLOCK,TEMPERATURE
# Lock GPU clock to 1710 MHz
sudo nvidia-smi -lgc 1710
# Restore default clock behavior after benchmarking
sudo nvidia-smi -rgc
```
Issue 2: Low Occupancy
Causes:
- Thread block too large or too small
- Too many registers used
- Too much shared memory used
Solution:
```bash
# View resource usage
ncu --metrics launch__registers_per_thread,\
launch__shared_mem_per_block_static,\
launch__shared_mem_per_block_dynamic \
./benchmark
```
Issue 3: Bank Conflicts
Detection:
```bash
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum \
./benchmark
```
Solution:
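- Add padding to the shared-memory array (for example, a row pitch of 33 floats instead of 32)
- Adjust the access pattern so that threads in a warp hit different banks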
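Why padding helps can be shown with a small host-side model. The C++ sketch below is illustrative only: `column_read_conflict_degree` is a hypothetical helper, and it assumes 32 banks that are each 4 bytes wide, with the 32 threads of a warp reading one column of a row-major float tile.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Worst-case bank conflict degree when thread t of a warp reads tile[t][col]
// from a row-major float tile in shared memory, i.e. the 4-byte word at
// index t * row_pitch + col. Returns 1 when the access is conflict-free and
// 32 when all threads hit the same bank.
int column_read_conflict_degree(int row_pitch, int col = 0) {
    assert(row_pitch > 0 && col >= 0);
    constexpr int kBanks = 32;     // assumed: 32 banks, 4 bytes each
    constexpr int kWarpSize = 32;

    std::array<int, kBanks> hits{};  // accesses per bank, zero-initialized
    for (int t = 0; t < kWarpSize; ++t) {
        ++hits[(t * row_pitch + col) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With a row pitch of 32 floats every thread lands in the same bank (degree 32), while a pitch of 33 spreads the warp across all 32 banks (degree 1). A pitch of 34 still leaves a 2-way conflict, which is why an odd padding is the usual choice.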