Performance Analysis

This document introduces CUDA performance analysis tools and optimization methods.


Nsight Compute

Nsight Compute is NVIDIA's kernel-level performance analysis tool.

Basic Usage

bash
# Run analysis
ncu ./benchmark

# Detailed analysis
ncu --set full ./benchmark

# Specify kernel
ncu -k regex:gemm ./benchmark

Key Metrics

bash
# View all available metrics
ncu --query-metrics

# Common metric combination
# (continuation lines are unindented so the comma-separated list stays one argument)
ncu --metrics \
gpu__time_duration.sum,\
sm__warps_active.avg.pct_of_peak_sustained_active,\
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,\
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum \
    ./benchmark

Metric Interpretation

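| Metric | Meaning | Target |
| --- | --- | --- |
| gpu__time_duration.sum | Kernel execution time | Lower is better |
| sm__warps_active.avg.pct_of_peak_sustained_active | Achieved occupancy (active warps as a share of the SM maximum) | > 80% |
| gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed | DRAM (global memory) throughput as a share of peak | > 80% for memory-bound kernels |
| l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum | Shared memory load bank conflicts | Near 0 |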

Nsight Systems

Nsight Systems is a system-level performance analysis tool. It shows kernel timelines and how CPU and GPU work overlap.

Basic Usage

bash
# Generate a timeline report (writes report.nsys-rep)
nsys profile -o report ./benchmark

# Open the report in the GUI
nsys-ui ./report.nsys-rep

Analysis Content

  • Kernel execution timeline
  • CPU-GPU concurrency
  • CUDA API calls
  • Memory transfers

Performance Optimization Methods

1. Occupancy Optimization

Occupancy = Active warps per SM / Maximum resident warps per SM

cuda
#include <cassert>

// Gather the inputs that determine occupancy (blocks are BLOCK_SIZE x BLOCK_SIZE threads)
int threads_per_block = BLOCK_SIZE * BLOCK_SIZE;
int blocks_per_sm = max_threads_per_sm / threads_per_block;  // Upper bound from the thread limit only
int registers_per_thread = ...;  // From Nsight Compute
int shared_mem_per_block = ...;

// Check per-block hardware limits
assert(threads_per_block <= 1024);
assert(registers_per_thread * threads_per_block <= 65536);
assert(shared_mem_per_block <= 48 * 1024);  // Default limit; A100 allows up to 163 KB per block with opt-in

2. Memory Optimization

Check memory throughput:

bash
ncu --metrics gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed \
    ./benchmark --kernel=tiled

If throughput is low and the kernel is memory-bound, check:

  • Are global memory accesses coalesced?
  • Is shared memory used to reuse data instead of re-reading global memory?
  • Are there shared memory bank conflicts?
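
To see why coalescing matters, it helps to count how many 32-byte memory sectors one warp-wide load touches. The following host-side sketch models that under stated assumptions: 32 threads per warp, 4-byte `float` elements, a base address aligned to a sector boundary, and thread `t` reading element `t * stride_elems`. The function name `sectors_per_warp_load` is hypothetical.

```cpp
#include <set>

// Count the distinct 32-byte sectors touched when each of the 32 threads in a
// warp loads one 4-byte float and thread t reads element t * stride_elems.
int sectors_per_warp_load(int stride_elems) {
    const int warp_size = 32;
    const int elem_bytes = 4;
    const int sector_bytes = 32;

    std::set<long> sectors;
    for (int t = 0; t < warp_size; ++t) {
        long byte_addr = static_cast<long>(t) * stride_elems * elem_bytes;
        sectors.insert(byte_addr / sector_bytes);
    }
    return static_cast<int>(sectors.size());
}
```

In this model a unit-stride load touches 4 sectors (128 useful bytes), a stride of 2 touches 8, and a stride of 32 (for example, walking down a column of a row-major matrix with 32 columns) touches 32 sectors to deliver the same 128 bytes.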

3. Compute Optimization

Check compute throughput:

bash
ncu --metrics sm__pipe_fma_cycles_active.avg.pct_of_peak_sustained_active \
    ./benchmark

If compute throughput is low:

  • Increase computation per thread (register blocking)
  • Reduce synchronization overhead
  • Use Tensor Cores

AutoTuner Usage

This project has a built-in AutoTuner for automatic parameter search:

cpp
#include <iostream>
#include "autotuner.h"

// Define parameter space
AutoTuner tuner;
tuner.add_param("BLOCK_SIZE", {16, 32, 64, 128});
tuner.add_param("TILE_M", {4, 8, 16});
tuner.add_param("TILE_N", {4, 8, 16});

// Search for optimal configuration
auto best = tuner.search(
    [](const Config& cfg) {
        return benchmark_gemm(cfg);
    }
);

std::cout << "Best config: " << best << std::endl;
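
For intuition about what a parameter search like this involves, here is a minimal exhaustive grid search. It is an illustrative sketch only and does not describe how the project's AutoTuner is implemented: `TuneConfig`, `TuneSpace`, and `grid_search` are hypothetical names, and the cost function is assumed to return a value where lower is better (for example, kernel time in ms).

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

using TuneConfig = std::map<std::string, int>;
using TuneSpace = std::vector<std::pair<std::string, std::vector<int>>>;

// Evaluate every combination of parameter values and return the one with the
// lowest cost. Returns an empty config if any parameter has no candidates.
TuneConfig grid_search(const TuneSpace& space,
                       const std::function<double(const TuneConfig&)>& cost) {
    TuneConfig best;
    for (const auto& param : space) {
        if (param.second.empty()) return best;
    }

    std::vector<std::size_t> idx(space.size(), 0);  // Current index per parameter
    double best_cost = std::numeric_limits<double>::infinity();

    while (true) {
        // Build the configuration selected by the current indices
        TuneConfig cfg;
        for (std::size_t i = 0; i < space.size(); ++i) {
            cfg[space[i].first] = space[i].second[idx[i]];
        }

        double c = cost(cfg);
        if (c < best_cost) {
            best_cost = c;
            best = cfg;
        }

        // Advance the indices like an odometer, carrying into the next parameter
        std::size_t i = 0;
        while (i < space.size() && ++idx[i] == space[i].second.size()) {
            idx[i] = 0;
            ++i;
        }
        if (i == space.size()) break;  // Every combination has been visited
    }
    return best;
}
```

The number of evaluations is the product of the candidate counts, so the three parameters in the example above would need 4 × 3 × 3 = 36 benchmark runs with this approach.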

Performance Baseline

RTX 3080 Reference Performance (1024×1024)

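| Kernel | Time (ms) | TFLOPS | vs cuBLAS |
| --- | --- | --- | --- |
| Naive | 15.2 | 0.14 | 10% |
| Tiled | 7.6 | 0.28 | 20% |
| Coalesced | 6.1 | 0.35 | 25% |
| Double Buffer | 3.8 | 0.56 | 40% |
| Register Blocked | 1.8 | 1.19 | 85% |
| Fused | 1.9 | 1.12 | 80% |
| Vectorized | 1.7 | 1.25 | 89% |
| cuBLAS | 1.5 | 1.40 | 100% |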

Performance Analysis Points

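  1. Naive → Tiled: Shared memory tiling reduces global memory traffic (15.2 → 7.6 ms, 2.0×)
  2. Tiled → Coalesced: Coalesced access improves memory throughput (7.6 → 6.1 ms, about 1.2×)
  3. Coalesced → Double Buffer: Overlapping loads with computation hides memory latency (6.1 → 3.8 ms, 1.6×)
  4. Double Buffer → Register Blocked: Higher arithmetic intensity per thread (3.8 → 1.8 ms, 2.1×, the largest relative gain)
  5. Register Blocked → Vectorized: Vectorized loads reduce the number of load instructions (1.8 → 1.7 ms)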

Common Issue Troubleshooting

Issue 1: Unstable Performance

Causes:

  • GPU frequency fluctuation
  • Thermal throttling
  • System load

Solution:

bash
# Check GPU status
nvidia-smi -q -d CLOCK,TEMPERATURE

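# Lock the GPU clock (1710 MHz here; choose a clock your GPU supports)
sudo nvidia-smi -lgc 1710

# Restore default clock behavior after benchmarking
sudo nvidia-smi -rgc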

Issue 2: Low Occupancy

Causes:

  • Thread block too large or too small
  • Too many registers used
  • Too much shared memory used

Solution:

bash
# View resource usage
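# (continuation lines are unindented so the comma-separated list stays one argument)
ncu --metrics \
launch__registers_per_thread,\
launch__shared_mem_per_block_static,\
launch__shared_mem_per_block_dynamic \
    ./benchmark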

Issue 3: Bank Conflicts

Detection:

bash
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum \
    ./benchmark

Solution:

  • Add padding
  • Adjust access pattern
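
The effect of padding can be checked with a small host-side model of shared memory banking. The sketch assumes 32 banks that are each 4 bytes wide, a tile of `float` values with 32 columns in use, and a warp in which thread `t` reads `tile[t][col]` (a column access). The function name `column_access_conflict_degree` is hypothetical.

```cpp
#include <algorithm>
#include <array>

// Worst-case number of threads in a warp that hit the same bank when thread t
// reads tile[t][col] from a float tile stored with the given row stride.
// A result of 1 means conflict-free; 32 means fully serialized.
int column_access_conflict_degree(int row_stride) {
    const int banks = 32;
    const int warp_size = 32;

    int worst = 0;
    for (int col = 0; col < 32; ++col) {
        std::array<int, 32> hits{};  // Threads mapped to each bank, zeroed
        for (int t = 0; t < warp_size; ++t) {
            int word_index = t * row_stride + col;
            ++hits[word_index % banks];
        }
        worst = std::max(worst, *std::max_element(hits.begin(), hits.end()));
    }
    return worst;
}
```

With a row stride of 32 (a `[32][32]` tile) every thread in the warp lands in the same bank, giving a 32-way conflict. Padding each row by one element (a `[32][33]` tile, stride 33) spreads the threads over all 32 banks and the model reports 1.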

MIT License | CUDA GEMM optimization tutorial