Performance Analysis

This document introduces CUDA performance analysis tools and optimization methods.

Nsight Compute

Nsight Compute is NVIDIA's kernel-level performance analysis tool.

Basic Usage

bash

# Run analysis
ncu ./benchmark

# Detailed analysis
ncu --set full ./benchmark

# Specify kernel
ncu -k regex:gemm ./benchmark

Key Metrics

bash

# View all available metrics
ncu --query-metrics

# Common metric combination
ncu --metrics \
    gpu__time_duration.sum,\
    sm__warps_active.avg.pct_of_peak,\
    gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,\
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum \
    ./benchmark

Metric Interpretation

Metric	Meaning	Target
gpu__time_duration.sum	Kernel execution time	Lower is better
sm__warps_active.avg.pct_of_peak	Active warp ratio	> 80%
gpu__dram_throughput	Global memory throughput	> 80%
l1tex__data_bank_conflicts	Bank conflict count	Near 0

Nsight Systems

Nsight Systems is a system-level performance analysis tool for analyzing kernel timeline and concurrency.

Basic Usage

bash

# Generate timeline report
nsys profile ./benchmark

# View report
nsys-ui ./report.nsys-rep

Analysis Content

Kernel execution timeline
CPU-GPU concurrency
CUDA API calls
Memory transfers

Performance Optimization Methods

1. Occupancy Optimization

Occupancy = Active warps / Maximum warps

cuda

// Calculate occupancy
int threads_per_block = BLOCK_SIZE * BLOCK_SIZE;
int blocks_per_sm = max_threads_per_sm / threads_per_block;
int registers_per_thread = ...;  // From Nsight Compute
int shared_mem_per_block = ...;

// Check constraints
assert(threads_per_block <= 1024);
assert(registers_per_thread * threads_per_block <= 65536);
assert(shared_mem_per_block <= 48 * 1024);  // Or 164KB for A100

2. Memory Optimization

Check memory throughput:

bash

ncu --metrics gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed \
    ./benchmark --kernel=tiled

If throughput is low, check:

Is coalesced access correct?
Is shared memory used sufficiently?
Are there bank conflicts?

3. Compute Optimization

Check compute throughput:

bash

ncu --metrics sm__pipe_fma_cycles_active.avg.pct_of_peak \
    ./benchmark

If compute throughput is low:

Increase computation per thread (register blocking)
Reduce synchronization overhead
Utilize Tensor Core

AutoTuner Usage

This project has a built-in AutoTuner for automatic parameter search:

cpp

#include "autotuner.h"

// Define parameter space
AutoTuner tuner;
tuner.add_param("BLOCK_SIZE", {16, 32, 64, 128});
tuner.add_param("TILE_M", {4, 8, 16});
tuner.add_param("TILE_N", {4, 8, 16});

// Search for optimal configuration
auto best = tuner.search(
    [](const Config& cfg) {
        return benchmark_gemm(cfg);
    }
);

std::cout << "Best config: " << best << std::endl;

Performance Baseline

RTX 3080 Reference Performance (1024×1024)

Kernel	Time (ms)	TFLOPS	vs cuBLAS
Naive	15.2	0.14	10%
Tiled	7.6	0.28	20%
Coalesced	6.1	0.35	25%
Double Buffer	3.8	0.56	40%
Register Blocked	1.8	1.19	85%
Fused	1.9	1.12	80%
Vectorized	1.7	1.25	89%
cuBLAS	1.5	1.40	100%