Performance Analysis

This document covers CUDA profiling tools and optimization techniques.

Nsight Compute

Nsight Compute is NVIDIA's kernel-level profiler.

Basic Usage

bash

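# Profile every kernel with the default section set
ncu ./benchmark

# Collect the full section set (more detail, slower)
ncu --set full ./benchmark

# Profile only kernels whose name matches the regex "gemm"
ncu -k regex:gemm ./benchmark

Key Metrics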

bash

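# List all available metrics
ncu --query-metrics

# A commonly used combination. The list must reach ncu as one
# comma-separated argument, so it is built without whitespace between entries.
METRICS=gpu__time_duration.sum
METRICS+=,sm__warps_active.avg.pct_of_peak_sustained_active
METRICS+=,gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
METRICS+=,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum
ncu --metrics "$METRICS" ./benchmark

Interpreting the Metrics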

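Metric	Meaning	Target
gpu__time_duration.sum	Kernel execution time	Lower is better
sm__warps_active.avg.pct_of_peak_sustained_active	Achieved occupancy (active warps as a percentage of the SM maximum)	> 80%
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed	Global memory (DRAM) throughput as a percentage of peak	> 80% for memory-bound kernels
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum	Number of shared-memory load bank conflicts	Close to 0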

Nsight Systems

Nsight Systems is a system-level profiler, used to analyze the kernel timeline and concurrency.

Basic Usage

bash

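# Record a timeline report (-o sets the output name: report.nsys-rep)
nsys profile -o report ./benchmark

# Open the report in the GUI
nsys-ui ./report.nsys-rep

What to Analyze

Kernel execution timeline
CPU-GPU concurrency
CUDA API calls
Memory transfers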

Optimization Methods

1. Occupancy Optimization

Occupancy = active warps per SM / maximum resident warps per SM

cuda

// Inputs to an occupancy estimate
int threads_per_block = BLOCK_SIZE * BLOCK_SIZE;
int blocks_per_sm = max_threads_per_sm / threads_per_block;  // upper bound from the thread limit alone
int registers_per_thread = ...;  // read from Nsight Compute
int shared_mem_per_block = ...;

// Check the hard limits
assert(threads_per_block <= 1024);
assert(registers_per_thread * threads_per_block <= 65536);  // 64K 32-bit registers per SM
assert(shared_mem_per_block <= 48 * 1024);  // default limit; with opt-in, up to 99 KB on RTX 3080 or 163 KB on A100
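
The limit checks above can be combined into a rough estimate of theoretical occupancy. The following is a minimal host-side C++ sketch, not part of this project: `SmLimits`, `Occupancy`, and `estimate_occupancy` are hypothetical names, and the default limits are the compute capability 8.6 values (RTX 3080). The model ignores register and shared-memory allocation granularity, so it can overestimate. For an authoritative number, the CUDA runtime provides `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.

```cpp
#include <algorithm>

// Per-SM resource limits. Defaults assume compute capability 8.6
// (e.g. RTX 3080); pass other values for a different architecture.
struct SmLimits {
    int max_warps = 48;                 // resident warps per SM
    int max_blocks = 16;                // resident blocks per SM
    int registers = 65536;              // 32-bit registers per SM
    int shared_mem_bytes = 100 * 1024;  // shared memory per SM
    int warp_size = 32;
};

struct Occupancy {
    int blocks_per_sm;  // resident blocks allowed by the tightest limit
    double fraction;    // active warps / maximum warps
};

// Simplified model (hypothetical helper): each resource caps the number of
// resident blocks, and the smallest cap wins.
// Precondition: threads_per_block > 0.
Occupancy estimate_occupancy(int threads_per_block, int regs_per_thread,
                             int smem_per_block,
                             const SmLimits& sm = SmLimits{}) {
    int warps_per_block = (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    int by_warps = sm.max_warps / warps_per_block;
    int regs_per_block = regs_per_thread * warps_per_block * sm.warp_size;
    int by_regs = regs_per_block > 0 ? sm.registers / regs_per_block : sm.max_blocks;
    int by_smem = smem_per_block > 0 ? sm.shared_mem_bytes / smem_per_block : sm.max_blocks;
    int blocks = std::min({by_warps, by_regs, by_smem, sm.max_blocks});
    return {blocks, static_cast<double>(blocks * warps_per_block) / sm.max_warps};
}
```

Under these assumptions, a block of 256 threads at 64 registers per thread is register-limited to 4 blocks per SM, which is 32 of 48 warps (about 67% occupancy). Dropping to 32 registers per thread lifts the limit to 6 blocks and full occupancy.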

2. Memory Optimization

Check memory throughput:

bash

ncu --metrics gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed \
    ./benchmark --kernel=tiled

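If throughput is low, check:

Whether global memory accesses are coalesced
Whether shared memory is used enough to cut global memory traffic
Whether there are bank conflicts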
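
To illustrate why coalescing matters, the sketch below counts how many 32-byte memory sectors one warp touches for a given access stride. It is a simplified host-side model, not project code: `sectors_touched` is a hypothetical helper, and it assumes 32 threads per warp, a sector-aligned base address, and lane `i` reading element `i * stride`.

```cpp
#include <cstddef>
#include <set>

// Number of distinct 32-byte sectors touched when lane i of a warp reads
// the element at index i * stride_elems (hypothetical model helper).
int sectors_touched(std::size_t stride_elems, std::size_t elem_bytes = 4) {
    std::set<std::size_t> sectors;
    for (std::size_t lane = 0; lane < 32; ++lane) {
        std::size_t byte_offset = lane * stride_elems * elem_bytes;
        sectors.insert(byte_offset / 32);  // sector index of this access
    }
    return static_cast<int>(sectors.size());
}
```

With 4-byte elements, a stride of 1 touches 4 sectors (128 bytes fetched for 128 bytes used), while a stride of 32 touches 32 sectors (1024 bytes fetched for the same 128 bytes used).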

3. Compute Optimization

Check compute throughput:

bash

ncu --metrics sm__pipe_fma_cycles_active.avg.pct_of_peak_sustained_active \
    ./benchmark

If compute throughput is low:

Increase the work done per thread (register blocking)
Reduce synchronization overhead
Use Tensor Cores
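
The effect of register blocking can be estimated with simple arithmetic. The sketch below assumes each thread computes a TM×TN output tile and, for every step along K, loads TM values of A and TN values of B from shared memory before issuing TM·TN FMAs. Both function names are hypothetical and the model ignores compiler register reuse.

```cpp
// FMAs issued per value loaded from shared memory for a TM x TN
// per-thread tile (hypothetical model helper).
double fma_per_load(int tm, int tn) {
    return static_cast<double>(tm * tn) / (tm + tn);
}

// Registers needed for the accumulators plus one A and one B fragment.
// This is a lower bound on what the tile costs per thread.
int tile_registers(int tm, int tn) {
    return tm * tn + tm + tn;
}
```

Under these assumptions a 1×1 tile gives 0.5 FMA per load, a 4×4 tile gives 2, and an 8×8 tile gives 4. The 8×8 tile also needs at least 80 registers per thread, so larger tiles trade occupancy for arithmetic intensity.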

Using the AutoTuner

The project ships with an AutoTuner that searches for the best parameters automatically:

cpp

#include <iostream>

#include "autotuner.h"

// Define the parameter space
AutoTuner tuner;
tuner.add_param("BLOCK_SIZE", {16, 32, 64, 128});
tuner.add_param("TILE_M", {4, 8, 16});
tuner.add_param("TILE_N", {4, 8, 16});

// Search for the best configuration
auto best = tuner.search(
    [](const Config& cfg) {
        return benchmark_gemm(cfg);
    }
);

std::cout << "Best config: " << best << std::endl;
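
The text does not show how the AutoTuner searches, so the following is only an illustration of the simplest strategy such a tool could use: an exhaustive grid search over the Cartesian product of the parameter values. `SketchConfig` and `grid_search` are hypothetical names and are not the project's `Config` or `AutoTuner::search`.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tuning configuration: parameter name -> value.
using SketchConfig = std::map<std::string, int>;
using ParamSpace = std::vector<std::pair<std::string, std::vector<int>>>;

// Evaluates every combination in the space and returns the one with the
// lowest cost (for example, the measured kernel time in milliseconds).
SketchConfig grid_search(const ParamSpace& space,
                         const std::function<double(const SketchConfig&)>& cost) {
    SketchConfig best, current;
    double best_cost = std::numeric_limits<double>::infinity();
    std::function<void(std::size_t)> visit = [&](std::size_t depth) {
        if (depth == space.size()) {  // all parameters assigned: evaluate
            double c = cost(current);
            if (c < best_cost) {
                best_cost = c;
                best = current;
            }
            return;
        }
        for (int value : space[depth].second) {
            current[space[depth].first] = value;
            visit(depth + 1);
        }
    };
    visit(0);
    return best;
}
```

The parameter space from the example above has 4 × 3 × 3 = 36 combinations, so an exhaustive search is cheap there. Larger spaces usually call for pruning invalid combinations (for instance those that exceed the shared memory limit) before benchmarking.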

Benchmarks

RTX 3080 Reference Performance (1024×1024)

Kernel	Time (ms)	TFLOPS	vs cuBLAS
Naive	15.2	0.14	10%
Tiled	7.6	0.28	20%
Coalesced	6.1	0.35	25%
Double Buffer	3.8	0.56	40%
Register Blocked	1.8	1.19	85%
Fused	1.9	1.12	80%
Vectorized	1.7	1.25	89%
cuBLAS	1.5	1.40	100%
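
The TFLOPS column can be related to the time column with the usual GEMM operation count of 2·M·N·K floating-point operations (one multiply and one add per inner-product term). The helper below is a small sketch under that assumption, and `gemm_tflops` is a hypothetical name. Because the times in the table are rounded, recomputed values can differ slightly from the listed ones.

```cpp
// Achieved throughput in TFLOPS for an M x N x K GEMM that took time_ms
// milliseconds, counting 2 * M * N * K floating-point operations.
double gemm_tflops(double m, double n, double k, double time_ms) {
    double flops = 2.0 * m * n * k;
    double seconds = time_ms * 1e-3;
    return flops / seconds / 1e12;
}
```

For the Naive row, `gemm_tflops(1024, 1024, 1024, 15.2)` evaluates to roughly 0.14, which matches the table.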

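Key Takeaways

Naive → Tiled: shared memory reduces global memory accesses
Tiled → Coalesced: coalesced accesses raise memory throughput
Coalesced → Double Buffer: latency hiding
Double Buffer → Register Blocked: higher arithmetic intensity (the largest gain)
Register Blocked → Vectorized: vectorized loads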

Troubleshooting

Problem 1: Unstable Performance

Causes:

GPU clock fluctuation
Thermal throttling
System load

Fixes:

bash

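# Check GPU clocks and temperature
nvidia-smi -q -d CLOCK,TEMPERATURE

# Lock the GPU clock to a fixed frequency in MHz (min,max)
sudo nvidia-smi -lgc 1710,1710

# Restore the default clock behavior when done
sudo nvidia-smi -rgc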
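
Even with locked clocks, a single measurement can be noisy. A common remedy is to time the kernel several times and report the median, which is robust against the occasional slow run. The sketch below shows only the statistics, and it assumes the timings were collected elsewhere (for example with CUDA events). `median_ms` and `relative_spread` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of a set of timings. Takes the vector by value so the caller's
// data is left unsorted. Precondition: samples is non-empty.
double median_ms(std::vector<double> samples) {
    std::sort(samples.begin(), samples.end());
    std::size_t n = samples.size();
    return n % 2 ? samples[n / 2]
                 : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// (max - min) / median: a quick indicator of run-to-run instability.
// Precondition: samples is non-empty and the median is non-zero.
double relative_spread(const std::vector<double>& samples) {
    auto mm = std::minmax_element(samples.begin(), samples.end());
    return (*mm.second - *mm.first) / median_ms(samples);
}
```

A large relative spread across repeated runs suggests that one of the causes listed above is still in play.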

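Problem 2: Low Occupancy

Causes:

Thread blocks that are too large or too small
Too many registers per thread
Too much shared memory per block

Fixes: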

bash

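# Inspect per-kernel resource usage (the metric list must contain no whitespace)
ncu --metrics launch__registers_per_thread,launch__shared_mem_per_block_static,launch__shared_mem_per_block_dynamic \
    ./benchmark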

Problem 3: Bank Conflicts

Detection:

bash

ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum \
    ./benchmark

Fixes:

Add padding
Change the access pattern
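
The effect of padding can be checked with a small model of the shared-memory banks. The sketch assumes 32 banks of 4 bytes each and a warp in which lane `i` reads element `[i][c]` of a row-major tile of 4-byte words, which is a column access. `conflict_degree` is a hypothetical helper, not project code.

```cpp
#include <algorithm>
#include <array>

// Worst-case number of lanes that hit the same bank when lane i reads word
// i * row_stride_words (a column access in a row-major shared-memory tile).
// 1 means conflict-free; 32 means the access is fully serialized.
int conflict_degree(int row_stride_words) {
    std::array<int, 32> hits{};  // accesses per bank, zero-initialized
    for (int lane = 0; lane < 32; ++lane) {
        ++hits[(lane * row_stride_words) % 32];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

In this model a tile declared with a row stride of 32 words gives a 32-way conflict on column accesses, while padding each row by one word (stride 33) makes the same access conflict-free.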
