vs cuBLAS

This document compares Mini-Inference Engine with NVIDIA cuBLAS in detail.

Introduction to cuBLAS

cuBLAS is NVIDIA's official BLAS (Basic Linear Algebra Subprograms) library and the de facto standard for matrix operations on GPUs.

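Features

Feature	Description
Performance	The reference point for peak GPU matrix performance
Coverage	Supports all NVIDIA GPU architectures
Precision	FP32/FP16/TF32/INT8/FP64
Hardware	Tensor Core acceleration
API	Complete BLAS Level 1/2/3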

Usage

cpp

#include <cublas_v2.h>

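// Create the handle
cublasHandle_t handle;
cublasCreate(&handle);

// Run GEMM: C = alpha * A * B + beta * C,
// with row-major A (M×K), B (K×N), and C (M×N) on the device.
// cuBLAS is column-major, so compute C^T = B^T * A^T by swapping the operands.
float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle,
    CUBLAS_OP_N, CUBLAS_OP_N,
    N, M, K,
    &alpha, d_B, N, d_A, K,
    &beta, d_C, N);

// Destroy the handle
cublasDestroy(handle);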
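
cuBLAS assumes column-major storage, so a row-major product C = A·B is usually obtained by computing C^T = B^T·A^T, that is, by swapping the operands and the M/N dimensions. The host-only sketch below illustrates why this works. `gemm_col_major` is a hypothetical CPU stand-in that mirrors the no-transpose semantics of `cublasSgemm`, and `gemm_row_major` is an illustrative wrapper; neither is part of cuBLAS or of this project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only reference using the column-major convention of cublasSgemm
// for the no-transpose case: C (m×n) = alpha * A (m×k) * B (k×n) + beta * C.
// Element (i, j) of a column-major matrix X lives at X[i + j * ldx].
void gemm_col_major(int m, int n, int k, float alpha,
                    const float* A, int lda, const float* B, int ldb,
                    float beta, float* C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}

// Row-major C (M×N) = A (M×K) * B (K×N), expressed through the column-major
// routine. A row-major M×N buffer is the same memory as a column-major N×M
// matrix, so we compute C^T = B^T * A^T by swapping the operands.
std::vector<float> gemm_row_major(int M, int N, int K,
                                  const std::vector<float>& A,
                                  const std::vector<float>& B) {
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    gemm_col_major(N, M, K, 1.0f, B.data(), N, A.data(), K, 0.0f, C.data(), N);
    return C;
}
```

The argument order passed to `gemm_col_major` (N, M, K, then B with leading dimension N, then A with leading dimension K) is the same order a row-major caller would pass to `cublasSgemm`.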

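Relationship Between This Project and cuBLAS

Positioning

Aspect	cuBLAS	This project
Goal	Production-grade performance	Teaching and understanding
Code	Closed source	Open source and readable
Optimization	Selects the best kernel automatically	Shown manually, one level at a time
API	Standard BLAS	Simplified interface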

Performance Comparison (RTX 3080, 1024×1024)

Kernel	Time (ms)	% of cuBLAS performance
L1 Naive	15.23	10.2%
L2 Tiled	7.61	20.4%
L3 Coalesced	6.12	25.3%
L4 Double Buffer	3.85	40.8%
L5 Register Blocked	1.82	86.5%
L6 Fused	1.91	80.2%
L7 Vectorized	1.71	91.2%
cuBLAS	1.56	100%
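
Timings like those above are often converted to throughput so that different problem sizes can be compared. The sketch below is a minimal illustration: the function names are hypothetical, K is assumed to equal 1024 (the heading gives only 1024×1024), and the relative figure is assumed to be baseline time divided by kernel time.

```cpp
#include <cassert>

// FLOPs for one GEMM: each of the M*N outputs needs K multiplies and K adds.
double gemm_flops(long long M, long long N, long long K) {
    return 2.0 * static_cast<double>(M) * static_cast<double>(N) *
           static_cast<double>(K);
}

// Achieved throughput in GFLOP/s for a kernel that took time_ms milliseconds.
double gemm_gflops(long long M, long long N, long long K, double time_ms) {
    assert(time_ms > 0.0);
    return gemm_flops(M, N, K) / (time_ms * 1e-3) / 1e9;
}

// Performance relative to a baseline, in percent
// (assumption: baseline time divided by kernel time).
double percent_of_baseline(double baseline_ms, double kernel_ms) {
    assert(kernel_ms > 0.0);
    return 100.0 * baseline_ms / kernel_ms;
}
```

For example, `percent_of_baseline(1.56, 15.23)` relates the cuBLAS timing to the L1 Naive timing from the table.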

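cuBLAS Optimization Techniques

The techniques cuBLAS uses include, but are not limited to:

1. Multiple kernel variants

cuBLAS maintains multiple kernel variants for different matrix sizes and selects one automatically at run time: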

cpp

// Pseudocode (thresholds are illustrative)
if (M < 128 && N < 128) {
    gemm_small();
} else if (M > 4096 && N > 4096) {
    gemm_large();
} else {
    gemm_medium();
}

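2. Tensor Core

On GPUs that support them, cuBLAS uses Tensor Cores:

FP16: one 4×4×4 matrix multiply-accumulate per Tensor Core per clock (Volta)
TF32 (Ampere and later): inputs with an 8-bit exponent and a 10-bit mantissa, accumulated in FP32; exposed through WMMA as 16×16×8 tiles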
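
To give a feel for the precision trade-off of TF32, the host-only sketch below reduces FP32 inputs to a 10-bit mantissa and accumulates in FP32. It is an emulation under stated assumptions: `to_tf32` and `dot_tf32` are hypothetical helper names, and truncation is used for simplicity, whereas hardware conversion may round instead.

```cpp
#include <cstdint>
#include <cstring>

// Reduce an FP32 value to TF32 precision (sign, 8-bit exponent, 10-bit
// mantissa) by clearing the 13 low mantissa bits. Truncation is a
// simplification chosen for this sketch.
float to_tf32(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;  // keep sign (1) + exponent (8) + top 10 mantissa bits
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}

// Dot product with TF32-precision inputs and FP32 accumulation, the numeric
// pattern of a reduced-precision multiply-accumulate.
float dot_tf32(const float* a, const float* b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += to_tf32(a[i]) * to_tf32(b[i]);
    }
    return acc;
}
```

Values that differ only below the 10th mantissa bit become indistinguishable after `to_tf32`, which is why TF32 results can differ slightly from a pure FP32 GEMM.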

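3. Pipelining

cuBLAS uses software pipelining to hide memory latency:

cuda

// Pseudocode: overlap the load of tile i+1 with the compute on tile i
load_async(tile[0]);                 // prologue: prefetch the first tile
#pragma unroll
for (int i = 0; i < NUM_TILES; i++) {
    wait_for(tile[i]);               // make sure tile i has arrived
    if (i + 1 < NUM_TILES) load_async(tile[i + 1]);
    compute(tile[i]);
}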
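
The buffer-swapping pattern behind software pipelining can be sketched on the host. The code below is a sequential emulation, so nothing actually overlaps; it only shows the prologue load, the "issue next load, then compute current" ordering, and the buffer swap. `load_tile` and `pipelined_sum` are hypothetical names, and the copy stands in for an asynchronous global-to-shared-memory load.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Copy one tile of `data` into `buf` (stand-in for an asynchronous load).
// Returns the number of elements copied.
std::size_t load_tile(const std::vector<float>& data, std::size_t tile_idx,
                      std::size_t tile_size, std::vector<float>& buf) {
    std::size_t begin = std::min(tile_idx * tile_size, data.size());
    std::size_t end = std::min(begin + tile_size, data.size());
    std::copy(data.begin() + begin, data.begin() + end, buf.begin());
    return end - begin;
}

// Sum all elements tile by tile with two buffers: tile i+1 is staged into
// buf[1 - cur] before tile i is consumed from buf[cur].
float pipelined_sum(const std::vector<float>& data, std::size_t tile_size) {
    if (data.empty() || tile_size == 0) return 0.0f;
    std::size_t num_tiles = (data.size() + tile_size - 1) / tile_size;
    std::vector<float> buf[2] = {std::vector<float>(tile_size),
                                 std::vector<float>(tile_size)};
    std::size_t count[2] = {0, 0};
    int cur = 0;
    count[cur] = load_tile(data, 0, tile_size, buf[cur]);  // prologue
    float total = 0.0f;
    for (std::size_t i = 0; i < num_tiles; ++i) {
        if (i + 1 < num_tiles) {  // issue the next load before computing
            count[1 - cur] = load_tile(data, i + 1, tile_size, buf[1 - cur]);
        }
        for (std::size_t j = 0; j < count[cur]; ++j) total += buf[cur][j];
        cur = 1 - cur;  // swap buffers
    }
    return total;
}
```

On a GPU the load into the second buffer would run concurrently with the compute on the first, which is where the latency hiding comes from.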

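4. Tile parameter tuning

For each GPU architecture, cuBLAS ships pre-tuned tile parameters. The values below are illustrative, since cuBLAS does not publish its internal configurations:

GPU architecture	BM	BN	BK	TM	TN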
Volta	128	128	32	8	8
Ampere	128	128	16	8	4
Hopper	128	128	16	8	4
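
Tile parameters determine the launch configuration and the shared-memory footprint of a kernel. The sketch below derives both under two assumptions that the table does not state: each thread computes one TM×TN sub-tile, and the A and B tiles are stored in shared memory as FP32. `TileConfig`, `threads_per_block`, and `smem_bytes` are hypothetical names.

```cpp
#include <cstddef>

// Tile configuration using the column names BM, BN, BK, TM, TN.
struct TileConfig {
    int BM, BN, BK;  // block tile: BM×BN outputs per block, BK-deep slices
    int TM, TN;      // thread tile: TM×TN outputs per thread
};

// Threads per block, assuming each thread owns one TM×TN sub-tile.
int threads_per_block(const TileConfig& c) {
    return (c.BM / c.TM) * (c.BN / c.TN);
}

// Shared-memory bytes for one A tile (BM×BK) and one B tile (BK×BN) of FP32,
// multiplied by the number of pipeline stages (2 for double buffering).
std::size_t smem_bytes(const TileConfig& c, int stages) {
    std::size_t elems = static_cast<std::size_t>(c.BM) * c.BK +
                        static_cast<std::size_t>(c.BK) * c.BN;
    return elems * sizeof(float) * static_cast<std::size_t>(stages);
}
```

Under these assumptions, a 128×128×32 block tile with 8×8 thread tiles needs 256 threads and 32 KiB of shared memory per stage, which shows how quickly larger tiles consume per-block resources.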

What This Project Can Learn from cuBLAS

1. Performance analysis method

Use Nsight Compute to profile cuBLAS:

bash

# Profile the cuBLAS kernel
ncu --set full ./benchmark --kernel=cublas

2. Parameter tuning approach

The design of this project's AutoTuner follows cuBLAS's tuning approach:

cpp

// Search for the best parameters
AutoTuner tuner;
tuner.add_param("BLOCK_M", {64, 128, 256});
tuner.add_param("BLOCK_N", {64, 128, 256});
tuner.add_param("BLOCK_K", {8, 16, 32});
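
A parameter search of this kind can be as simple as an exhaustive grid search. The sketch below is a hypothetical minimal tuner, not the project's AutoTuner: `GridTuner`, `Config`, and `search` are illustrative names, and the cost function is supplied by the caller (in practice it would compile or launch a kernel with the given parameters and return its measured time).

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Config = std::map<std::string, int>;

// Hypothetical minimal grid-search tuner (not the project's AutoTuner).
class GridTuner {
public:
    void add_param(const std::string& name, std::vector<int> values) {
        params_.emplace_back(name, std::move(values));
    }

    // Evaluate `cost` on every combination and return the cheapest one.
    Config search(const std::function<double(const Config&)>& cost) const {
        Config current, best;
        double best_cost = std::numeric_limits<double>::infinity();
        visit(0, current, cost, best, best_cost);
        return best;
    }

private:
    // Depth-first walk over the Cartesian product of all parameter values.
    void visit(std::size_t depth, Config& current,
               const std::function<double(const Config&)>& cost,
               Config& best, double& best_cost) const {
        if (depth == params_.size()) {
            double c = cost(current);
            if (c < best_cost) { best_cost = c; best = current; }
            return;
        }
        for (int v : params_[depth].second) {
            current[params_[depth].first] = v;
            visit(depth + 1, current, cost, best, best_cost);
        }
    }

    std::vector<std::pair<std::string, std::vector<int>>> params_;
};
```

With three parameters of three values each, the search evaluates 27 configurations. Exhaustive search is only practical for small grids like this one; larger spaces usually need pruning of invalid combinations first.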

3. Multi-version strategy

Multiple kernel versions can be maintained for different scenarios:

cpp

if (K < 64) {
    gemm_small_k();
} else {
    gemm_general();
}
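
One way to keep such dispatch logic testable is to separate the selection from the launch. The sketch below is an illustration only: `GemmVariant`, `select_variant`, and `variant_name` are hypothetical names, and the thresholds are placeholders taken from the pseudocode in this document rather than tuned values.

```cpp
#include <string>

enum class GemmVariant { SmallK, SmallTile, LargeTile, General };

// Pick a kernel variant from the problem shape. The thresholds are
// placeholders for illustration; real values should come from benchmarks.
GemmVariant select_variant(int M, int N, int K) {
    if (K < 64) return GemmVariant::SmallK;
    if (M < 128 && N < 128) return GemmVariant::SmallTile;
    if (M > 4096 && N > 4096) return GemmVariant::LargeTile;
    return GemmVariant::General;
}

// Human-readable name, for example for benchmark logs.
std::string variant_name(GemmVariant v) {
    switch (v) {
        case GemmVariant::SmallK:    return "gemm_small_k";
        case GemmVariant::SmallTile: return "gemm_small";
        case GemmVariant::LargeTile: return "gemm_large";
        case GemmVariant::General:   return "gemm_general";
    }
    return "unknown";
}
```

Because `select_variant` is a pure function of the shape, its thresholds can be unit-tested on the host and adjusted from benchmark results without touching the kernels.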

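When to Use cuBLAS

Use cuBLAS for:

Production environments
Workloads that need peak performance
Standard BLAS operations

Use this project for:

Learning GPU programming
Understanding optimization principles
Developing custom kernels
Interview preparation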

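Summary

Aspect	cuBLAS	This project
Performance	Highest	~85-90% of cuBLAS
Readability	Closed source	Fully open source
Learning value	Low	High
Production-ready	Yes	Teaching use only

Recommendation: first use this project to understand the principles, then read the CUTLASS source code, and finally use cuBLAS in production.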
