vs cuBLAS

This document compares Mini-Inference Engine with NVIDIA cuBLAS in detail.

cuBLAS Overview

cuBLAS is NVIDIA's official BLAS (Basic Linear Algebra Subprograms) library, representing the de facto standard for GPU matrix operations.

Features

Feature	Description
Performance	Peak GPU matrix operation performance
Coverage	Supports all NVIDIA GPU architectures
Precision	FP32/FP16/TF32/INT8/FP64
Hardware	Tensor Core acceleration
API	Complete BLAS Level 1/2/3

Usage

cpp

#include <cublas_v2.h>

// Create handle
cublasHandle_t handle;
cublasCreate(&handle);

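// Execute GEMM: C = alpha * A * B + beta * C
// cuBLAS assumes column-major storage. For row-major A (M×K), B (K×N), C (M×N),
// compute C^T = B^T * A^T by swapping the operands and the M/N dimensions.
float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle,
    CUBLAS_OP_N, CUBLAS_OP_N,
    N, M, K,
    &alpha, d_B, N, d_A, K,
    &beta, d_C, N);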

// Destroy handle
cublasDestroy(handle);
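
cuBLAS follows the BLAS convention of column-major storage, so row-major C/C++ arrays need care with operand order and leading dimensions. The CPU-only sketch below illustrates the usual workaround: a row-major product C = A·B is obtained by asking a column-major GEMM for C^T = B^T·A^T. `gemm_colmajor` and `gemm_rowmajor` are hypothetical helpers written for this illustration; they only mimic the argument order of a non-transposed SGEMM and are not part of cuBLAS or of this project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference with column-major semantics:
// C (m×n) = alpha * A (m×k) * B (k×n) + beta * C.
void gemm_colmajor(int m, int n, int k, float alpha,
                   const float* A, int lda,
                   const float* B, int ldb,
                   float beta, float* C, int ldc) {
    // Leading dimensions must cover the row count of each column-major matrix.
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}

// Row-major C (M×N) = A (M×K) * B (K×N), computed as C^T = B^T * A^T:
// swap the operands and pass (N, M, K) with leading dimensions N, K, N.
std::vector<float> gemm_rowmajor(int M, int N, int K,
                                 const std::vector<float>& A,
                                 const std::vector<float>& B) {
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    gemm_colmajor(N, M, K, 1.0f, B.data(), N, A.data(), K, 0.0f, C.data(), N);
    return C;
}
```

A row-major buffer read as column-major is the transpose of the original matrix, which is why no data has to be moved: only the argument order changes.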

Relationship with cuBLAS

Positioning Differences

Aspect	cuBLAS	This Project
Goal	Production performance	Educational understanding
Code	Closed source	Open source, readable
Optimization	Auto-selects the best kernel	Manual, step-by-step optimizations
API	Standard BLAS	Simplified interface

Performance Comparison (RTX 3080, 1024×1024)

Kernel	Time (ms)	vs cuBLAS
L1 Naive	15.23	10.2%
L2 Tiled	7.61	20.4%
L3 Coalesced	6.12	25.3%
L4 Double Buffer	3.85	40.8%
L5 Register Blocked	1.82	86.5%
L6 Fused	1.91	80.2%
L7 Vectorized	1.71	91.2%
cuBLAS	1.56	100%
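
To relate these timings to hardware throughput, the sketch below converts a GEMM time into GFLOP/s and recomputes the relative column. It assumes that "vs cuBLAS" means cuBLAS time divided by kernel time, which reproduces the L1 and L7 rows and lands within roughly 1.5 points of the others. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// A GEMM performs 2*M*N*K floating-point operations:
// each of the M*N outputs needs K multiplies and K adds.
double gemm_gflops(std::int64_t M, std::int64_t N, std::int64_t K, double time_ms) {
    assert(time_ms > 0.0);
    const double flops = 2.0 * static_cast<double>(M) *
                         static_cast<double>(N) * static_cast<double>(K);
    return flops / (time_ms * 1e-3) / 1e9;
}

// Speed relative to a baseline, in percent (100 means equal to the baseline).
double percent_of_baseline(double baseline_ms, double kernel_ms) {
    assert(kernel_ms > 0.0);
    return 100.0 * baseline_ms / kernel_ms;
}
```

For the cuBLAS row, `gemm_gflops(1024, 1024, 1024, 1.56)` comes out at roughly 1377 GFLOP/s.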

cuBLAS Optimization Techniques

cuBLAS is closed source; the techniques below are the commonly cited ones, not an exhaustive list:

1. Multiple Kernel Variants

cuBLAS maintains multiple kernel variants for different matrix sizes, auto-selecting at runtime:

cpp

// Pseudocode: thresholds and kernel names are illustrative
if (M < 128 && N < 128) {
    gemm_small();
} else if (M > 4096 && N > 4096) {
    gemm_large();
} else {
    gemm_medium();
}

2. Tensor Core

On supported GPUs, cuBLAS uses Tensor Core:

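FP16: each first-generation (Volta) Tensor Core performs one 4×4×4 matrix multiply-accumulate per clock
TF32 (Ampere and later): FP32 inputs are reduced to a 10-bit mantissa while keeping the 8-bit exponent; accumulation stays in FP32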
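
To get a feel for the precision cost of TF32, the sketch below approximates the format on the CPU by keeping the sign, the 8 exponent bits, and the top 10 of FP32's 23 mantissa bits. The helper name `to_tf32_truncated` is hypothetical, and plain truncation is an assumption made for simplicity; the rounding performed by the hardware may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");

// Approximate TF32 by zeroing the low 13 mantissa bits of an FP32 value.
// Assumption: truncation toward zero, not the hardware's rounding mode.
float to_tf32_truncated(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;  // keep sign, 8 exponent bits, top 10 mantissa bits
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}
```

With 10 mantissa bits, the smallest step above 1.0 is 2^-10, so an increment of 2^-11 is lost while 2^-10 survives.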

3. Pipelining

cuBLAS uses software pipelining to hide latency:

cuda

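// Pseudocode: prefetch tile k+1 while computing on tile k
load_async(tile[0]);
for (int k = 0; k < NUM_K_TILES; k++) {
    if (k + 1 < NUM_K_TILES) {
        load_async(tile[(k + 1) % PIPELINE_DEPTH]);
    }
    wait_for(tile[k % PIPELINE_DEPTH]);
    compute(tile[k % PIPELINE_DEPTH]);
}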
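
The control flow of a two-stage pipeline can be tried out on the CPU. The sketch below is only a structural analogy: `load_tile` stands in for an asynchronous global-to-shared copy but runs synchronously here, so nothing actually overlaps. It shows the prologue, the prefetch of the next tile, and the alternation between two buffers. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 4;

// Stand-in for an asynchronous copy; zero-pads a partial last tile.
static void load_tile(const std::vector<float>& src, std::size_t tile, float* dst) {
    const std::size_t begin = tile * TILE;
    assert(begin < src.size());
    const std::size_t end = std::min(begin + TILE, src.size());
    std::fill(dst, dst + TILE, 0.0f);
    std::copy(src.begin() + static_cast<std::ptrdiff_t>(begin),
              src.begin() + static_cast<std::ptrdiff_t>(end), dst);
}

// Sums all elements tile by tile using two alternating buffers.
float pipelined_sum(const std::vector<float>& data) {
    const std::size_t num_tiles = (data.size() + TILE - 1) / TILE;
    if (num_tiles == 0) return 0.0f;
    float buffers[2][TILE];
    float acc = 0.0f;
    load_tile(data, 0, buffers[0]);  // prologue: fill the first stage
    for (std::size_t t = 0; t < num_tiles; ++t) {
        const std::size_t cur = t % 2;
        const std::size_t nxt = (t + 1) % 2;
        if (t + 1 < num_tiles) {
            load_tile(data, t + 1, buffers[nxt]);  // prefetch the next tile
        }
        for (std::size_t i = 0; i < TILE; ++i) {
            acc += buffers[cur][i];  // compute on the current tile
        }
    }
    return acc;
}
```

On a GPU the prefetch would be issued asynchronously, so the copy of tile t+1 proceeds while the arithmetic on tile t runs; that overlap is what hides memory latency.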

4. Tuned Block Parameters

For each GPU architecture, cuBLAS ships kernels with pre-tuned block parameters. Representative values:

GPU Arch	BM	BN	BK	TM	TN
Volta	128	128	32	8	8
Ampere	128	128	16	8	4
Hopper	128	128	16	8	4
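
The following sketch shows what such parameters imply for a kernel launch. It assumes the usual reading of the column names, which the table does not spell out: BM/BN/BK are the block tile sizes along M, N, and K, and TM/TN are the per-thread output tile sizes. It also assumes FP32 operands and a single shared-memory stage. `TileConfig` and both functions are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

struct TileConfig {
    int BM, BN, BK;  // block tile sizes along M, N, K (assumed meaning)
    int TM, TN;      // per-thread output tile sizes along M, N (assumed meaning)
};

// One thread computes a TM×TN sub-tile of the BM×BN output tile.
int threads_per_block(const TileConfig& c) {
    assert(c.BM % c.TM == 0 && c.BN % c.TN == 0);
    return (c.BM / c.TM) * (c.BN / c.TN);
}

// Shared memory for one A tile (BM×BK) plus one B tile (BK×BN) in FP32.
std::size_t smem_bytes(const TileConfig& c) {
    return static_cast<std::size_t>(c.BM * c.BK + c.BK * c.BN) * sizeof(float);
}
```

Under these assumptions the Volta row gives 256 threads and 32 KiB of shared memory per block, and the Ampere row gives 512 threads and 16 KiB.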

Learnable Points from This Project

1. Performance Analysis Methods

Profile the cuBLAS kernels with Nsight Compute and compare their metrics against this project's kernels:

bash

# Analyze cuBLAS kernel
ncu --set full ./benchmark --kernel=cublas

2. Parameter Tuning Approach

This project's AutoTuner is modeled on the cuBLAS approach of tuning block parameters per architecture:

cpp

// Search optimal parameters
AutoTuner tuner;
tuner.add_param("BLOCK_M", {64, 128, 256});
tuner.add_param("BLOCK_N", {64, 128, 256});
tuner.add_param("BLOCK_K", {8, 16, 32});
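
One way such a tuner could work is an exhaustive grid search over the registered values. The class below is a minimal sketch, not the project's implementation: only `add_param` appears in the snippet above, while `configurations`, `best`, and the cost callback are assumptions introduced for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal tuner; everything except add_param() is assumed.
class AutoTuner {
public:
    using Config = std::map<std::string, int>;

    void add_param(const std::string& name, std::vector<int> values) {
        assert(!values.empty());
        params_.emplace_back(name, std::move(values));
    }

    // Enumerate the Cartesian product of all registered parameter values.
    std::vector<Config> configurations() const {
        std::vector<Config> out{Config{}};
        for (const auto& p : params_) {
            std::vector<Config> next;
            for (const auto& partial : out) {
                for (int v : p.second) {
                    Config c = partial;
                    c[p.first] = v;
                    next.push_back(std::move(c));
                }
            }
            out = std::move(next);
        }
        return out;
    }

    // Return the configuration with the lowest cost, e.g. a measured kernel time.
    template <typename CostFn>
    Config best(CostFn cost) const {
        const std::vector<Config> configs = configurations();
        Config best_cfg = configs.front();
        double best_cost = cost(best_cfg);
        for (const auto& c : configs) {
            const double t = cost(c);
            if (t < best_cost) {
                best_cost = t;
                best_cfg = c;
            }
        }
        return best_cfg;
    }

private:
    std::vector<std::pair<std::string, std::vector<int>>> params_;
};
```

With the three parameters above, `configurations()` yields 27 candidates. In practice the cost callback would launch and time the kernel for each candidate, and invalid combinations (for example, tiles that exceed shared memory) would be filtered out first.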

3. Multi-version Strategy

Like cuBLAS, the project can maintain multiple kernel versions and dispatch among them by problem shape:

cpp

if (K < 64) {
    gemm_small_k();
} else {
    gemm_general();
}

When to Use cuBLAS

Use cuBLAS:

Production environments
Workloads that need peak performance
Standard BLAS operations

Use This Project:

Learning GPU programming
Understanding optimization principles
Custom kernel development
Interview preparation

Summary

Aspect	cuBLAS	This Project
Performance	Highest	Up to ~91% of cuBLAS (L7 kernel)
Readability	Closed source	Fully open
Learning value	Low	High
Production ready	Yes	Educational use

Recommendation: learn the principles through this project first, then read the CUTLASS source code, and use cuBLAS in production.
