vs cuBLAS

This document compares Mini-Inference Engine with NVIDIA cuBLAS in detail.


cuBLAS Overview

cuBLAS is NVIDIA's official BLAS (Basic Linear Algebra Subprograms) library, representing the de facto standard for GPU matrix operations.

Features

| Feature | Description |
| --- | --- |
| Performance | Peak GPU matrix operation performance |
| Coverage | Supports all NVIDIA GPU architectures |
| Precision | FP32/FP16/TF32/INT8/FP64 |
| Hardware | Tensor Core acceleration |
| API | Complete BLAS Level 1/2/3 |

Usage

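```cpp
#include <cublas_v2.h>

// Create handle
cublasHandle_t handle;
cublasCreate(&handle);

// Execute GEMM: C = alpha * A * B + beta * C
// A (M x K), B (K x N), and C (M x N) are row-major device buffers.
// cuBLAS is column-major, so compute C^T = B^T * A^T by swapping the operands.
float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle,
    CUBLAS_OP_N, CUBLAS_OP_N,
    N, M, K,
    &alpha, d_B, N, d_A, K,
    &beta, d_C, N);

// Destroy handle
cublasDestroy(handle);
```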
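
cuBLAS assumes column-major storage, so a row-major product C = A·B is commonly obtained by requesting C^T = B^T·A^T, which amounts to swapping the two operands and the M and N dimensions. The sketch below checks that identity on the CPU. `gemm_colmajor` is a hypothetical stand-in for a column-major SGEMM without transposes, not a cuBLAS function, and `matmul_rowmajor` is likewise an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in (hypothetical) for a column-major SGEMM with no transposes:
// C (m x n) = alpha * A (m x k) * B (k x n) + beta * C, all column-major.
void gemm_colmajor(int m, int n, int k, float alpha,
                   const float* A, int lda,
                   const float* B, int ldb,
                   float beta, float* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}

// Row-major C (M x N) = A (M x K) * B (K x N), computed by asking the
// column-major routine for C^T = B^T * A^T. The argument order
// (N, M, K, B, N, A, K, C, N) is the one a row-major caller would pass
// to a column-major GEMM.
std::vector<float> matmul_rowmajor(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int M, int N, int K) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    gemm_colmajor(N, M, K, 1.0f, B.data(), N, A.data(), K, 0.0f, C.data(), N);
    return C;
}
```

The row-major buffers are never transposed in memory. A row-major M×K buffer read as column-major with leading dimension K is already the K×M transpose, which is why the leading dimensions stay K, N, and N.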

Relationship with cuBLAS

Positioning Differences

| Aspect | cuBLAS | This Project |
| --- | --- | --- |
| Goal | Production performance | Educational understanding |
| Code | Closed source | Open source, readable |
| Optimization | Auto-selects optimal | Manual step-by-step display |
| API | Standard BLAS | Simplified interface |

Performance Comparison (RTX 3080, 1024×1024)

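| Kernel | Time (ms) | vs cuBLAS |
| --- | --- | --- |
| L1 Naive | 15.23 | 10.2% |
| L2 Tiled | 7.61 | 20.4% |
| L3 Coalesced | 6.12 | 25.3% |
| L4 Double Buffer | 3.85 | 40.8% |
| L5 Register Blocked | 1.82 | 86.5% |
| L6 Fused | 1.91 | 80.2% |
| L7 Vectorized | 1.71 | 91.2% |
| cuBLAS | 1.56 | 100% |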
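
To put timings like these in context, it helps to convert them to throughput and to a percentage of the baseline. The helpers below are a minimal sketch: the function names are illustrative, and the percentage assumes the "vs cuBLAS" column is the ratio of cuBLAS time to kernel time. Published figures may come from separate runs and need not match that ratio exactly.

```cpp
#include <cassert>

// FLOPs for C = A * B with A (M x K) and B (K x N):
// one multiply and one add per inner-product term.
double gemm_flops(long long M, long long N, long long K) {
    return 2.0 * static_cast<double>(M) * static_cast<double>(N) *
           static_cast<double>(K);
}

// Achieved throughput in GFLOP/s for a kernel that took `ms` milliseconds.
double gflops(long long M, long long N, long long K, double ms) {
    assert(ms > 0.0);
    return gemm_flops(M, N, K) / (ms * 1e-3) / 1e9;
}

// Performance relative to a baseline, as a percentage (100 = same speed).
double percent_of_baseline(double kernel_ms, double baseline_ms) {
    assert(kernel_ms > 0.0);
    return 100.0 * baseline_ms / kernel_ms;
}
```

For example, `percent_of_baseline(1.71, 1.56)` evaluates to roughly 91.2, and `gflops(1024, 1024, 1024, 1.56)` to roughly 1377 GFLOP/s.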

cuBLAS Optimization Techniques

cuBLAS uses techniques including but not limited to:

1. Multiple Kernel Variants

cuBLAS maintains multiple kernel variants for different matrix sizes, auto-selecting at runtime:

```cpp
// Pseudocode: illustrative only; the actual selection heuristics are not public
if (M < 128 && N < 128) {
    gemm_small();
} else if (M > 4096 && N > 4096) {
    gemm_large();
} else {
    gemm_medium();
}
```

2. Tensor Core

On supported GPUs, cuBLAS uses Tensor Core:

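FP16: on Volta, each Tensor Core performs one 4×4×4 matrix multiply-accumulate per clock; later architectures process larger tiles per clock
TF32 (Ampere and later): FP32 inputs are reduced to a 10-bit mantissa for the multiply, and products are accumulated in FP32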
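
TF32 keeps FP32's sign bit and 8 exponent bits but only 10 of its 23 mantissa bits, so it trades precision for Tensor Core throughput. The sketch below emulates that input precision on the CPU to show what is lost. `to_tf32` and `dot_tf32` are hypothetical helpers, and plain truncation is an assumption made for simplicity; hardware rounding may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Emulate TF32 input precision: keep the sign, the 8 exponent bits, and the
// top 10 mantissa bits of an FP32 value. Truncation is a simplifying
// assumption, not a statement about how the hardware rounds.
float to_tf32(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;  // clear the low 13 mantissa bits
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Dot product with TF32-precision inputs and FP32 accumulation.
float dot_tf32(const float* a, const float* b, int n) {
    assert(n >= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += to_tf32(a[i]) * to_tf32(b[i]);
    }
    return acc;
}
```

With this emulation, `1.0f + 2^-10` survives the conversion unchanged, while `1.0f + 2^-11` collapses to `1.0f`.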

3. Pipelining

cuBLAS uses software pipelining to hide latency:

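```cuda
// Pseudocode: prefetch tile k + 1 while computing on tile k
load_async(tile[0]);
#pragma unroll
for (int k = 0; k < NUM_TILES; k++) {
    if (k + 1 < NUM_TILES) load_async(tile[k + 1]);
    wait_for(tile[k]);
    compute(tile[k]);
}
```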
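
The loop structure behind software pipelining (a prologue load, then prefetching the next tile while computing on the current one) can be sketched on the CPU with two staging buffers. This is only a structural illustration: on a CPU the load and the compute run sequentially, whereas on a GPU the load would be issued asynchronously. `TILE_K`, `load_tile`, `compute_tile`, and `gemm_double_buffered` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TILE_K = 4;  // hypothetical tile depth along K

// Stage one K-slice of row-major A (M x K) and B (K x N) into tile buffers.
// On a GPU, this is the step that would be issued asynchronously.
void load_tile(const std::vector<float>& A, const std::vector<float>& B,
               int M, int N, int K, int k0,
               std::vector<float>& a_tile, std::vector<float>& b_tile) {
    const int depth = std::min(TILE_K, K - k0);
    for (int i = 0; i < M; ++i)
        for (int p = 0; p < depth; ++p)
            a_tile[i * TILE_K + p] = A[i * K + k0 + p];
    for (int p = 0; p < depth; ++p)
        for (int j = 0; j < N; ++j)
            b_tile[p * N + j] = B[(k0 + p) * N + j];
}

// Accumulate the contribution of one staged K-slice into row-major C.
void compute_tile(const std::vector<float>& a_tile,
                  const std::vector<float>& b_tile,
                  int M, int N, int depth, std::vector<float>& C) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int p = 0; p < depth; ++p)
                C[i * N + j] += a_tile[i * TILE_K + p] * b_tile[p * N + j];
}

std::vector<float> gemm_double_buffered(const std::vector<float>& A,
                                        const std::vector<float>& B,
                                        int M, int N, int K) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    std::vector<float> a_buf[2] = {std::vector<float>(M * TILE_K),
                                   std::vector<float>(M * TILE_K)};
    std::vector<float> b_buf[2] = {std::vector<float>(TILE_K * N),
                                   std::vector<float>(TILE_K * N)};
    const int num_tiles = (K + TILE_K - 1) / TILE_K;
    if (num_tiles == 0) return C;

    // Prologue: stage the first tile before entering the main loop.
    load_tile(A, B, M, N, K, 0, a_buf[0], b_buf[0]);
    for (int t = 0; t < num_tiles; ++t) {
        const int cur = t % 2;
        const int nxt = 1 - cur;
        // Prefetch tile t + 1 into the other buffer, then compute on tile t.
        if (t + 1 < num_tiles)
            load_tile(A, B, M, N, K, (t + 1) * TILE_K, a_buf[nxt], b_buf[nxt]);
        const int depth = std::min(TILE_K, K - t * TILE_K);
        compute_tile(a_buf[cur], b_buf[cur], M, N, depth, C);
    }
    return C;
}
```

Alternating between two buffers means the prefetch never overwrites the tile that is still being consumed, which is the property that allows a real asynchronous copy to overlap with compute.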

4. Tuned Block Parameters

For each GPU architecture, cuBLAS has pre-tuned block parameters:

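| GPU Arch | BM | BN | BK | TM | TN |
| --- | --- | --- | --- | --- | --- |
| Volta | 128 | 128 | 32 | 8 | 8 |
| Ampere | 128 | 128 | 16 | 8 | 4 |
| Hopper | 128 | 128 | 16 | 8 | 4 |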
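
Block parameters like these determine the launch configuration and shared-memory footprint of a tiled kernel. The sketch below derives those quantities, assuming the usual tiling convention: BM×BN is the output tile per thread block, BK is the depth of each K-slice staged in shared memory, and TM×TN is the output tile per thread. `TileConfig` and the helper functions are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical container for one row of a block-parameter table.
struct TileConfig {
    int BM, BN, BK;  // block tile: BM x BN outputs, BK-deep K-slices
    int TM, TN;      // per-thread tile: TM x TN outputs
};

// The per-thread tile must evenly divide the block tile.
constexpr bool is_valid(const TileConfig& c) {
    return c.BM % c.TM == 0 && c.BN % c.TN == 0;
}

// Threads needed so that each thread owns one TM x TN output tile.
constexpr int threads_per_block(const TileConfig& c) {
    return (c.BM / c.TM) * (c.BN / c.TN);
}

// Shared memory for one staged A tile (BM x BK) plus one B tile (BK x BN).
constexpr std::size_t smem_bytes_per_stage(const TileConfig& c,
                                           std::size_t elem_size = sizeof(float)) {
    return (static_cast<std::size_t>(c.BM) * c.BK +
            static_cast<std::size_t>(c.BK) * c.BN) * elem_size;
}

// Accumulator registers each thread holds for its output tile.
constexpr int accumulators_per_thread(const TileConfig& c) {
    return c.TM * c.TN;
}
```

Under these assumptions, the configuration 128/128/32/8/8 works out to 256 threads and 32 KiB of FP32 shared memory per stage, and 128/128/16/8/4 to 512 threads and 16 KiB. Double buffering would double the shared-memory figure.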

Learnable Points from This Project

1. Performance Analysis Methods

Use Nsight Compute to analyze cuBLAS:

```bash
# Analyze cuBLAS kernel
ncu --set full ./benchmark --kernel=cublas
```

2. Parameter Tuning Approach

This project's AutoTuner is modeled on cuBLAS's tuning approach:

```cpp
// Define the search space for the block parameters
AutoTuner tuner;
tuner.add_param("BLOCK_M", {64, 128, 256});
tuner.add_param("BLOCK_N", {64, 128, 256});
tuner.add_param("BLOCK_K", {8, 16, 32});
```
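
A parameter search of this kind boils down to enumerating the Cartesian product of the candidate values and keeping the configuration with the lowest measured cost. The sketch below shows one way to do that. `ParamGrid` and `best_config` are hypothetical and are not the project's `AutoTuner`; only the `add_param` name mirrors the snippet above. The cost function is supplied by the caller and would normally be a measured kernel time.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Config = std::map<std::string, int>;

// Hypothetical exhaustive search space over named integer parameters.
class ParamGrid {
public:
    void add_param(const std::string& name, std::vector<int> values) {
        assert(!values.empty());
        params_.emplace_back(name, std::move(values));
    }

    // Enumerate the Cartesian product of all registered parameter values.
    std::vector<Config> enumerate() const {
        std::vector<Config> out(1);  // start from a single empty configuration
        for (const auto& param : params_) {
            std::vector<Config> next;
            for (const auto& partial : out) {
                for (int v : param.second) {
                    Config c = partial;
                    c[param.first] = v;
                    next.push_back(std::move(c));
                }
            }
            out = std::move(next);
        }
        return out;
    }

private:
    std::vector<std::pair<std::string, std::vector<int>>> params_;
};

// Return the configuration with the lowest cost, e.g. measured kernel time.
template <typename CostFn>
Config best_config(const ParamGrid& grid, CostFn cost) {
    const std::vector<Config> configs = grid.enumerate();
    assert(!configs.empty());
    std::size_t best = 0;
    double best_cost = cost(configs[0]);
    for (std::size_t i = 1; i < configs.size(); ++i) {
        const double c = cost(configs[i]);
        if (c < best_cost) {
            best_cost = c;
            best = i;
        }
    }
    return configs[best];
}
```

Three parameters with three candidates each give 27 configurations, which is small enough to benchmark exhaustively. Larger spaces usually need pruning, for example by rejecting configurations that exceed the shared-memory limit before timing them.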

3. Multi-version Strategy

A project can maintain multiple kernel versions for different scenarios, for example a dedicated path for small K:

```cpp
if (K < 64) {
    gemm_small_k();
} else {
    gemm_general();
}
```

When to Use cuBLAS

Use cuBLAS:

  • Production environments
  • Workloads that need peak performance
  • Standard BLAS operations

Use This Project:

  • Learning GPU programming
  • Understanding optimization principles
  • Custom kernel development
  • Interview preparation

Summary

| Aspect | cuBLAS | This Project |
| --- | --- | --- |
| Performance | Highest | ~85-90% of cuBLAS |
| Readability | Closed source | Fully open |
| Learning value | Low | High |
| Production ready | Yes | Educational use |

Recommendation: First understand the principles through this project, then read the CUTLASS source code, and finally use cuBLAS in production.

MIT License | CUDA GEMM optimization tutorial