vs cuBLAS
This document compares Mini-Inference Engine with NVIDIA cuBLAS in detail.
cuBLAS Overview
cuBLAS is NVIDIA's official BLAS (Basic Linear Algebra Subprograms) library and the de facto standard for GPU matrix operations.
Features
| Feature | Description |
|---|---|
| Performance | Peak GPU matrix operation performance |
| Coverage | Supports all NVIDIA GPU architectures |
| Precision | FP32/FP16/TF32/INT8/FP64 |
| Hardware | Tensor Core acceleration |
| API | Complete BLAS Level 1/2/3 |
Usage
#include <cublas_v2.h>
// Create handle
cublasHandle_t handle;
cublasCreate(&handle);
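// Execute GEMM: C = alpha * A * B + beta * C
// A (M×K), B (K×N), and C (M×N) are row-major device buffers. cuBLAS assumes
// column-major storage, so compute C^T = B^T * A^T by swapping the operands.
float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle,
            CUBLAS_OP_N, CUBLAS_OP_N,
            N, M, K,
            &alpha, d_B, N, d_A, K,
            &beta, d_C, N);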
// Destroy handle
cublasDestroy(handle);
Relationship with cuBLAS
Positioning Differences
| Aspect | cuBLAS | This Project |
|---|---|---|
| Goal | Production performance | Educational understanding |
| Code | Closed source | Open source, readable |
| Optimization | Auto-selects the best kernel | Optimizations shown step by step |
| API | Standard BLAS | Simplified interface |
Performance Comparison (RTX 3080, 1024×1024)
| Kernel | Time (ms) | Speed vs cuBLAS |
|---|---|---|
| L1 Naive | 15.23 | 10.2% |
| L2 Tiled | 7.61 | 20.4% |
| L3 Coalesced | 6.12 | 25.3% |
| L4 Double Buffer | 3.85 | 40.8% |
| L5 Register Blocked | 1.82 | 86.5% |
| L6 Fused | 1.91 | 80.2% |
| L7 Vectorized | 1.71 | 91.2% |
| cuBLAS | 1.56 | 100% |
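The timings above can be converted into throughput and relative speed with a little arithmetic. The sketch below assumes a GEMM costs 2·M·N·K floating-point operations and that the last column is the cuBLAS time divided by the kernel time (the L1 and L7 rows are consistent with that reading). The function names are hypothetical and are not part of this project's API.

```cpp
#include <cassert>

// FLOP count for C = A * B with A (M x K) and B (K x N):
// one multiply and one add per inner-product term.
double gemm_flops(double m, double n, double k) {
    return 2.0 * m * n * k;
}

// Achieved throughput in GFLOP/s for a kernel that ran for `ms` milliseconds.
double gflops(double m, double n, double k, double ms) {
    assert(ms > 0.0);
    return gemm_flops(m, n, k) / (ms * 1e-3) / 1e9;
}

// Speed relative to a baseline, in percent (100 means equally fast).
double relative_speed_pct(double baseline_ms, double kernel_ms) {
    assert(kernel_ms > 0.0);
    return 100.0 * baseline_ms / kernel_ms;
}
```

For example, `gflops(1024, 1024, 1024, 1.56)` gives roughly 1377 GFLOP/s for the cuBLAS row, and `relative_speed_pct(1.56, 1.71)` gives about 91.2, matching the L7 row.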
cuBLAS Optimization Techniques
cuBLAS uses techniques including but not limited to:
1. Multiple Kernel Variants
cuBLAS maintains multiple kernel variants for different matrix sizes, auto-selecting at runtime:
// Pseudocode
if (M < 128 && N < 128) {
    gemm_small();
} else if (M > 4096 && N > 4096) {
    gemm_large();
} else {
    gemm_medium();
}
2. Tensor Core
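On supported GPUs, cuBLAS uses Tensor Cores:
- FP16: each Volta Tensor Core performs one 4×4×4 matrix multiply-accumulate per clock
- TF32 (Ampere and later): FP32 inputs are rounded to a 10-bit mantissa, multiplied on Tensor Cores, and accumulated in FP32
3. Pipelining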
cuBLAS uses software pipelining to hide latency:
// Pseudocode
#pragma unroll
for (int i = 0; i < PIPELINE_DEPTH; i++) {
    load_async(next_data[i]);
    compute(current_data[i]);
}
4. Tuned Block Parameters
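For each GPU architecture, cuBLAS ships kernels with pre-tuned block parameters. Here BM×BN is the output tile computed by one thread block, BK is the depth of each K-slice, and TM×TN is the output tile computed by one thread. cuBLAS is closed source, so treat the values below as illustrative rather than official: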
| GPU Arch | BM | BN | BK | TM | TN |
|---|---|---|---|---|---|
| Volta | 128 | 128 | 32 | 8 | 8 |
| Ampere | 128 | 128 | 16 | 8 | 4 |
| Hopper | 128 | 128 | 16 | 8 | 4 |
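To see what such parameters imply for a kernel, the following sketch derives the launch footprint of one configuration. It assumes the usual tiling convention (one thread block per BM×BN output tile, one thread per TM×TN patch, and FP32 tiles of A and B staged in shared memory). `TileConfig` and the helper functions are hypothetical names introduced for illustration, not cuBLAS or project APIs.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one GEMM tiling configuration.
struct TileConfig {
    int bm, bn, bk;  // block tile: BM x BN outputs, BK-deep K-slices
    int tm, tn;      // per-thread tile: TM x TN outputs
};

// Threads per block when each thread owns a TM x TN patch of the block tile.
int threads_per_block(const TileConfig& c) {
    assert(c.tm > 0 && c.tn > 0);
    return (c.bm / c.tm) * (c.bn / c.tn);
}

// Shared memory for one A tile (BM x BK) plus one B tile (BK x BN) in FP32,
// without double buffering or padding.
std::size_t smem_bytes(const TileConfig& c) {
    return static_cast<std::size_t>(c.bm * c.bk + c.bk * c.bn) * sizeof(float);
}

// Accumulators held per thread, one FP32 value per output element.
int accumulators_per_thread(const TileConfig& c) {
    return c.tm * c.tn;
}

// The tile must divide evenly among threads, and the block must respect
// CUDA's limit of 1024 threads per block.
bool is_valid(const TileConfig& c) {
    return c.bm % c.tm == 0 && c.bn % c.tn == 0 &&
           threads_per_block(c) <= 1024;
}
```

With the Volta row (128, 128, 32, 8, 8) this gives 256 threads per block and 32 KiB of shared memory; the Ampere and Hopper rows give 512 threads and 16 KiB. A tuner can use checks like `is_valid` to discard configurations before benchmarking them.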
Learnable Points from This Project
1. Performance Analysis Methods
Use Nsight Compute to analyze cuBLAS:
# Analyze cuBLAS kernel
ncu --set full ./benchmark --kernel=cublas
2. Parameter Tuning Approach
This project's AutoTuner design follows cuBLAS's tuning approach:
// Search optimal parameters
AutoTuner tuner;
tuner.add_param("BLOCK_M", {64, 128, 256});
tuner.add_param("BLOCK_N", {64, 128, 256});
tuner.add_param("BLOCK_K", {8, 16, 32});
3. Multi-version Strategy
The project can maintain multiple kernel versions for different scenarios:
if (K < 64) {
    gemm_small_k();
} else {
    gemm_general();
}
When to Use cuBLAS
Use cuBLAS:
- Production environments
- Need peak performance
- Standard BLAS operations
Use This Project:
- Learning GPU programming
- Understanding optimization principles
- Custom kernel development
- Interview preparation
Summary
| Aspect | cuBLAS | This Project |
|---|---|---|
| Performance | Highest | Up to ~91% of cuBLAS (L7 Vectorized) |
| Readability | Closed source | Fully open |
| Learning value | Low | High |
| Production ready | Yes | Educational use |
Recommendation: Understand the principles through this project first, then read the CUTLASS source code, and finally use cuBLAS in production.