Tuning strategies, benchmarks, and best practices.
Table of Contents
Performance Overview
Benchmark Results
  NVIDIA RTX 3090 (Ampere, 936 GB/s)
  GPU Architecture Comparison
Kernel Selection Strategy
  Kernel Types
Optimization Guide
  1. Use Auto-Config (Recommended)
  2. Format Selection
  3. Memory Optimization
Profiling Tools
Troubleshooting
  Performance Not as Expected?
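Performance Overview
GPU SpMV achieves high memory-bandwidth utilization by choosing a kernel that matches each matrix's sparsity pattern:
Target bandwidth utilization: > 60%
Measured peak: 70-75% (RTX 3090)
Scalability: Linear growth with matrix size (utilization stays near 70% from 500K to 50M non-zeros)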
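The utilization figures relate measured kernel time to the GPU's theoretical bandwidth. The sketch below shows one way that ratio could be computed for CSR SpMV. The function names are hypothetical (not part of the library), and the byte count assumes float values, 32-bit indices, and that every array is read or written exactly once; the library's own accounting may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Estimated bytes moved by one CSR SpMV (y = A * x).
// Assumption: float values, 32-bit indices, each array touched exactly once.
double csr_spmv_bytes(std::size_t rows, std::size_t cols, std::size_t nnz) {
    const double values  = static_cast<double>(nnz) * sizeof(float);              // matrix values
    const double col_idx = static_cast<double>(nnz) * sizeof(std::int32_t);       // column indices
    const double row_ptr = static_cast<double>(rows + 1) * sizeof(std::int32_t);  // row offsets
    const double x_read  = static_cast<double>(cols) * sizeof(float);             // input vector
    const double y_write = static_cast<double>(rows) * sizeof(float);             // output vector
    return values + col_idx + row_ptr + x_read + y_write;
}

// Fraction of peak bandwidth achieved (1.0 == 100%).
// seconds is the measured kernel time; peak_gb_per_s is the theoretical bandwidth.
double bandwidth_utilization(double bytes, double seconds, double peak_gb_per_s) {
    const double achieved_gb_per_s = bytes / seconds / 1e9;
    return achieved_gb_per_s / peak_gb_per_s;
}
```

For example, `bandwidth_utilization(csr_spmv_bytes(rows, cols, nnz), kernel_seconds, 936.0)` would give the utilization on an RTX 3090, where `kernel_seconds` is a timing you measure yourself.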
Benchmark Results
NVIDIA RTX 3090 (Ampere, 936 GB/s)
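| Matrix Size | Non-zeros | Kernel | Bandwidth Utilization |
|-------------|-----------|--------|-----------------------|
| 10K × 10K | 500K | Vector CSR | 70.2% |
| 100K × 100K | 5M | Merge Path | 71.5% |
| 1M × 1M | 50M | Merge Path | 70.8% |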
GPU Architecture Comparison
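| GPU Architecture | Theoretical Bandwidth | Measured Utilization |
|------------------|-----------------------|----------------------|
| Volta (V100) | 900 GB/s | ~65% |
| Turing (RTX 2080) | 448 GB/s | ~68% |
| Ampere (RTX 3090) | 936 GB/s | ~70% |
| Ada Lovelace (RTX 4090) | 1008 GB/s | ~72% |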
Kernel Selection Strategy
Matrix Analysis
│
├── avg nnz/row < 4 ──→ Scalar CSR (1 thread/row)
│
├── avg nnz/row ≥ 4
│ │
│ ├── skewness < 10 ──→ Vector CSR (warp/row)
│ │
│ └── skewness ≥ 10 ──→ Merge Path (balanced)
│
└── ELL format ─────────→ ELL Kernel (coalesced)
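The decision tree above can be sketched as a small host-side function. This is an illustration only: `KernelKind` and `select_kernel` are hypothetical names, and "skewness" is assumed here to mean the longest row divided by the average row length, which may not match the library's definition.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum class KernelKind { ScalarCsr, VectorCsr, MergePath, Ell };

// row_ptr is the CSR row-offset array (size rows + 1).
// Assumption: skewness = longest row / average row length.
KernelKind select_kernel(const std::vector<int>& row_ptr, bool is_ell_format) {
    if (is_ell_format) return KernelKind::Ell;

    const std::size_t rows = row_ptr.empty() ? 0 : row_ptr.size() - 1;
    if (rows == 0) return KernelKind::ScalarCsr;  // empty matrix: nothing to balance

    const int nnz = row_ptr.back() - row_ptr.front();
    const double avg = static_cast<double>(nnz) / rows;
    if (avg < 4.0) return KernelKind::ScalarCsr;  // very sparse: 1 thread per row

    int max_len = 0;
    for (std::size_t r = 0; r < rows; ++r)
        max_len = std::max(max_len, row_ptr[r + 1] - row_ptr[r]);

    const double skewness = max_len / avg;
    return skewness < 10.0 ? KernelKind::VectorCsr   // balanced rows: warp per row
                           : KernelKind::MergePath;  // skewed rows: balanced work split
}
```

The thresholds (4 and 10) are taken from the tree; only the skewness formula is an assumption.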
Kernel Types
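| Kernel | Best For | Bandwidth Utilization |
|--------|----------|-----------------------|
| Scalar CSR | Very sparse | ~40-50% |
| Vector CSR | Uniform distribution | ~65-75% |
| Merge Path | High skewness | ~70-80% |
| ELL | Uniform rows | ~80-90% |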
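When comparing kernels it helps to have a host-side reference to check results against. The sketch below is a plain sequential CSR SpMV; `spmv_csr_reference` is a hypothetical helper, not a library function. The comments note how the CSR kernels divide the same loops, as a rough mental model rather than a description of the actual GPU code.

```cpp
#include <cstddef>
#include <vector>

// Sequential reference: y = A * x for A stored in CSR form.
std::vector<float> spmv_csr_reference(const std::vector<int>& row_ptr,
                                      const std::vector<int>& col_idx,
                                      const std::vector<float>& values,
                                      const std::vector<float>& x) {
    const std::size_t rows = row_ptr.empty() ? 0 : row_ptr.size() - 1;
    std::vector<float> y(rows, 0.0f);
    // Scalar CSR: each iteration of this outer loop would be one GPU thread.
    for (std::size_t r = 0; r < rows; ++r) {
        float sum = 0.0f;
        // Vector CSR: this inner loop would be shared by the threads of a warp.
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            sum += values[k] * x[col_idx[k]];
        y[r] = sum;
    }
    return y;
}
```

Comparing GPU output against this reference with a small floating-point tolerance is a simple correctness check, since GPU kernels may sum each row in a different order.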
Optimization Guide
1. Use Auto-Config (Recommended)
```cpp
// Library auto-selects optimal kernel
SpMVConfig config = spmv_auto_config(csr);
SpMVResult result = spmv_csr(csr, d_x, d_y, &config, n);
```
2. Format Selection
```cpp
// Convert to ELL when row lengths are uniform
if (row_length_variance < 0.2) {
    ELLMatrix* ell = ell_create(rows, cols, max_nnz_per_row);
    ell_from_csr(ell, csr);
    ell_to_gpu(ell);
    // ELL often performs better
}
```
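The snippet above does not show how `row_length_variance` is obtained. One plausible reading, assumed here, is a normalized measure such as the coefficient of variation (standard deviation of row lengths divided by their mean). The sketch below computes it from the CSR row offsets, along with the ELL width and fill ratio; `RowStats` and `row_length_stats` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct RowStats {
    double mean_len;       // average non-zeros per row
    double rel_variation;  // stddev / mean of row lengths (assumed metric)
    int max_len;           // longest row, i.e. the ELL width
    double ell_fill;       // nnz / (rows * max_len): share of ELL slots with real data
};

RowStats row_length_stats(const std::vector<int>& row_ptr) {
    RowStats s{0.0, 0.0, 0, 1.0};
    const std::size_t rows = row_ptr.empty() ? 0 : row_ptr.size() - 1;
    if (rows == 0) return s;

    double sum = 0.0, sum_sq = 0.0;
    for (std::size_t r = 0; r < rows; ++r) {
        const int len = row_ptr[r + 1] - row_ptr[r];
        sum += len;
        sum_sq += static_cast<double>(len) * len;
        s.max_len = std::max(s.max_len, len);
    }
    s.mean_len = sum / rows;
    // Population variance; clamp tiny negative values caused by rounding.
    const double variance = std::max(0.0, sum_sq / rows - s.mean_len * s.mean_len);
    if (s.mean_len > 0.0) s.rel_variation = std::sqrt(variance) / s.mean_len;
    if (s.max_len > 0) s.ell_fill = sum / (static_cast<double>(rows) * s.max_len);
    return s;
}
```

A low `ell_fill` means most ELL slots are padding, so the conversion would spend memory and bandwidth on zeros even if the variation check passes.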
3. Memory Optimization
```cpp
// ✅ Use RAII for automatic management
CudaBuffer<float> buffer(n);

// ❌ Avoid manual allocation
float* ptr;
cudaMalloc(&ptr, n * sizeof(float));
// Easy to forget cudaFree
```
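To illustrate why the RAII form is safer, here is a minimal sketch of the ownership pattern that a wrapper like `CudaBuffer` presumably follows. It is not the library's implementation. `DeviceBuffer`, `device_alloc`, and `device_free` are hypothetical names, and the allocator functions are host-memory stand-ins so the sketch compiles without the CUDA toolkit; in real code they would wrap `cudaMalloc` and `cudaFree`.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <utility>

// Hypothetical stand-ins: with CUDA these would wrap cudaMalloc / cudaFree.
inline void* device_alloc(std::size_t bytes) { return std::malloc(bytes); }
inline void device_free(void* p) { std::free(p); }

template <typename T>
class DeviceBuffer {
public:
    explicit DeviceBuffer(std::size_t count)
        : ptr_(static_cast<T*>(device_alloc(count * sizeof(T)))), count_(count) {
        if (count != 0 && ptr_ == nullptr) throw std::bad_alloc();
    }
    // Released automatically on scope exit, including when an exception unwinds.
    ~DeviceBuffer() { device_free(ptr_); }

    // Copying is disabled so two owners can never free the same allocation.
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    // Moving transfers ownership and leaves the source empty.
    DeviceBuffer(DeviceBuffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          count_(std::exchange(other.count_, 0)) {}
    DeviceBuffer& operator=(DeviceBuffer&& other) noexcept {
        if (this != &other) {
            device_free(ptr_);
            ptr_ = std::exchange(other.ptr_, nullptr);
            count_ = std::exchange(other.count_, 0);
        }
        return *this;
    }

    T* data() const noexcept { return ptr_; }
    std::size_t size() const noexcept { return count_; }

private:
    T* ptr_;
    std::size_t count_;
};
```

The destructor runs on every exit path, which removes the "forgot to free" failure mode of the manual version.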
Profiling Tools
```bash
# Nsight Systems - overall analysis
nsys profile ./spmv_benchmark

# Nsight Compute - detailed kernel analysis
# (--kernel-name matches the exact kernel name; use regex:<expression> to match by pattern)
ncu --kernel-name spmv ./spmv_benchmark

# Built-in benchmark
./build-release/spmv_benchmark
```
Troubleshooting
Performance Not as Expected?
Checklist:
✅ Using spmv_auto_config()
✅ Matrix transferred to GPU (csr_to_gpu)
✅ Input vectors on GPU
✅ Matrix large enough (>10K non-zeros)
Full benchmark data is available in the benchmarks/ directory.