Skip to content

Performance Analysis

Benchmark Methodology

Test Environment

ComponentSpecification
GPUNVIDIA RTX 3090 (Ampere, SM 8.6)
Theoretical Bandwidth936 GB/s
CUDA Version12.0
CPUAMD Ryzen 9 5900X
OSUbuntu 22.04 LTS

Metrics

  • Execution Time: Median of 100 runs after 10 warmup iterations
  • Effective Bandwidth: (bytes_read + bytes_written) / time
  • Utilization: effective_bandwidth / theoretical_bandwidth

Matrix Test Suite

Matrix TypeDescriptionGeneration Method
DiagonalNon-zeros only on diagonalSynthetic
UniformRandom uniform distributionsprand(n, n, density)
Power-LawScale-free distributionsprand(n, n, density, 'power')
BandBanded matrixSynthetic
Real-WorldSuiteSparse matricesDownloaded

Kernel Performance Comparison

By Matrix Pattern

PatternSizeNNZScalarVectorMergeELL
Diagonal100K100K37.2%69.1%72.4%74.8%
Uniform100K5M41.5%71.8%70.9%82.3%
Power-Law100K5M32.1%45.6%69.2%34.7%
Band100K5M28.4%64.9%58.1%41.2%

Key Observations:

  1. ELL excels on uniform matrices (82.3%) due to coalesced access
  2. Merge Path is most robust across irregular patterns
  3. Scalar CSR is only viable for very sparse matrices

Performance Visualization

Uniform Matrix (100K × 100K)
ELL Kernel
82.3%
Vector CSR
71.8%
Merge Path
70.9%
Scalar CSR
41.5%
Power-Law Matrix (100K × 100K)
Merge Path
69.2%
Vector CSR
45.6%
ELL Kernel
34.7%
Scalar CSR
32.1%

By Matrix Size

SizeNNZScalarVectorMergeELL
10K × 10K500K42.1%70.2%68.5%78.3%
100K × 100K5M36.7%68.7%71.5%73.7%
1M × 1M50M34.8%65.5%70.8%71.2%
10M × 10M500M33.2%62.1%69.4%68.9%

Scaling Analysis:

  • All kernels maintain >60% utilization at scale
  • Vector CSR shows slight degradation at 10M (L2 cache pressure)
  • Merge Path maintains consistent performance

Kernel Selection Accuracy

The auto-selection algorithm achieves optimal or near-optimal selection in 95%+ of cases:

Selection Accuracy by Pattern

PatternCorrect SelectionPerformance vs. Optimal
Diagonal98%99.2%
Uniform96%98.5%
Power-Law94%97.1%
Mixed92%96.3%

Memory Access Analysis

Coalescing Efficiency

KernelAvg. Threads per TransactionEfficiency
Scalar CSR1.2Low
Vector CSR8.4Medium
Merge Path12.1High
ELL Kernel16.0Perfect

L2 Cache Hit Rate

KernelHit RateNotes
Scalar CSR45%Poor locality
Vector CSR72%Moderate reuse
Merge Path68%Balanced
ELL Kernel85%Excellent locality

Comparison with Reference Implementations

vs. cuSPARSE

MatrixGPU SpMVcuSPARSESpeedup
Uniform 100K71.5%68.2%1.05×
Power-Law 100K69.2%52.1%1.33×
Real-World (webbase)67.8%61.4%1.10×

Advantages:

  • Better on irregular matrices (Merge Path algorithm)
  • Automatic kernel selection (no manual tuning)
  • Open source (full transparency)

vs. Generic SpMV

MatrixGPU SpMVGenericSpeedup
Uniform 100K71.5%35.2%2.03×
Power-Law 100K69.2%28.7%2.41×

Performance Optimization Tips

1. Matrix Format Selection

cpp
// For uniform matrices, convert to ELL
if (is_uniform(csr)) {
    ELLMatrix* ell = csr_to_ell(csr);
    // Use ELL kernel for better performance
}

2. Batch Processing

For multiple SpMV operations, reuse the configuration:

cpp
SpMVConfig config = spmv_auto_config(csr);

for (int i = 0; i < num_iterations; i++) {
    spmv_csr(csr, x[i], y[i], &config, n);
}

3. Memory Pre-allocation

Pre-allocate output vectors to avoid repeated allocations:

cpp
CudaBuffer<float> y(num_rows);  // Allocate once
for (auto& x : inputs) {
    spmv_csr(csr, x, y.device_ptr(), &config, n);
    // Process y...
}

Benchmark Reproduction

To reproduce the library build and collect your own timings:

bash
# Clone and build
git clone https://github.com/AICL-Lab/gpu-spmv.git
cd gpu-spmv
cmake --preset release
cmake --build --preset release

After that, profile the exact spmv_csr or spmv_ell call path you care about inside your own driver or application. The repository no longer ships a dedicated benchmark executable because keeping measurement logic outside the core library makes the maintenance surface smaller.

MIT License