
Design Philosophy

Core Principles

1. Memory-Bandwidth Awareness

SpMV is fundamentally memory-bound. Our design prioritizes:

Key Insight: For SpMV on modern GPUs, memory bandwidth, not arithmetic, is the bottleneck. Our kernels are designed to maximize memory throughput rather than compute throughput.
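To make the memory-bound claim concrete, the sketch below estimates the arithmetic intensity (FLOPs per byte) of one CSR SpMV. The function name `csr_arithmetic_intensity` is hypothetical and not part of the library. It assumes each array is moved exactly once, which is the best case for the input vector `x`.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the library): estimates FLOPs per byte
// for y = A * x in CSR, assuming every array is read or written exactly once.
double csr_arithmetic_intensity(std::size_t num_rows, std::size_t num_cols,
                                std::size_t nnz,
                                std::size_t value_bytes = sizeof(float),
                                std::size_t index_bytes = sizeof(int)) {
    // One multiply and one add per stored nonzero.
    const double flops = 2.0 * static_cast<double>(nnz);

    const double bytes =
        static_cast<double>(nnz) * (value_bytes + index_bytes)  // values + column indices
        + static_cast<double>(num_rows + 1) * index_bytes       // row pointers
        + static_cast<double>(num_cols) * value_bytes           // input vector x
        + static_cast<double>(num_rows) * value_bytes;          // output vector y

    return flops / bytes;
}
```

With 32-bit values and indices, each nonzero costs at least 8 bytes for 2 FLOPs, so the result always stays below 0.25 FLOP/byte. A kernel at that intensity is limited by how fast data arrives, not by how fast it can multiply.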

2. Adaptive Computation

No single kernel is optimal for all matrices. Our adaptive selection is based on:

| Matrix Characteristic | Optimal Kernel | Selection Criterion |
|---|---|---|
| avg_nnz < 4 | Scalar CSR | Low parallelism per row |
| uniform distribution | Vector CSR | Consistent warp utilization |
| high skewness | Merge Path | Perfect work partitioning |
| ELL-convertible | ELL Kernel | Coalesced memory access |

Selection Algorithm:

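```cpp
SpMVKernel select_kernel(const CSRMatrix* csr) {
    // Guard against division by zero on an empty matrix.
    if (csr->num_rows == 0) {
        return KERNEL_SCALAR_CSR;
    }

    // Mean number of nonzeros per row.
    double avg_nnz = (double)csr->nnz / csr->num_rows;

    if (avg_nnz < 4.0) {
        return KERNEL_SCALAR_CSR;  // Low parallelism
    }

    // Measure of how unevenly nonzeros are spread across rows.
    double skewness = compute_skewness(csr);

    if (skewness < 10.0) {
        return KERNEL_VECTOR_CSR;  // Balanced rows
    }

    return KERNEL_MERGE_PATH;      // Irregular patterns
}
```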
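The selection algorithm relies on `compute_skewness`, which this page does not define. The sketch below shows one plausible metric, assumed here purely for illustration: the longest row divided by the mean row length. The names `CsrRowView` and `row_length_skewness` are hypothetical, and the library's actual metric may differ.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical minimal CSR view holding only what this sketch needs.
struct CsrRowView {
    int num_rows;
    int nnz;
    std::vector<int> row_ptr;  // size num_rows + 1
};

// Assumed metric (not confirmed by the library): max row length / mean row length.
// A perfectly uniform matrix yields 1.0; a few very long rows push it higher.
double row_length_skewness(const CsrRowView& m) {
    if (m.num_rows <= 0 || m.nnz <= 0) {
        return 0.0;  // nothing to measure
    }
    int max_len = 0;
    for (int i = 0; i < m.num_rows; ++i) {
        max_len = std::max(max_len, m.row_ptr[i + 1] - m.row_ptr[i]);
    }
    const double mean_len = static_cast<double>(m.nnz) / m.num_rows;
    return max_len / mean_len;
}
```

Under this definition, a threshold such as 10.0 would mean "some row is at least ten times longer than average", which is the situation where one warp per row starts to leave most warps waiting on a few long rows.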

3. Minimal Governance

The project now favors a smaller maintenance surface:

  • Keep the public API narrow and focused on core SpMV operations.
  • Put validation in tests and examples instead of parallel process frameworks.
  • Avoid shipping showcase-only modules inside the library itself.

Kernel Design Trade-offs

Scalar CSR vs Vector CSR

| Aspect | Scalar CSR | Vector CSR |
|---|---|---|
| Parallelism | One thread per row | One warp per row |
| Memory Access | Uncoalesced | Partially coalesced |
| Best For | Very sparse matrices | Uniform sparsity |
| Overhead | Low | Medium |
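The difference between the two strategies is easiest to see in how a single row is reduced. The host-side sketch below emulates both: it is not the library's kernel code, and `CsrHost`, `row_dot_scalar`, and `row_dot_vector` are hypothetical names. The 32 "lanes" stand in for the threads of one warp.

```cpp
#include <vector>

constexpr int kWarpSize = 32;

// Hypothetical host-side CSR container for this sketch.
struct CsrHost {
    std::vector<int> row_ptr;
    std::vector<int> col_idx;
    std::vector<float> values;
};

// Scalar CSR: a single thread walks the whole row sequentially.
float row_dot_scalar(const CsrHost& a, const std::vector<float>& x, int row) {
    float sum = 0.0f;
    for (int j = a.row_ptr[row]; j < a.row_ptr[row + 1]; ++j) {
        sum += a.values[j] * x[a.col_idx[j]];
    }
    return sum;
}

// Vector CSR: 32 lanes stride through the row, then combine partial sums.
float row_dot_vector(const CsrHost& a, const std::vector<float>& x, int row) {
    float partial[kWarpSize] = {};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        // Lane L handles entries L, L + 32, L + 64, ... of the row.
        for (int j = a.row_ptr[row] + lane; j < a.row_ptr[row + 1]; j += kWarpSize) {
            partial[lane] += a.values[j] * x[a.col_idx[j]];
        }
    }
    // Tree reduction, mirroring a warp shuffle-down reduction on the GPU.
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        for (int lane = 0; lane < offset; ++lane) {
            partial[lane] += partial[lane + offset];
        }
    }
    return partial[0];
}
```

In the vector version, neighbouring lanes touch neighbouring entries of `values` and `col_idx`, which is why access is partially coalesced. When a row has only a few nonzeros, most lanes do no work, which is the overhead noted in the table.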

Merge Path Algorithm

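The Merge Path algorithm balances load on irregular matrices by giving every thread an equal share of the combined work (row boundaries plus nonzeros), regardless of how the nonzeros are distributed across rows.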
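The core of the technique is a binary search along a diagonal of the conceptual merge between the row-end offsets and the nonzero indices. The sketch below follows the published merge-based SpMV formulation. The names `MergeCoord` and `merge_path_search` are illustrative and may not match the library's internals.

```cpp
#include <algorithm>

// Position on the merge path: rows fully consumed and nonzeros consumed.
struct MergeCoord {
    int row;
    int nz;
};

// Finds where the given diagonal crosses the merge path.
// row_end_offsets points at row_ptr + 1, so row_end_offsets[r] is the index
// one past the last nonzero of row r.
MergeCoord merge_path_search(int diagonal, const int* row_end_offsets,
                             int num_rows, int nnz) {
    int lo = std::max(diagonal - nnz, 0);
    int hi = std::min(diagonal, num_rows);
    while (lo < hi) {
        const int pivot = (lo + hi) / 2;
        // Has row `pivot` already ended before the nonzero paired with it
        // on this diagonal? If so, the crossing point lies further down.
        if (row_end_offsets[pivot] <= diagonal - pivot - 1) {
            lo = pivot + 1;
        } else {
            hi = pivot;
        }
    }
    return MergeCoord{lo, diagonal - lo};
}
```

With `T` threads and `total = num_rows + nnz` work items, thread `t` would search diagonal `min(t * items_per_thread, total)` for its start and the next diagonal for its end, where `items_per_thread = ceil(total / T)`. Each thread then processes the same number of items whether they come from one long row or many short ones.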

ELL Format

For matrices with uniform row lengths, ELL format enables fully coalesced memory access:

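Column-Major Layout:
values[k * num_rows + i] = A[i][col_idx[k * num_rows + i]]
(the k-th stored nonzero of row i)

Memory Access Pattern:
Thread i reads values[k * num_rows + i] for k = 0..num_cols-1, where num_cols is the padded row length (the number of ELL columns)
→ For each k, consecutive threads access consecutive memory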
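The layout above can be exercised with a small host-side reference. This is a sketch under assumptions: `EllHost` and `ell_spmv` are hypothetical names, `width` denotes the padded row length, and padding slots are marked with a column index of -1 (a common convention, not confirmed for this library).

```cpp
#include <vector>

// Hypothetical host-side ELL container, stored column-major and padded.
struct EllHost {
    int num_rows;
    int width;                  // padded row length (max nonzeros per row)
    std::vector<int> col_idx;   // size width * num_rows, -1 marks padding
    std::vector<float> values;  // size width * num_rows
};

// Reference y = A * x. On the GPU, the outer loop becomes one thread per row.
std::vector<float> ell_spmv(const EllHost& a, const std::vector<float>& x) {
    std::vector<float> y(a.num_rows, 0.0f);
    for (int i = 0; i < a.num_rows; ++i) {
        for (int k = 0; k < a.width; ++k) {
            // Slot k of row i. For fixed k, rows i and i + 1 are adjacent in memory.
            const int idx = k * a.num_rows + i;
            const int col = a.col_idx[idx];
            if (col >= 0) {
                y[i] += a.values[idx] * x[col];
            }
        }
    }
    return y;
}
```

For the 2x3 matrix `[[1, 0, 2], [0, 3, 0]]`, the padded storage is `col_idx = {0, 1, 2, -1}` and `values = {1, 3, 2, 0}` with `width = 2`: the first two entries hold slot 0 of rows 0 and 1, the last two hold slot 1.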

Error Handling Philosophy

We use semantic error codes instead of exceptions:

```cpp
typedef enum {
    SPMV_SUCCESS = 0,
    SPMV_ERROR_NULL_POINTER,
    SPMV_ERROR_INVALID_DIMENSIONS,
    SPMV_ERROR_CUDA_MALLOC,
    SPMV_ERROR_CUDA_MEMCPY,
    // ...
} SpMVError;
```

Benefits:

  • Performance: No exception overhead
  • Interoperability: C-compatible API
  • Debugging: Explicit error propagation
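A minimal sketch of what explicit propagation can look like in practice. The enum repeats only the codes listed above. The macro `SPMV_TRY` and the functions `validate_inputs`, `spmv_checked`, and `spmv_error_string` are hypothetical illustrations, not the library's API.

```cpp
// Mirrors the error codes listed above; elided codes are left out.
typedef enum {
    SPMV_SUCCESS = 0,
    SPMV_ERROR_NULL_POINTER,
    SPMV_ERROR_INVALID_DIMENSIONS,
    SPMV_ERROR_CUDA_MALLOC,
    SPMV_ERROR_CUDA_MEMCPY
} SpMVError;

// Hypothetical helper: return to the caller on the first failing step.
#define SPMV_TRY(expr)                          \
    do {                                        \
        SpMVError err_ = (expr);                \
        if (err_ != SPMV_SUCCESS) return err_;  \
    } while (0)

// Hypothetical validation step run before any GPU work.
SpMVError validate_inputs(const float* x, const float* y,
                          int num_rows, int num_cols) {
    if (x == nullptr || y == nullptr) return SPMV_ERROR_NULL_POINTER;
    if (num_rows <= 0 || num_cols <= 0) return SPMV_ERROR_INVALID_DIMENSIONS;
    return SPMV_SUCCESS;
}

// Hypothetical entry point: each step either succeeds or hands its code upward.
SpMVError spmv_checked(const float* x, float* y, int num_rows, int num_cols) {
    SPMV_TRY(validate_inputs(x, y, num_rows, num_cols));
    // Allocation, copies, and the kernel launch would follow here,
    // each wrapped in SPMV_TRY in the same way.
    return SPMV_SUCCESS;
}

// Hypothetical helper for log messages.
const char* spmv_error_string(SpMVError e) {
    switch (e) {
        case SPMV_SUCCESS:                  return "success";
        case SPMV_ERROR_NULL_POINTER:       return "null pointer";
        case SPMV_ERROR_INVALID_DIMENSIONS: return "invalid dimensions";
        case SPMV_ERROR_CUDA_MALLOC:        return "cudaMalloc failed";
        case SPMV_ERROR_CUDA_MEMCPY:        return "cudaMemcpy failed";
    }
    return "unknown error";
}
```

Because every failure is an ordinary return value, the caller sees exactly which step failed, and the same functions can be called from C or through a foreign-function interface.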

RAII Resource Management

All GPU resources are managed via CudaBuffer<T>:

```cpp
template<typename T>
class CudaBuffer {
public:
    CudaBuffer(size_t size);
    ~CudaBuffer();  // Automatic cudaFree

    T* device_ptr();
    void copy_from_host(const T* src);
    void copy_to_host(T* dst);

private:
    T* d_ptr_;
    size_t size_;
};
```

Because the destructor runs on every exit path, including early returns after an error, device memory is released without explicit cleanup code.
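The effect can be demonstrated without a GPU. The sketch below is a host-only stand-in, not the library's `CudaBuffer<T>`: `HostBuffer`, `run_with_buffers`, and the counter `g_live_allocations` are hypothetical, and `std::malloc`/`std::free` take the place of `cudaMalloc`/`cudaFree`. Copying is deleted here because a copied owner would free the same pointer twice. Whether the real class does the same is an assumption.

```cpp
#include <cstddef>
#include <cstdlib>

// Instrumentation for this sketch only: counts buffers not yet released.
static int g_live_allocations = 0;

// Host stand-in for an owning device buffer.
template <typename T>
class HostBuffer {
public:
    explicit HostBuffer(std::size_t size)
        : ptr_(static_cast<T*>(std::malloc(size * sizeof(T)))), size_(size) {
        if (ptr_ != nullptr) ++g_live_allocations;
    }

    ~HostBuffer() {
        if (ptr_ != nullptr) {
            std::free(ptr_);  // would be cudaFree for device memory
            --g_live_allocations;
        }
    }

    // Copying an owning handle would lead to a double free.
    HostBuffer(const HostBuffer&) = delete;
    HostBuffer& operator=(const HostBuffer&) = delete;

    T* ptr() { return ptr_; }
    std::size_t size() const { return size_; }

private:
    T* ptr_;
    std::size_t size_;
};

// Returns false on a simulated failure. Both buffers are released either way.
bool run_with_buffers(bool fail_midway) {
    HostBuffer<float> a(1024);
    HostBuffer<float> b(1024);
    if (fail_midway) {
        return false;  // early return: destructors of b, then a, still run
    }
    return true;
}
```

After `run_with_buffers(true)` returns, the live-allocation counter is back to zero even though the function exited through its error path, with no cleanup code at the return site.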

MIT License