Benchmarks

This section presents performance benchmarks for TensorCraft-HPC kernels compared against NVIDIA's optimized libraries (cuBLAS, cuDNN, cuSPARSE).

Overview

All benchmarks were measured with the following configuration:

| Parameter | Value |
|---|---|
| GPU | NVIDIA A100 80GB |
| CUDA Version | 12.4 |
| Data Type | FP16 (Tensor Core) |
| Measurements | Average of 100 runs |

Performance Summary

| Kernel | Reference | Relative Performance |
|---|---|---|
| GEMM (FP16) | cuBLAS | 92% |
| FlashAttention | cuDNN | 85% |
| LayerNorm | cuDNN | 95% |
| Conv2D | cuDNN | 78% |
| SpMV (CSR) | cuSPARSE | 88% |
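
The summary does not say how the percentages are derived. One plausible reading, assumed here, is the ratio of the reference library's average time to the TensorCraft-HPC kernel's average time on the same problem, so that 100% means parity. The function name below is hypothetical and is not part of the project's API; it is only a sketch of that arithmetic.

```cpp
#include <cassert>

// Assumption: "relative performance" = reference time / kernel time, in percent.
// Both arguments must use the same unit (for example, milliseconds per launch).
// A result below 100 means the kernel is slower than the reference library.
double relative_performance_percent(double reference_ms, double kernel_ms) {
    assert(reference_ms > 0.0 && kernel_ms > 0.0);
    return 100.0 * reference_ms / kernel_ms;
}
```

Under this definition, a kernel that takes twice as long as the reference scores 50%.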

Detailed Benchmarks

Benchmarking Philosophy

TensorCraft-HPC prioritizes readability and educational value over raw performance. Our benchmarks serve to:

  1. Validate correctness — Ensure optimized versions produce accurate results
  2. Demonstrate progress — Show improvement from naive to optimized
  3. Guide optimization — Identify bottlenecks and optimization opportunities
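
As an illustration of the first goal, a correctness check can compare a kernel's output element by element against the reference library's output. The sketch below is not the project's test code: the helper name `all_close` and the default tolerances are assumptions, chosen loosely because FP16 results carry limited precision.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true when every element satisfies |actual - expected| <= atol + rtol * |expected|.
// The default tolerances are illustrative assumptions, not values taken from the project.
bool all_close(const std::vector<float>& actual,
               const std::vector<float>& expected,
               float rtol = 1e-2f,
               float atol = 1e-3f) {
    assert(rtol >= 0.0f && atol >= 0.0f);
    if (actual.size() != expected.size()) return false;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const float diff = std::fabs(actual[i] - expected[i]);
        // Written as a negated <= so that a NaN on either side fails the check.
        if (!(diff <= atol + rtol * std::fabs(expected[i]))) return false;
    }
    return true;
}
```

Both buffers are assumed to have been copied back to host memory and converted to `float` before the comparison.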

Performance vs Readability

While we strive for competitive performance, we sometimes choose clearer code over marginal speed improvements. The goal is learning, not beating cuBLAS.

Running Your Own Benchmarks

```bash
# Build benchmarks
cmake --preset dev
cmake --build --preset dev

# Run GEMM benchmark
./build/dev/benchmarks/gemm_benchmark

# Run all benchmarks
ctest --preset dev -L benchmark
```

Benchmark Configuration

Benchmarks can be configured via environment variables:

```bash
# Set matrix size for GEMM benchmark
export TENSORCRAFT_BENCH_SIZE=4096

# Set number of warmup runs
export TENSORCRAFT_BENCH_WARMUP=10

# Set number of measured runs
export TENSORCRAFT_BENCH_RUNS=100
```
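
To make the roles of these variables concrete, here is a minimal host-side sketch of how a harness might read them and then average the measured runs after discarding the warmup runs. It is not the project's actual harness: `env_int_or`, `BenchConfig`, `load_config`, and `average_ms` are hypothetical names, and the fallback values simply mirror the example values above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <exception>
#include <string>

// Parse an integer environment variable; return `fallback` if it is unset,
// empty, not a clean integer, or smaller than `min_value`.
int env_int_or(const char* name, int fallback, int min_value) {
    const char* raw = std::getenv(name);
    if (raw == nullptr || *raw == '\0') return fallback;
    try {
        std::size_t pos = 0;
        const int value = std::stoi(raw, &pos);
        if (raw[pos] != '\0' || value < min_value) return fallback;
        return value;
    } catch (const std::exception&) {
        return fallback;  // not a number, or out of range for int
    }
}

struct BenchConfig {
    int size;    // matrix dimension for the GEMM benchmark
    int warmup;  // untimed runs executed first
    int runs;    // timed runs that are averaged
};

// Assumed defaults: they copy the example values shown above.
BenchConfig load_config() {
    return BenchConfig{
        env_int_or("TENSORCRAFT_BENCH_SIZE", 4096, 1),
        env_int_or("TENSORCRAFT_BENCH_WARMUP", 10, 0),
        env_int_or("TENSORCRAFT_BENCH_RUNS", 100, 1),
    };
}

// Run `run_once` `warmup` times untimed, then `runs` times timed, and return
// the average wall-clock time per timed run in milliseconds.
// For GPU work, `run_once` must block until the kernel has finished (for
// example by calling cudaDeviceSynchronize()); otherwise only the launch
// overhead is measured. CUDA events are the usual alternative.
template <typename Fn>
double average_ms(Fn&& run_once, int warmup, int runs) {
    assert(warmup >= 0 && runs > 0);
    for (int i = 0; i < warmup; ++i) run_once();
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) run_once();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / runs;
}
```

A harness along these lines would call `load_config()` once at startup and pass `warmup` and `runs` to `average_ms` for each kernel. Warmup runs matter because the first launches typically include one-time costs that would otherwise skew the average.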

Released under the Apache 2.0 License.