
Benchmark Methodology

How we measure and report performance.

Test Environment

Hardware

Component  Specification
GPU        NVIDIA RTX 4090 (24 GB VRAM)
CPU        Intel i9-13900K (24 cores)
RAM        64 GB DDR5-6000
Storage    NVMe SSD

Software

Component  Version
OS         Ubuntu 22.04 LTS
CUDA       12.4
Driver     550.x
OpenCV     4.8.0
GCC        11.4

Measurement Methodology

Warm-up

Before each measurement:

  1. Run the operation 10 times to warm up the GPU
  2. Call cudaDeviceSynchronize() to wait until all warm-up work has finished (this synchronizes the device; it does not clear any cache)

Timing

cpp
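// Requires <chrono> and <cuda_runtime.h>
auto start = std::chrono::high_resolution_clock::now();

// Run the operation N times; synchronizing after each call makes the
// host wait until the GPU has finished that run
for (int i = 0; i < iterations; i++) {
    operation();
    cudaDeviceSynchronize();
}

auto end = std::chrono::high_resolution_clock::now();

// Average time per operation, in milliseconds
auto avg_time = std::chrono::duration<double, std::milli>(end - start) / iterations;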
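
The warm-up and timing steps can be combined into one helper. The sketch below is an illustration under stated assumptions, not the project's actual harness: `average_latency_ms` is a hypothetical name, the operation and the synchronization call are passed in as callables (in a GPU benchmark the second would wrap cudaDeviceSynchronize()), and it uses std::chrono::steady_clock instead of high_resolution_clock because steady_clock is guaranteed to be monotonic.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `op` `warmup` times untimed, then times
// `iterations` runs. `sync` stands in for cudaDeviceSynchronize().
template <typename Op, typename Sync>
double average_latency_ms(Op&& op, Sync&& sync, int warmup, int iterations) {
    assert(iterations > 0);

    // Warm-up: untimed runs, then wait for them to finish.
    for (int i = 0; i < warmup; ++i) {
        op();
    }
    sync();

    // steady_clock is monotonic, so the measured interval cannot go negative.
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        op();
        sync();  // wait for this run to complete before starting the next
    }
    const auto end = std::chrono::steady_clock::now();

    // Convert the total to milliseconds and average over the timed runs.
    const std::chrono::duration<double, std::milli> total = end - start;
    return total.count() / iterations;
}
```

With 10 warm-up runs and N timed iterations, the operation is invoked 10 + N times, but only the N timed runs contribute to the average.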

Metrics

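  • Latency: Average wall-clock time per operation (total time / iterations)
  • Throughput: Operations per second (1 / latency in seconds)
  • Speedup: CPU time / GPU time for the same operation and image size; values above 1 mean the GPU is faster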
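
As a minimal sketch of how the derived metrics follow from a measured latency, the helpers below compute throughput and speedup. The function names are hypothetical, and latencies are assumed to be in milliseconds.

```cpp
#include <cassert>

// Throughput in operations per second, from average latency in milliseconds.
inline double throughput_ops_per_sec(double latency_ms) {
    assert(latency_ms > 0.0);
    return 1000.0 / latency_ms;
}

// Speedup of GPU over CPU; values above 1 mean the GPU is faster.
inline double speedup(double cpu_ms, double gpu_ms) {
    assert(gpu_ms > 0.0);
    return cpu_ms / gpu_ms;
}
```

For illustration only (these are not measured results): a latency of 2 ms corresponds to 500 operations per second, and a CPU time of 10 ms against a GPU time of 2 ms is a speedup of 5.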

Image Sizes

Name  Dimensions  Pixels
HD    1280×720    921K
FHD   1920×1080   2.1M
4K    3840×2160   8.3M
8K    7680×4320   33.2M
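
The pixel counts in the table are width × height, rounded. A small sketch that checks them at compile time is shown below; the `ImageSize` struct and `kImageSizes` array are hypothetical names for illustration and are not taken from the benchmark sources.

```cpp
#include <cstdint>

// Hypothetical description of one benchmark resolution.
struct ImageSize {
    const char* name;
    std::int64_t width;
    std::int64_t height;
    constexpr std::int64_t pixels() const { return width * height; }
};

constexpr ImageSize kImageSizes[] = {
    {"HD", 1280, 720},
    {"FHD", 1920, 1080},
    {"4K", 3840, 2160},
    {"8K", 7680, 4320},
};

// Exact pixel counts behind the rounded values in the table.
static_assert(kImageSizes[0].pixels() == 921600, "HD is about 921K pixels");
static_assert(kImageSizes[1].pixels() == 2073600, "FHD is about 2.1M pixels");
static_assert(kImageSizes[2].pixels() == 8294400, "4K is about 8.3M pixels");
static_assert(kImageSizes[3].pixels() == 33177600, "8K is about 33.2M pixels");
```

Each step from FHD to 4K to 8K quadruples the pixel count.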

Reproducibility

All benchmarks are available in the benchmarks/ directory. To build them and run the convolution benchmark:

bash
mkdir build && cd build
cmake -DBUILD_BENCHMARKS=ON ..
make -j$(nproc)
./bin/benchmark_convolution

Notes

  • CPU benchmarks use single-threaded OpenCV, so speedup figures compare the GPU against a single CPU core
  • GPU benchmarks include host-device transfer time
  • Results may vary based on GPU temperature and clock speed

Released under the MIT License.