Benchmarks

Performance comparison between Mini-OpenCV (GPU) and OpenCV (CPU).

Test Environment

| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 4090 (24 GB) |
| CPU | Intel i9-13900K |
| CUDA | 12.4 |
| OpenCV | 4.8 (CPU) |
| Image Size | 3840×2160 (4K) |

Performance Comparison

[Chart: Processing Time]

[Chart: Speedup Factor]

Detailed Results

Convolution Operations

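| Operation | Image Size | CPU (ms) | GPU (ms) | Speedup |
|---|---|---|---|---|
| Gaussian Blur (5×5) | 4K | 45.2 | 1.2 | 37.7× |
| Gaussian Blur (15×15) | 4K | 120.5 | 3.8 | 31.7× |
| Sobel Edge | 4K | 38.1 | 0.9 | 42.3× |
| Custom Kernel (7×7) | 4K | 65.3 | 2.1 | 31.1× |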

Filter Operations

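| Operation | Image Size | CPU (ms) | GPU (ms) | Speedup |
|---|---|---|---|---|
| Median Filter (3×3) | 4K | 28.4 | 2.5 | 11.4× |
| Bilateral Filter | 4K | 180.5 | 4.8 | 37.6× |
| Box Filter (5×5) | 4K | 25.2 | 0.8 | 31.5× |
| Sharpen | 4K | 42.1 | 1.1 | 38.3× |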

Geometric Operations

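| Operation | Image Size | CPU (ms) | GPU (ms) | Speedup |
|---|---|---|---|---|
| Resize (2× up) | 4K | 18.3 | 0.6 | 30.5× |
| Resize (0.5× down) | 4K | 8.2 | 0.3 | 27.3× |
| Rotate 90° | 4K | 5.4 | 0.2 | 27.0× |
| Flip Horizontal | 4K | 2.1 | 0.1 | 21.0× |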

Histogram Operations

| Operation | Image Size | CPU (ms) | GPU (ms) | Speedup |
|---|---|---|---|---|
| Histogram Calculation | 4K | 3.2 | 0.15 | 21.3× |
| Histogram Equalization | 4K | 12.3 | 0.3 | 41.0× |
| Otsu Threshold | 4K | 5.8 | 0.25 | 23.2× |

CUDA Optimization Techniques

1. Shared Memory Tiling

For convolution operations, we use shared memory tiling to reduce global memory access:

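cpp
// Kernel caches an image tile plus its halo region in shared memory
extern __shared__ float sharedMem[];
// 1. Each thread loads its pixel(s), including halo pixels, into sharedMem
// 2. __syncthreads() makes the whole tile visible to every thread in the block
// 3. The convolution then reads only from fast shared memory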

Benefit: ~10× speedup over naive global memory access
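
The tiling idea above can be sketched as a complete kernel. This is a minimal illustration, not the Mini-OpenCV implementation: the names `convolveTiled`, `launchConvolveTiled`, `d_kernel`, and `MAX_RADIUS` are hypothetical, and the sketch assumes a single-channel `float` image, clamp-to-edge borders, filter weights in constant memory, and 16×16 thread blocks. It applies the weights without flipping them (correlation), which is equivalent to convolution for symmetric kernels such as a Gaussian.

```cuda
#include <cuda_runtime.h>

// Hypothetical limits and names, chosen for this sketch only.
constexpr int MAX_RADIUS = 7;  // supports filters up to 15x15
__constant__ float d_kernel[(2 * MAX_RADIUS + 1) * (2 * MAX_RADIUS + 1)];

// Each block produces one blockDim.x x blockDim.y output tile. Dynamic shared
// memory holds that tile plus a halo of `radius` pixels on every side.
__global__ void convolveTiled(const float* in, float* out,
                              int width, int height, int radius)
{
    extern __shared__ float tile[];
    const int tileW = blockDim.x + 2 * radius;
    const int tileH = blockDim.y + 2 * radius;
    const int originX = blockIdx.x * blockDim.x - radius;  // image x of tile column 0
    const int originY = blockIdx.y * blockDim.y - radius;  // image y of tile row 0

    // Cooperative load: threads stride across the tile so the halo is filled too.
    for (int ty = threadIdx.y; ty < tileH; ty += blockDim.y) {
        for (int tx = threadIdx.x; tx < tileW; tx += blockDim.x) {
            int gx = min(max(originX + tx, 0), width - 1);   // clamp to edge (assumption)
            int gy = min(max(originY + ty, 0), height - 1);
            tile[ty * tileW + tx] = in[gy * width + gx];
        }
    }
    __syncthreads();  // every load must finish before any thread reads the tile

    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    const int k = 2 * radius + 1;
    float sum = 0.0f;
    for (int ky = 0; ky < k; ++ky)
        for (int kx = 0; kx < k; ++kx)
            sum += d_kernel[ky * k + kx] *
                   tile[(threadIdx.y + ky) * tileW + (threadIdx.x + kx)];
    out[y * width + x] = sum;
}

// Host helper: uploads the (2*radius+1)^2 filter weights and launches the kernel.
// d_in and d_out are device pointers to width*height floats.
cudaError_t launchConvolveTiled(const float* d_in, float* d_out,
                                int width, int height,
                                const float* h_kernel, int radius)
{
    if (radius < 0 || radius > MAX_RADIUS) return cudaErrorInvalidValue;
    const int k = 2 * radius + 1;
    cudaError_t err = cudaMemcpyToSymbol(d_kernel, h_kernel, k * k * sizeof(float));
    if (err != cudaSuccess) return err;

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    size_t shmem = (block.x + 2 * radius) * (block.y + 2 * radius) * sizeof(float);
    convolveTiled<<<grid, block, shmem>>>(d_in, d_out, width, height, radius);
    return cudaGetLastError();
}
```

With a 16×16 block and a 5×5 filter (radius 2), each block loads 20×20 = 400 pixels from global memory once, instead of issuing 256 × 25 = 6400 global reads in the naive version.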

2. Atomic Operations

Histogram calculations use atomic operations for parallel reduction:

cpp
__global__ void histogramKernel(...) {
    atomicAdd(&histogram[value], 1);
}

Benefit: Parallel histogram without race conditions
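
A fuller version of the atomic histogram might look like the sketch below. It assumes 8-bit grayscale input with 256 bins, and it adds one common refinement that the snippet above does not show: each block first accumulates into a private shared-memory histogram, then merges into the global one, so most atomics contend only within a block. The names `histogram256`, `launchHistogram256`, and `NUM_BINS` are hypothetical.

```cuda
#include <cuda_runtime.h>

constexpr int NUM_BINS = 256;  // one bin per 8-bit intensity (assumption)

__global__ void histogram256(const unsigned char* pixels, size_t count,
                             unsigned int* histogram)
{
    // Per-block private histogram in shared memory.
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Grid-stride loop: each thread handles several pixels.
    const size_t stride = static_cast<size_t>(blockDim.x) * gridDim.x;
    for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         i < count; i += stride) {
        atomicAdd(&local[pixels[i]], 1u);  // contention limited to this block
    }
    __syncthreads();

    // Merge the block's counts into the global histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        if (local[b] != 0) atomicAdd(&histogram[b], local[b]);
}

// Host helper: zeroes the NUM_BINS-entry device histogram, then launches.
cudaError_t launchHistogram256(const unsigned char* d_pixels, size_t count,
                               unsigned int* d_histogram)
{
    cudaError_t err = cudaMemset(d_histogram, 0, NUM_BINS * sizeof(unsigned int));
    if (err != cudaSuccess || count == 0) return err;

    const int threads = 256;
    size_t needed = (count + threads - 1) / threads;
    int blocks = needed > 1024 ? 1024 : static_cast<int>(needed);  // cap is arbitrary
    histogram256<<<blocks, threads>>>(d_pixels, count, d_histogram);
    return cudaGetLastError();
}
```

The result is the same as with the single global `atomicAdd`; only the amount of contention on global memory changes.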

3. Texture Memory

Image resize operations leverage texture memory for hardware interpolation:

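cpp
// Texture object API; texture references (cudaBindTextureToArray) were removed in CUDA 12
cudaCreateTextureObject(&texObj, &resDesc, &texDesc, nullptr);
tex2D<float>(texObj, x, y); // Hardware bilinear interpolation (cudaFilterModeLinear)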

Benefit: Free hardware interpolation, reduced kernel complexity
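
The sketch below shows one way a texture-based resize could be wired up end to end. It uses the texture object API (`cudaCreateTextureObject`), because the older texture reference API was removed in CUDA 12. The names `resizeBilinear` and `resizeWithTexture` are hypothetical, and the sketch assumes a single-channel `float` image, clamp addressing, and normalized coordinates. It is not the library's actual resize path.

```cuda
#include <cuda_runtime.h>

// Each thread samples the source texture at the center of its destination pixel.
__global__ void resizeBilinear(cudaTextureObject_t src, float* dst,
                               int dstW, int dstH)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    // Normalized coordinates in [0, 1); +0.5 targets the pixel center.
    const float u = (x + 0.5f) / dstW;
    const float v = (y + 0.5f) / dstH;
    dst[y * dstW + x] = tex2D<float>(src, u, v);  // filtered by the texture unit
}

// Host helper: copies h_src into a CUDA array, wraps it in a texture object,
// and writes the resized image to the device buffer d_dst (dstW*dstH floats).
cudaError_t resizeWithTexture(const float* h_src, int srcW, int srcH,
                              float* d_dst, int dstW, int dstH)
{
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
    cudaArray_t srcArray = nullptr;
    cudaError_t err = cudaMallocArray(&srcArray, &fmt, srcW, srcH);
    if (err != cudaSuccess) return err;

    cudaTextureObject_t tex = 0;
    err = cudaMemcpy2DToArray(srcArray, 0, 0, h_src,
                              srcW * sizeof(float),   // source pitch in bytes
                              srcW * sizeof(float),   // row width in bytes
                              srcH, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        cudaResourceDesc res = {};
        res.resType = cudaResourceTypeArray;
        res.res.array.array = srcArray;

        cudaTextureDesc desc = {};
        desc.addressMode[0] = cudaAddressModeClamp;
        desc.addressMode[1] = cudaAddressModeClamp;
        desc.filterMode = cudaFilterModeLinear;      // hardware bilinear filtering
        desc.readMode = cudaReadModeElementType;
        desc.normalizedCoords = 1;
        err = cudaCreateTextureObject(&tex, &res, &desc, nullptr);
    }
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((dstW + block.x - 1) / block.x, (dstH + block.y - 1) / block.y);
        resizeBilinear<<<grid, block>>>(tex, d_dst, dstW, dstH);
        err = cudaGetLastError();
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
        cudaDestroyTextureObject(tex);
    }
    cudaFreeArray(srcArray);
    return err;
}
```

Hardware linear filtering uses low-precision fixed-point interpolation weights, so results can differ slightly from a full-precision CPU bilinear resize.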

4. Multi-Stream Execution

Pipeline operations use multiple CUDA streams for overlap:

cpp
cudaStream_t streams[N];
for (int i = 0; i < N; i++) {
    cudaStreamCreate(&streams[i]);
    cudaMemcpyAsync(..., streams[i]);   // host buffer must be pinned for the copy to overlap
    kernel<<<..., streams[i]>>>(...);
}

Benefit: Overlap compute and transfer, higher throughput
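
The stream pattern above can be expanded into a small, complete pipeline. This is a sketch under stated assumptions: the names `scaleKernel` and `processInStreams` are hypothetical, the "operation" is a trivial per-element multiply standing in for a real image kernel, and the host buffer must be pinned (allocated with `cudaMallocHost` or `cudaHostAlloc`) for the copies to overlap with compute.

```cuda
#include <cuda_runtime.h>

// Stand-in for a real image operation: multiplies every element by `gain`.
__global__ void scaleKernel(float* data, size_t n, float gain)
{
    const size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= gain;
}

// Splits h_data into chunks and pipelines copy-in, compute, copy-out per stream.
// h_data should be pinned host memory; with pageable memory the code still
// produces correct results but the copies will not overlap with kernels.
cudaError_t processInStreams(float* h_data, size_t n, float gain)
{
    constexpr int NUM_STREAMS = 4;  // arbitrary choice for this sketch
    float* d_data = nullptr;
    cudaError_t err = cudaMalloc(&d_data, n * sizeof(float));
    if (err != cudaSuccess) return err;

    cudaStream_t streams[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; ++i) cudaStreamCreate(&streams[i]);

    const size_t chunk = (n + NUM_STREAMS - 1) / NUM_STREAMS;
    for (int i = 0; i < NUM_STREAMS; ++i) {
        const size_t offset = static_cast<size_t>(i) * chunk;
        if (offset >= n) break;
        const size_t len = (n - offset < chunk) ? (n - offset) : chunk;
        const size_t bytes = len * sizeof(float);

        // Work queued on one stream runs in order; different streams may overlap.
        cudaMemcpyAsync(d_data + offset, h_data + offset, bytes,
                        cudaMemcpyHostToDevice, streams[i]);
        const int threads = 256;
        const int blocks = static_cast<int>((len + threads - 1) / threads);
        scaleKernel<<<blocks, threads, 0, streams[i]>>>(d_data + offset, len, gain);
        cudaMemcpyAsync(h_data + offset, d_data + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    err = cudaDeviceSynchronize();  // wait for every stream; reports async errors
    for (int i = 0; i < NUM_STREAMS; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(d_data);
    return err;
}
```

While one stream is running its kernel, another can be copying its chunk to or from the device, which is where the throughput gain comes from.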

Reproducing Benchmarks

bash
# Build with benchmarks
cmake -S . -B build -DBUILD_BENCHMARKS=ON
cmake --build build -j$(nproc)

# Run benchmarks
./build/bin/benchmark_convolution
./build/bin/benchmark_filters
./build/bin/benchmark_geometric

Methodology

See Methodology for detailed testing procedures and hardware specifications.

Released under the MIT License.