Benchmarking Guide

How to measure and analyze kernel performance in TensorCraft-HPC.

Running Benchmarks

Built-in Benchmarks

bash

# Build with benchmarks enabled
cmake --preset release
cmake --build --preset release --parallel 2

# Run GEMM benchmark
./build/release/benchmarks/gemm_benchmark

# Run attention benchmark
./build/release/benchmarks/attention_benchmark

# Run convolution benchmark
./build/release/benchmarks/conv_benchmark

Benchmark Output

GEMM Benchmark (A100, CUDA 12.4)
--------------------------------
Size        Time (ms)   TFLOPs    vs cuBLAS
1024x1024   0.42        5.1       92%
2048x2048   2.1         8.2       89%
4096x4096   12.4        11.1      85%
8192x8192   85.2        12.9      82%

Profiling Tools

Nsight Compute

For detailed kernel analysis:

bash

# Profile a specific kernel
ncu --set full -o profile_report ./build/release/benchmarks/gemm_benchmark

# View report
ncu-ui profile_report.ncu-rep

Key metrics to examine:

Memory Throughput — Achieved vs. peak bandwidth
Compute Throughput — FLOP utilization
Warp Execution Efficiency — Active warp percentage
Shared Memory Bank Conflicts — Serialization events

Nsight Systems

For timeline analysis:

bash

# System-wide profiling
nsys profile -o timeline ./build/release/benchmarks/gemm_benchmark

# View timeline
nsys-ui timeline.qdrep

Performance Metrics

Roofline Model

Key Metrics

Metric	Formula	Target
Arithmetic Intensity	FLOPs / Bytes	> 100 for GEMM
Memory Efficiency	Achieved / Peak BW	> 80%
Compute Efficiency	Achieved / Peak FLOPS	> 80%
Occupancy	Active Warps / Max Warps	50-100%

Benchmarking Best Practices

Warm-up Runs

Always include warm-up runs to eliminate initialization overhead:

cpp

// Warm-up
for (int i = 0; i < 10; i++) kernel();

// Timed runs
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 100; i++) kernel();
auto end = std::chrono::high_resolution_clock::now();

Multiple Runs

Run benchmarks multiple times and report statistics:

cpp

std::vector<double> times;
for (int run = 0; run < 100; run++) {
    auto t = measure_kernel();
    times.push_back(t);
}
// Report mean, std, min, max

Avoid Micro-benchmarks Pitfalls

Don't benchmark sizes that fit entirely in cache
Use realistic input sizes for your use case
Consider batch effects in production

Comparing to Baselines

cuBLAS for GEMM

cpp

#include <cublas_v2.h>

// cuBLAS baseline
cublasHandle_t handle;
cublasCreate(&handle);

float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            N, M, K, &alpha, B, N, A, K, &beta, C, N);

cuDNN for Convolution

cpp

#include <cudnn.h>

// cuDNN baseline setup
cudnnHandle_t cudnn;
cudnnCreate(&cudnn);
// ... configure and run

Benchmarking Guide

Running Benchmarks

Built-in Benchmarks

Benchmark Output

Profiling Tools

Nsight Compute

Nsight Systems

Performance Metrics

Roofline Model

Key Metrics

Benchmarking Best Practices

Comparing to Baselines

cuBLAS for GEMM

cuDNN for Convolution

Interpreting Results

Good Performance Indicators

Warning Signs

Benchmarking Guide ​

Running Benchmarks ​

Built-in Benchmarks ​

Benchmark Output ​

Profiling Tools ​

Nsight Compute ​

Nsight Systems ​

Performance Metrics ​

Roofline Model ​

Key Metrics ​

Benchmarking Best Practices ​

Comparing to Baselines ​

cuBLAS for GEMM ​

cuDNN for Convolution ​

Interpreting Results ​

Good Performance Indicators ​

Warning Signs ​

Benchmarking Guide

Running Benchmarks

Built-in Benchmarks

Benchmark Output

Profiling Tools

Nsight Compute

Nsight Systems

Performance Metrics

Roofline Model

Key Metrics

Benchmarking Best Practices

Comparing to Baselines

cuBLAS for GEMM

cuDNN for Convolution

Interpreting Results

Good Performance Indicators

Warning Signs