# Performance Analysis

This document presents the benchmarking methodology, performance results, and optimization analysis for TensorCraft-HPC.


## Benchmarking Methodology

### Environment

| Component | Specification |
|---|---|
| GPU | NVIDIA A100 80GB |
| CUDA | 12.4 |
| Driver | 550.x |
| OS | Ubuntu 22.04 |
| Compiler | GCC 11.4 / NVCC 12.4 |

### Measurement Protocol

  1. Warm-up: 10 iterations before measurement
  2. Samples: 100 iterations per measurement
  3. Metrics: Mean, standard deviation, min, max
  4. Validation: Numerical correctness verified against reference

### Baseline References

| Operation | Reference Library |
|---|---|
| GEMM | cuBLAS |
| Attention | cuDNN / FlashAttention |
| Normalization | cuDNN |
| Convolution | cuDNN |
| Sparse | cuSPARSE |

## GEMM Performance

### FP16 Tensor Core (A100)

| Matrix Size | TensorCraft | cuBLAS | Ratio |
|---|---|---|---|
| 512×512 | 0.15 ms | 0.14 ms | 93% |
| 1024×1024 | 0.82 ms | 0.71 ms | 87% |
| 2048×2048 | 3.1 ms | 2.8 ms | 89% |
| 4096×4096 | 12.1 ms | 11.0 ms | 91% |
| 8192×8192 | 95.2 ms | 88.0 ms | 92% |

### Scaling Across Architectures

| GPU | SMs | 4096² FP16 | cuBLAS | Ratio |
|---|---|---|---|---|
| V100 | 70 | 14.2 ms | 12.8 ms | 89% |
| A100 | 80 | 12.1 ms | 11.0 ms | 91% |
| H100 | 90 | 8.5 ms | 7.8 ms | 92% |

## Optimization Stage Analysis


## FlashAttention Performance

### Memory Footprint Comparison

| Sequence Length | Standard Attention | FlashAttention | Reduction |
|---|---|---|---|
| 1024 | 512 MB | 64 MB | 8× |
| 2048 | 2 GB | 128 MB | 16× |
| 4096 | 8 GB | 256 MB | 32× |
| 8192 | 32 GB | 512 MB | 64× |

### Latency Comparison

| Config | TensorCraft | cuDNN | Ratio |
|---|---|---|---|
| 32×128×64 | 0.12 ms | 0.10 ms | 85% |
| 64×256×64 | 0.45 ms | 0.38 ms | 84% |
| 128×512×64 | 1.8 ms | 1.5 ms | 83% |

## Normalization Performance

| Operation | TensorCraft | cuDNN | Ratio |
|---|---|---|---|
| LayerNorm (4096×4096) | 0.08 ms | 0.07 ms | 95% |
| RMSNorm (4096×4096) | 0.06 ms | 0.05 ms | 95% |
| Fused LayerNorm + Dropout | 0.09 ms | 0.08 ms | 94% |

## Convolution Performance

| Config | TensorCraft | cuDNN | Ratio |
|---|---|---|---|
| Conv2D 3×3, 256×256 | 0.42 ms | 0.35 ms | 78% |
| Conv2D 1×1, 512×512 | 0.28 ms | 0.22 ms | 78% |
| Depthwise 3×3 | 0.15 ms | 0.12 ms | 80% |

### Performance Gap

Convolution kernels currently use an Im2Col-based implementation. Closing the remaining gap requires the Winograd algorithm and kernel auto-tuning, both planned for future releases.


## Sparse Operations Performance

| Operation | Format | TensorCraft | cuSPARSE | Ratio |
|---|---|---|---|---|
| SpMV | CSR | 0.35 ms | 0.30 ms | 88% |
| SpMM | CSR | 1.2 ms | 1.0 ms | 85% |

## Performance Model

### Roofline Analysis

The performance of GEMM is bounded by:

  1. Memory Bandwidth: for small matrices
  2. Compute Throughput: for large matrices

The transition occurs roughly at the matrix dimension where the kernel's arithmetic intensity reaches the machine balance:

M_critical ≈ (Compute_TP × sizeof(T)) / Memory_BW

For A100 with FP16:

  • Memory BW: 2039 GB/s
  • Tensor Core TP: 312 TFLOPS
  • M_critical ≈ 256

### Arithmetic Intensity

| Operation | Arithmetic Intensity | Bound |
|---|---|---|
| GEMM | O(N) | Compute |
| FlashAttention | O(N) | Compute |
| LayerNorm | O(1) | Memory |
| Softmax | O(1) | Memory |

## Optimization Techniques

### Memory Coalescing

```cpp
// Bad: strided access, adjacent threads hit addresses `stride` apart
float val = input[threadIdx.x * stride];

// Good: coalesced access, adjacent threads read adjacent addresses
float val = input[threadIdx.x];
```

### Shared Memory Banking

```cpp
// Avoid bank conflicts
__shared__ float tile[32][33];  // +1 column of padding
tile[ty][tx] = ...;             // no bank conflicts
```

### Warp-Level Primitives

```cpp
// Efficient reduction within a warp
float sum = warp_reduce_sum(val);
```

## Benchmark Reproduction

```bash
# Build benchmarks
cmake --preset dev
cmake --build --preset dev

# Run GEMM benchmark
./build/benchmarks/gemm_benchmark --benchmark_filter="FP16"

# Run all benchmarks
ctest --preset dev -R benchmark
```

## Performance Regression

TensorCraft-HPC includes automated performance regression testing:

```yaml
# .github/workflows/benchmark.yml
- name: Run benchmarks
  run: |
    ./build/benchmarks/gemm_benchmark --benchmark_format=json > results.json
    python scripts/check_regression.py results.json baseline.json
```

Thresholds:

  • Warning: >5% regression
  • Failure: >10% regression

Released under the Apache 2.0 License.