# GEMM Kernels
General Matrix Multiply (GEMM) is the fundamental operation in deep learning. TensorCraft-HPC provides progressive optimization paths from naive to Tensor Core implementations.
## Overview
GEMM computes C = α × A × B + β × C, where:

- `A` is an M×K matrix
- `B` is a K×N matrix
- `C` is an M×N matrix
- `α` and `β` are scalar coefficients
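Elementwise, each output entry is a scaled dot product of a row of A with a column of B:

$$C_{ij} = \alpha \sum_{k=0}^{K-1} A_{ik}\,B_{kj} + \beta\,C_{ij}$$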
## Why GEMM Matters
GEMM typically accounts for 80-90% of the computation in modern neural networks, so understanding how to optimize it is crucial for building high-performance AI systems.
## Optimization Path
TensorCraft-HPC provides four levels of GEMM optimization; a sketch of the Level 1 kernel follows the table:
| Level | Name | Key Technique | Performance (vs. cuBLAS) |
|---|---|---|---|
| 1 | Naive | Direct triple loop | ~5% |
| 2 | Tiled | Shared memory blocking | ~45% |
| 3 | Double Buffer | Pipelined memory access | ~75% |
| 4 | Tensor Core | WMMA instructions | ~92% |
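To make Level 1 concrete, here is what the direct triple loop looks like as a CUDA kernel: one thread per output element, with every operand read straight from global memory. This is an illustrative sketch (the kernel name is hypothetical), not TensorCraft's actual implementation.

```cpp
#include <cstddef>

// Level 1 (naive): direct triple loop, one thread per element of C.
// Illustrative sketch only; not the shipped TensorCraft kernel.
template <typename T>
__global__ void gemm_naive_sketch(const T* A, const T* B, T* C,
                                  size_t M, size_t N, size_t K,
                                  T alpha, T beta) {
    size_t row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    size_t col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (row >= M || col >= N) return;

    T acc = T(0);
    for (size_t k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];  // every operand from global memory

    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}
```

Every multiply fetches both operands from global memory, which is why this level reaches only a few percent of cuBLAS; the higher levels raise arithmetic intensity by reusing data from shared memory and registers.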
## API Reference
### Core Functions
#### `gemm<T>(A, B, C, M, N, K, alpha, beta)`
Performs general matrix multiplication.
```cpp
template<typename T>
void gemm(
const T* A, // Input matrix A (M×K)
const T* B, // Input matrix B (K×N)
T* C, // Output matrix C (M×N)
size_t M, // Rows of A and C
size_t N, // Columns of B and C
size_t K, // Columns of A / Rows of B
T alpha = 1.0, // Scalar multiplier for A×B
T beta = 0.0 // Scalar multiplier for C
);
```

**Template Parameters:**
- `T` — Data type: `float`, `double`, `half` (FP16), or `__nv_bfloat16`
**Example:**

```cpp
#include "tensorcraft/kernels/gemm.hpp"
#include "tensorcraft/memory/tensor.hpp"
using namespace tensorcraft;
// Create matrices
FloatTensor A({4096, 4096});
FloatTensor B({4096, 4096});
FloatTensor C({4096, 4096});
// Initialize A and B with data...
// Compute C = A × B
kernels::gemm(A.data(), B.data(), C.data(), 4096, 4096, 4096);
```

### Specialized Variants
#### `gemm_fp16` — FP16 Tensor Core
Optimized for FP16 computation using Tensor Cores; a simplified WMMA sketch follows the requirements below.
```cpp
void gemm_fp16(
const half* A,
const half* B,
half* C,
size_t M, size_t N, size_t K
);
```

**Requirements:**
- SM70+ (Volta or later)
- CUDA 11.0+
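To illustrate the WMMA technique behind `gemm_fp16`, the fragment below has each warp accumulate one 16×16 tile of C over the K dimension. It is a simplified sketch (dimensions assumed to be multiples of 16, grid sized to cover C, no shared-memory staging), not the library's actual kernel.

```cpp
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16×16 output tile via Tensor Core MMA.
// Sketch only: assumes M, N, K are multiples of 16 and row-major layout.
__global__ void wmma_tile_sketch(const half* A, const half* B, half* C,
                                 int M, int N, int K) {
    // Warp coordinates in the output tile grid (one warp per 16×16 tile).
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
    wmma::fill_fragment(c_frag, __float2half(0.0f));

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + warpN * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core MMA
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, c_frag, N,
                            wmma::mem_row_major);
}
```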
#### `gemm_batched` — Batched GEMM
Computes multiple independent GEMM operations.
```cpp
template<typename T>
void gemm_batched(
const T* const A[], // Array of A matrices
const T* const B[], // Array of B matrices
T* const C[], // Array of C matrices
size_t batch_count,   // Number of independent GEMMs
size_t M, size_t N, size_t K  // Dimensions shared by all GEMMs in the batch
);
```

## Performance Benchmarks
### A100 80GB, FP16 Tensor Core
| Matrix Size | TensorCraft | cuBLAS | Ratio |
|---|---|---|---|
| 512×512 | 0.15ms | 0.14ms | 93% |
| 1024×1024 | 0.82ms | 0.71ms | 87% |
| 2048×2048 | 3.1ms | 2.8ms | 89% |
| 4096×4096 | 12.1ms | 11.0ms | 91% |
| 8192×8192 | 95.2ms | 88.0ms | 92% |
### Scaling Across Architectures
| GPU | SM | 4096² FP16 | cuBLAS | Ratio |
|---|---|---|---|---|
| V100 | 70 | 14.2ms | 12.8ms | 89% |
| A100 | 80 | 12.1ms | 11.0ms | 91% |
| H100 | 90 | 8.5ms | 7.8ms | 92% |
## Usage Examples
### Basic Usage

```cpp
#include "tensorcraft/kernels/gemm.hpp"
// FP32 GEMM
tensorcraft::kernels::gemm(A_f32, B_f32, C_f32, M, N, K);
// FP16 GEMM (Tensor Core)
tensorcraft::kernels::gemm_fp16(A_f16, B_f16, C_f16, M, N, K);
```

### With Python Bindings
```python
import tensorcraft_ops as tc
import numpy as np
# Create matrices
A = np.random.randn(4096, 4096).astype(np.float16)
B = np.random.randn(4096, 4096).astype(np.float16)
# GPU-accelerated GEMM
C = tc.gemm(A, B)
```

### Batched Processing
```cpp
#include "tensorcraft/kernels/gemm.hpp"
std::vector<const half*> A_batch(batch_size);
std::vector<const half*> B_batch(batch_size);
std::vector<half*> C_batch(batch_size);
// Initialize batch pointers...
tensorcraft::kernels::gemm_batched(
A_batch.data(), B_batch.data(), C_batch.data(),
batch_size, M, N, K
);
```

## Implementation Details
### Memory Layout
All matrices are expected in row-major order:
```
A[M×K]: A[0,0], A[0,1], ..., A[0,K-1], A[1,0], ...
B[K×N]: B[0,0], B[0,1], ..., B[0,N-1], B[1,0], ...
C[M×N]: C[0,0], C[0,1], ..., C[0,N-1], C[1,0], ...
```
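In other words, element (i, j) of a row-major matrix with leading dimension `ld` (its column count) lives at offset `i * ld + j`. A hypothetical helper, for illustration only (not part of the TensorCraft API):

```cpp
#include <cstddef>

// Row-major offset: ld is the matrix's column count (leading dimension).
inline size_t rm_offset(size_t i, size_t j, size_t ld) { return i * ld + j; }

// A is M×K: A(i, k) -> A[rm_offset(i, k, K)]
// B is K×N: B(k, j) -> B[rm_offset(k, j, N)]
// C is M×N: C(i, j) -> C[rm_offset(i, j, N)]
```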
### Thread Block Configuration

| Optimization | Block Size (threads) | Tile Size |
|---|---|---|
| Tiled | 256 | 32×32 |
| Double Buffer | 256 | 32×32 (two buffers) |
| Tensor Core | 128 | 64×64 (WMMA) |
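For reference, a minimal version of the shared-memory tiling idea looks like the sketch below. For clarity it uses one thread per output element (a 32×32 = 1024-thread block), whereas the shipped Tiled kernel uses 256 threads with each thread computing several outputs; the kernel name and launch shape are illustrative.

```cpp
#include <cstddef>

// Level 2 idea: stage 32×32 tiles of A and B in shared memory and reuse them.
// Simplified sketch (one thread per output element), not the shipped kernel.
template <typename T, int TILE = 32>
__global__ void gemm_tiled_sketch(const T* A, const T* B, T* C,
                                  size_t M, size_t N, size_t K) {
    __shared__ T As[TILE][TILE];
    __shared__ T Bs[TILE][TILE];

    size_t row = blockIdx.y * TILE + threadIdx.y;
    size_t col = blockIdx.x * TILE + threadIdx.x;
    T acc = T(0);

    for (size_t k0 = 0; k0 < K; k0 += TILE) {
        // Cooperative load: each thread stages one element of each tile.
        As[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : T(0);
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : T(0);
        __syncthreads();

        // Each element loaded once from global memory is reused TILE times.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}
```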
### Shared Memory Usage
- Tiled: 2 × 32 × 32 × sizeof(T) per block
- Double Buffer: 4 × 32 × 32 × sizeof(T) per block
- Tensor Core: 2 × 64 × 64 × sizeof(T) per block
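As a worked example, with `T = half` (2 bytes) these formulas give:

```cpp
#include <cstddef>
#include <cuda_fp16.h>

// Shared memory per block for T = half (sizeof(__half) == 2 bytes).
constexpr size_t kTiled        = 2 * 32 * 32 * sizeof(__half);  //  4 KiB
constexpr size_t kDoubleBuffer = 4 * 32 * 32 * sizeof(__half);  //  8 KiB
constexpr size_t kTensorCore   = 2 * 64 * 64 * sizeof(__half);  // 16 KiB
static_assert(kTensorCore == 16 * 1024, "largest footprint is 16 KiB");
```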
## References
- CUTLASS — NVIDIA's C++ templates for GEMM
- cuBLAS Documentation — Reference implementation
- Tensor Core Programming Guide
## Next Steps
- GEMM Tutorial — Build GEMM from scratch
- GEMM Benchmarks — Detailed performance analysis