Memory Hierarchy
This document explains GPU memory hierarchy and its impact on GEMM optimization.
GPU Memory Hierarchy Overview
┌─────────────────────────────────────────────────────────────┐
│ CPU Host Memory │
│ (DDR4/DDR5, Large Capacity) │
│ Latency: ~100-300 ns │
└─────────────────────────────────────────────────────────────┘
│
│ PCIe 4.0/5.0 (~32 GB/s)
▼
┌─────────────────────────────────────────────────────────────┐
│ GPU Global Memory │
│ (GDDR6/HBM, 16-80 GB) │
│ Latency: ~400-800 cycles │
│ Bandwidth: 400-900 GB/s │
└─────────────────────────────────────────────────────────────┘
│
│
┌─────────────────────────────────────────────────────────────┐
│ L2 Cache │
│ (6-50 MB, Shared) │
│ Latency: ~100-200 cycles │
└─────────────────────────────────────────────────────────────┘
│
│
┌─────────────────────────────────────────────────────────────┐
│ Shared Memory │
│ (48-164 KB/SM, User Managed) │
│ Latency: ~20-30 cycles │
│ Bandwidth: ~Several TB/s │
└─────────────────────────────────────────────────────────────┘
│
│
┌─────────────────────────────────────────────────────────────┐
│ Register File │
│ (64K 32-bit/SM) │
│ Latency: ~1 cycle │
│ Bandwidth: ~Dozens of TB/s │
└─────────────────────────────────────────────────────────────┘Memory Level Details
1. Global Memory
Characteristics:
- Largest capacity (16-80 GB)
- Highest latency (400-800 cycles)
- Accessible by all threads
- Communicates with Host via PCIe
Access Pattern:
cuda
// Global memory access
__global__ void kernel(float* global_data) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float val = global_data[idx]; // Global memory load
global_data[idx] = val * 2.0f; // Global memory store
}Coalesced Access:
When threads in a warp access consecutive addresses, accesses can be coalesced:
cuda
// ✅ Coalesced: threads access consecutive addresses
int idx = threadIdx.x;
float val = data[idx]; // Coalesced into one 128-byte transaction
// ❌ Non-coalesced: threads access strided addresses
int idx = threadIdx.x * stride; // stride > 1
float val = data[idx]; // Multiple independent transactions2. Shared Memory
Characteristics:
- User-managed cache
- Low latency (~20-30 cycles)
- High bandwidth (~Several TB/s)
- Shared within thread block
Bank Conflict:
Shared memory is divided into 32 banks, each 4 bytes wide:
cuda
__shared__ float data[32][32];
// ✅ No bank conflict: different banks
float val = data[threadIdx.y][threadIdx.x];
// ❌ Bank conflict: same bank accessed by warp
float val = data[threadIdx.x][threadIdx.x]; // 32-way conflict
// ✅ Add padding to avoid conflict
__shared__ float data[32][33]; // +1 padding
float val = data[threadIdx.x][threadIdx.x]; // No conflictApplication in GEMM:
cuda
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Load to shared memory
As[ty][tx] = A[row * K + k + tx];
Bs[ty][tx] = B[(k + ty) * N + col];
__syncthreads(); // Block synchronization
// Compute from shared memory
for (int i = 0; i < BLOCK_SIZE; i++) {
sum += As[ty][i] * Bs[i][tx];
}3. Registers
Characteristics:
- Fastest storage (~1 cycle)
- Thread-private
- Limited quantity (64K 32-bit/SM)
- Spilling degrades performance
Register Blocking Optimization:
cuda
// Each thread computes TILE_M × TILE_N output block
float regA[TILE_M]; // Register stores A's row
float regB[TILE_N]; // Register stores B's column
float regC[TILE_M][TILE_N] = {0}; // Accumulator
for (int k = 0; k < K; k += BLOCK_SIZE) {
// Load to registers
for (int i = 0; i < TILE_M; i++) {
regA[i] = As[ty * TILE_M + i][tk];
}
for (int j = 0; j < TILE_N; j++) {
regB[j] = Bs[tk][tx * TILE_N + j];
}
// Compute in registers (no memory access)
for (int i = 0; i < TILE_M; i++) {
for (int j = 0; j < TILE_N; j++) {
regC[i][j] += regA[i] * regB[j];
}
}
}Memory Access Optimization Strategies
Strategy 1: Data Reuse
Principle: Each loaded data participates in multiple computations
Naive GEMM:
- Each output element needs 2K memory accesses
- Total accesses: 2 × M × N × K
Tiled GEMM:
- Each data block loaded once, reused BLOCK_SIZE times
- Total accesses reduced to: 2 × M × N × K / BLOCK_SIZEStrategy 2: Latency Hiding
Principle: Overlap computation with memory access
cuda
// Double buffering: load and compute overlap
__shared__ float buffer[2][BLOCK_SIZE][BLOCK_SIZE];
int current = 0, next = 1;
for (int k = 0; k < K; k += BLOCK_SIZE) {
// Asynchronously load next block
load_async(buffer[next], k + BLOCK_SIZE);
// Compute current block
compute(buffer[current]);
// Synchronize and swap buffers
__syncthreads();
swap(current, next);
}Strategy 3: Vectorized Loading
Principle: Use SIMD instructions to load multiple data at once
cuda
// Use float4 to load 4 floats at once
float4* A4 = reinterpret_cast<float4*>(A_shared);
float4 a = A4[tx]; // Load 4 floats at once
// Unrolled computation
sum0 += a.x * b.x;
sum1 += a.y * b.y;
sum2 += a.z * b.z;
sum3 += a.w * b.w;Performance Analysis Tools
Nsight Compute Memory Analysis
bash
# Analyze memory throughput
ncu --metrics gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed \
./benchmark
# Analyze shared memory bank conflicts
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum \
./benchmarkKey Metrics
| Metric | Meaning | Target |
|---|---|---|
| DRAM Throughput | Global memory throughput | > 80% |
| Shared Memory Bank Conflicts | Bank conflict count | Near 0 |
| L2 Cache Hit Rate | L2 cache hit rate | Higher is better |
| Register Usage | Registers per thread | < 255 |
GPU Architecture Differences
| Architecture | Shared Memory/SM | Registers/SM | L2 Cache | Features |
|---|---|---|---|---|
| Volta (V100) | 96 KB | 64K | 6 MB | Introduced Tensor Core |
| Ampere (A100) | 164 KB | 64K | 40 MB | Large L2, TF32 |
| Hopper (H100) | 228 KB | 64K | 50 MB | FP8, TMA |
| Ada (RTX 4090) | 100 KB | 64K | 6 MB | Consumer |
Optimization Recommendations:
- Adjust BLOCK_SIZE for target architecture
- Leverage architecture features (e.g., Tensor Core)
- Use AutoTuner to automatically search optimal parameters
References
- CUDA C Best Practices Guide - NVIDIA
- Programming Massively Parallel Processors - Kirk & Hwu
- CUDA Optimization Guide - NVIDIA