Memory Hierarchy

This document explains GPU memory hierarchy and its impact on GEMM optimization.

GPU Memory Hierarchy Overview

┌─────────────────────────────────────────────────────────────┐
│                      CPU Host Memory                         │
│                    (DDR4/DDR5, Large Capacity)               │
│                      Latency: ~100-300 ns                    │
└─────────────────────────────────────────────────────────────┘
                              │
                              │ PCIe 4.0/5.0 (~32 GB/s)
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    GPU Global Memory                         │
│                    (GDDR6/HBM, 16-80 GB)                     │
│                      Latency: ~400-800 cycles                │
│                      Bandwidth: 400-900 GB/s                 │
└─────────────────────────────────────────────────────────────┘
                              │
                              │
┌─────────────────────────────────────────────────────────────┐
│                      L2 Cache                                │
│                    (6-50 MB, Shared)                         │
│                      Latency: ~100-200 cycles                │
└─────────────────────────────────────────────────────────────┘
                              │
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Shared Memory                             │
│                    (48-164 KB/SM, User Managed)              │
│                      Latency: ~20-30 cycles                  │
│                      Bandwidth: ~Several TB/s                │
└─────────────────────────────────────────────────────────────┘
                              │
                              │
┌─────────────────────────────────────────────────────────────┐
│                   Register File                              │
│                    (64K 32-bit/SM)                           │
│                      Latency: ~1 cycle                       │
│                      Bandwidth: ~Dozens of TB/s              │
└─────────────────────────────────────────────────────────────┘

Memory Level Details

1. Global Memory

Characteristics:

Largest capacity (16-80 GB)
Highest latency (400-800 cycles)
Accessible by all threads
Communicates with Host via PCIe

Access Pattern:

cuda

// Global memory access
__global__ void kernel(float* global_data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = global_data[idx];  // Global memory load
    global_data[idx] = val * 2.0f; // Global memory store
}

Coalesced Access:

When threads in a warp access consecutive addresses, accesses can be coalesced:

cuda

// ✅ Coalesced: threads access consecutive addresses
int idx = threadIdx.x;
float val = data[idx];  // Coalesced into one 128-byte transaction

// ❌ Non-coalesced: threads access strided addresses
int idx = threadIdx.x * stride;  // stride > 1
float val = data[idx];  // Multiple independent transactions

2. Shared Memory

Characteristics:

User-managed cache
Low latency (~20-30 cycles)
High bandwidth (~Several TB/s)
Shared within thread block

Bank Conflict:

Shared memory is divided into 32 banks, each 4 bytes wide:

cuda

__shared__ float data[32][32];

// ✅ No bank conflict: different banks
float val = data[threadIdx.y][threadIdx.x];

// ❌ Bank conflict: same bank accessed by warp
float val = data[threadIdx.x][threadIdx.x];  // 32-way conflict

// ✅ Add padding to avoid conflict
__shared__ float data[32][33];  // +1 padding
float val = data[threadIdx.x][threadIdx.x];  // No conflict

Application in GEMM:

cuda

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

// Load to shared memory
As[ty][tx] = A[row * K + k + tx];
Bs[ty][tx] = B[(k + ty) * N + col];

__syncthreads();  // Block synchronization

// Compute from shared memory
for (int i = 0; i < BLOCK_SIZE; i++) {
    sum += As[ty][i] * Bs[i][tx];
}

3. Registers

Characteristics:

Fastest storage (~1 cycle)
Thread-private
Limited quantity (64K 32-bit/SM)
Spilling degrades performance

Register Blocking Optimization:

cuda

// Each thread computes TILE_M × TILE_N output block
float regA[TILE_M];  // Register stores A's row
float regB[TILE_N];  // Register stores B's column
float regC[TILE_M][TILE_N] = {0};  // Accumulator

for (int k = 0; k < K; k += BLOCK_SIZE) {
    // Load to registers
    for (int i = 0; i < TILE_M; i++) {
        regA[i] = As[ty * TILE_M + i][tk];
    }
    for (int j = 0; j < TILE_N; j++) {
        regB[j] = Bs[tk][tx * TILE_N + j];
    }
    
    // Compute in registers (no memory access)
    for (int i = 0; i < TILE_M; i++) {
        for (int j = 0; j < TILE_N; j++) {
            regC[i][j] += regA[i] * regB[j];
        }
    }
}

Memory Access Optimization Strategies

Strategy 1: Data Reuse

Principle: Each loaded data participates in multiple computations

Naive GEMM:
- Each output element needs 2K memory accesses
- Total accesses: 2 × M × N × K

Tiled GEMM:
- Each data block loaded once, reused BLOCK_SIZE times
- Total accesses reduced to: 2 × M × N × K / BLOCK_SIZE

Strategy 2: Latency Hiding

Principle: Overlap computation with memory access

cuda

// Double buffering: load and compute overlap
__shared__ float buffer[2][BLOCK_SIZE][BLOCK_SIZE];
int current = 0, next = 1;

for (int k = 0; k < K; k += BLOCK_SIZE) {
    // Asynchronously load next block
    load_async(buffer[next], k + BLOCK_SIZE);
    
    // Compute current block
    compute(buffer[current]);
    
    // Synchronize and swap buffers
    __syncthreads();
    swap(current, next);
}

Strategy 3: Vectorized Loading

Principle: Use SIMD instructions to load multiple data at once

cuda

// Use float4 to load 4 floats at once
float4* A4 = reinterpret_cast<float4*>(A_shared);
float4 a = A4[tx];  // Load 4 floats at once

// Unrolled computation
sum0 += a.x * b.x;
sum1 += a.y * b.y;
sum2 += a.z * b.z;
sum3 += a.w * b.w;

Performance Analysis Tools

Nsight Compute Memory Analysis

bash

# Analyze memory throughput
ncu --metrics gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed \
    ./benchmark

# Analyze shared memory bank conflicts
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum \
    ./benchmark

Key Metrics

Metric	Meaning	Target
DRAM Throughput	Global memory throughput	> 80%
Shared Memory Bank Conflicts	Bank conflict count	Near 0
L2 Cache Hit Rate	L2 cache hit rate	Higher is better
Register Usage	Registers per thread	< 255

GPU Architecture Differences

Architecture	Shared Memory/SM	Registers/SM	L2 Cache	Features
Volta (V100)	96 KB	64K	6 MB	Introduced Tensor Core
Ampere (A100)	164 KB	64K	40 MB	Large L2, TF32
Hopper (H100)	228 KB	64K	50 MB	FP8, TMA
Ada (RTX 4090)	100 KB	64K	6 MB	Consumer

Optimization Recommendations:

Adjust BLOCK_SIZE for target architecture
Leverage architecture features (e.g., Tensor Core)
Use AutoTuner to automatically search optimal parameters

References

CUDA C Best Practices Guide - NVIDIA
Programming Massively Parallel Processors - Kirk & Hwu
CUDA Optimization Guide - NVIDIA

Memory Hierarchy ​

GPU Memory Hierarchy Overview ​

Memory Level Details ​

1. Global Memory ​

2. Shared Memory ​

3. Registers ​

Memory Access Optimization Strategies ​

Strategy 1: Data Reuse ​

Strategy 2: Latency Hiding ​

Strategy 3: Vectorized Loading ​

Performance Analysis Tools ​

Nsight Compute Memory Analysis ​

Key Metrics ​

GPU Architecture Differences ​

References ​

Memory Hierarchy

GPU Memory Hierarchy Overview

Memory Level Details

1. Global Memory

2. Shared Memory

3. Registers

Memory Access Optimization Strategies

Strategy 1: Data Reuse

Strategy 2: Latency Hiding

Strategy 3: Vectorized Loading

Performance Analysis Tools

Nsight Compute Memory Analysis

Key Metrics

GPU Architecture Differences

References