Skip to content

Memory Hierarchy

This document explains GPU memory hierarchy and its impact on GEMM optimization.


GPU Memory Hierarchy Overview

┌─────────────────────────────────────────────────────────────┐
│                      CPU Host Memory                         │
│                    (DDR4/DDR5, Large Capacity)               │
│                      Latency: ~100-300 ns                    │
└─────────────────────────────────────────────────────────────┘

                              │ PCIe 4.0/5.0 (~32 GB/s)

┌─────────────────────────────────────────────────────────────┐
│                    GPU Global Memory                         │
│                    (GDDR6/HBM, 16-80 GB)                     │
│                      Latency: ~400-800 cycles                │
│                      Bandwidth: 400-900 GB/s                 │
└─────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│                      L2 Cache                                │
│                    (6-50 MB, Shared)                         │
│                      Latency: ~100-200 cycles                │
└─────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│                    Shared Memory                             │
│                    (48-164 KB/SM, User Managed)              │
│                      Latency: ~20-30 cycles                  │
│                      Bandwidth: ~Several TB/s                │
└─────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────┐
│                   Register File                              │
│                    (64K 32-bit/SM)                           │
│                      Latency: ~1 cycle                       │
│                      Bandwidth: ~Dozens of TB/s              │
└─────────────────────────────────────────────────────────────┘

Memory Level Details

1. Global Memory

Characteristics:

  • Largest capacity (16-80 GB)
  • Highest latency (400-800 cycles)
  • Accessible by all threads
  • Communicates with Host via PCIe

Access Pattern:

cuda
// Global memory access
__global__ void kernel(float* global_data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = global_data[idx];  // Global memory load
    global_data[idx] = val * 2.0f; // Global memory store
}

Coalesced Access:

When threads in a warp access consecutive addresses, accesses can be coalesced:

cuda
// ✅ Coalesced: threads access consecutive addresses
int idx = threadIdx.x;
float val = data[idx];  // Coalesced into one 128-byte transaction

// ❌ Non-coalesced: threads access strided addresses
int idx = threadIdx.x * stride;  // stride > 1
float val = data[idx];  // Multiple independent transactions

2. Shared Memory

Characteristics:

  • User-managed cache
  • Low latency (~20-30 cycles)
  • High bandwidth (~Several TB/s)
  • Shared within thread block

Bank Conflict:

Shared memory is divided into 32 banks, each 4 bytes wide:

cuda
__shared__ float data[32][32];

// ✅ No bank conflict: different banks
float val = data[threadIdx.y][threadIdx.x];

// ❌ Bank conflict: same bank accessed by warp
float val = data[threadIdx.x][threadIdx.x];  // 32-way conflict

// ✅ Add padding to avoid conflict
__shared__ float data[32][33];  // +1 padding
float val = data[threadIdx.x][threadIdx.x];  // No conflict

Application in GEMM:

cuda
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

// Load to shared memory
As[ty][tx] = A[row * K + k + tx];
Bs[ty][tx] = B[(k + ty) * N + col];

__syncthreads();  // Block synchronization

// Compute from shared memory
for (int i = 0; i < BLOCK_SIZE; i++) {
    sum += As[ty][i] * Bs[i][tx];
}

3. Registers

Characteristics:

  • Fastest storage (~1 cycle)
  • Thread-private
  • Limited quantity (64K 32-bit/SM)
  • Spilling degrades performance

Register Blocking Optimization:

cuda
// Each thread computes TILE_M × TILE_N output block
float regA[TILE_M];  // Register stores A's row
float regB[TILE_N];  // Register stores B's column
float regC[TILE_M][TILE_N] = {0};  // Accumulator

for (int k = 0; k < K; k += BLOCK_SIZE) {
    // Load to registers
    for (int i = 0; i < TILE_M; i++) {
        regA[i] = As[ty * TILE_M + i][tk];
    }
    for (int j = 0; j < TILE_N; j++) {
        regB[j] = Bs[tk][tx * TILE_N + j];
    }
    
    // Compute in registers (no memory access)
    for (int i = 0; i < TILE_M; i++) {
        for (int j = 0; j < TILE_N; j++) {
            regC[i][j] += regA[i] * regB[j];
        }
    }
}

Memory Access Optimization Strategies

Strategy 1: Data Reuse

Principle: Each loaded data participates in multiple computations

Naive GEMM:
- Each output element needs 2K memory accesses
- Total accesses: 2 × M × N × K

Tiled GEMM:
- Each data block loaded once, reused BLOCK_SIZE times
- Total accesses reduced to: 2 × M × N × K / BLOCK_SIZE

Strategy 2: Latency Hiding

Principle: Overlap computation with memory access

cuda
// Double buffering: load and compute overlap
__shared__ float buffer[2][BLOCK_SIZE][BLOCK_SIZE];
int current = 0, next = 1;

for (int k = 0; k < K; k += BLOCK_SIZE) {
    // Asynchronously load next block
    load_async(buffer[next], k + BLOCK_SIZE);
    
    // Compute current block
    compute(buffer[current]);
    
    // Synchronize and swap buffers
    __syncthreads();
    swap(current, next);
}

Strategy 3: Vectorized Loading

Principle: Use SIMD instructions to load multiple data at once

cuda
// Use float4 to load 4 floats at once
float4* A4 = reinterpret_cast<float4*>(A_shared);
float4 a = A4[tx];  // Load 4 floats at once

// Unrolled computation
sum0 += a.x * b.x;
sum1 += a.y * b.y;
sum2 += a.z * b.z;
sum3 += a.w * b.w;

Performance Analysis Tools

Nsight Compute Memory Analysis

bash
# Analyze memory throughput
ncu --metrics gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed \
    ./benchmark

# Analyze shared memory bank conflicts
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum \
    ./benchmark

Key Metrics

MetricMeaningTarget
DRAM ThroughputGlobal memory throughput> 80%
Shared Memory Bank ConflictsBank conflict countNear 0
L2 Cache Hit RateL2 cache hit rateHigher is better
Register UsageRegisters per thread< 255

GPU Architecture Differences

ArchitectureShared Memory/SMRegisters/SML2 CacheFeatures
Volta (V100)96 KB64K6 MBIntroduced Tensor Core
Ampere (A100)164 KB64K40 MBLarge L2, TF32
Hopper (H100)228 KB64K50 MBFP8, TMA
Ada (RTX 4090)100 KB64K6 MBConsumer

Optimization Recommendations:

  • Adjust BLOCK_SIZE for target architecture
  • Leverage architecture features (e.g., Tensor Core)
  • Use AutoTuner to automatically search optimal parameters

References

  1. CUDA C Best Practices Guide - NVIDIA
  2. Programming Massively Parallel Processors - Kirk & Hwu
  3. CUDA Optimization Guide - NVIDIA

MIT License | CUDA GEMM optimization tutorial