Skip to content

CUDA Optimization

GPU optimization techniques used in Mini-ImagePipe.

Memory Hierarchy

Memory Types

MemoryLocationSpeedScope
GlobalDRAM~800 GB/sAll threads
SharedOn-chip~2 TB/sBlock
ConstantCache~1 TB/sAll threads
RegistersOn-chip~10 TB/sThread

Shared Memory Tiling

For convolution operations, we load tile + halo into shared memory:

Tile Configuration

cpp
// For 5×5 kernel with 16×16 tiles
constexpr int TILE = 16;
constexpr int HALO = 2;  // (kernel_size - 1) / 2

__shared__ float sharedMem[TILE + 2*HALO][TILE + 2*HALO];

Halo Region Loading

+------------------+
|  Halo (top)      |
+--+------------+--+
|H |   Tile     | H|
|a |   Data     | a|
|l |            | l|
|o |            | o|
+--+------------+--+
|  Halo (bottom)   |
+------------------+

Constant Memory

Sobel kernels are stored in constant memory for broadcast:

cpp
__constant__ float c_sobelX[9] = {-1, 0, 1, -2, 0, 2, -1, 0, 1};
__constant__ float c_sobelY[9] = {-1, -2, -1, 0, 0, 0, 1, 2, 1};

// All threads in a warp access the same address → single memory fetch

Stream Concurrency

Independent Execution

cpp
// Stream 0: Operator A
kernelA<<<grid, block, 0, stream0>>>(...);
cudaEventRecord(eventA, stream0);

// Stream 1: Operator B (independent)
kernelB<<<grid, block, 0, stream1>>>(...);

// Stream 0: Operator C (depends on A)
cudaStreamWaitEvent(stream0, eventA);
kernelC<<<grid, block, 0, stream0>>>(...);

Occupancy Optimization

Thread Block Sizing

cpp
// Target: maximize occupancy
// Consider: registers per thread, shared memory per block

dim3 blockSize(16, 16);  // 256 threads per block
dim3 gridSize((width + 15) / 16, (height + 15) / 16);

Register Pressure

cpp
// Reduce register usage for higher occupancy
__launch_bounds__(256, 4)  // 256 threads, min 4 blocks per SM
__global__ void myKernel(...) {
    // ...
}

Performance Guidelines

  1. Coalesced access: Ensure threads access consecutive memory addresses
  2. Shared memory bank conflicts: Pad arrays to avoid bank conflicts
  3. Divergent warps: Minimize branching within warps
  4. Memory transfers: Use pinned memory for async transfers

References

Released under the MIT License.