Getting Started
Welcome to HPC-AI-Optimization-Lab — a comprehensive educational laboratory for high-performance CUDA kernels.
What You'll Learn
This documentation covers the full spectrum of CUDA kernel optimization, from fundamental memory access patterns to cutting-edge Tensor Core usage:
- Memory Optimization - Coalesced access, vectorization, shared memory patterns
- Reduction Operations - Warp shuffle, block reduction, online algorithms
- GEMM Optimization - 7-step journey from naive kernels to Tensor Core-oriented implementations
- FlashAttention - IO-aware attention mechanism
- CUDA 13 Features - Experimental Hopper architecture features
Prerequisites
- CUDA Toolkit 12.4+
- CMake 3.24+
- C++20 compatible compiler
- NVIDIA GPU (Compute Capability 7.0+)
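Before configuring the build, you can sanity-check the toolchain with the standard CUDA and CMake utilities. This is a minimal sketch; the `compute_cap` query field requires a reasonably recent NVIDIA driver.

```bash
# CUDA Toolkit version (should report 12.4 or newer)
nvcc --version

# CMake version (should report 3.24 or newer)
cmake --version

# Name and compute capability of the installed GPU (needs a recent driver)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```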
Quick Start
Clone and Build
```bash
# Clone the repository
git clone https://github.com/LessUp/hpc-ai-optimization-lab.git
cd hpc-ai-optimization-lab

# Configure and build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run tests
ctest --test-dir build --output-on-failure
```
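If CMake does not pick the right GPU architecture automatically, you can pin it at configure time via the standard `CMAKE_CUDA_ARCHITECTURES` variable. The value `90` below is only an example (Hopper); replace it with your GPU's compute capability.

```bash
# Example: build for Hopper (compute capability 9.0); adjust for your GPU
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build build -j$(nproc)
```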
Build with Examples
```bash
cmake -S . -B build -DBUILD_EXAMPLES=ON
cmake --build build --target relu_example gemm_benchmark

./build/examples/relu_example
./build/examples/gemm_benchmark
```
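Once the benchmark runs, you can inspect kernel behavior with NVIDIA's profilers. The repository does not ship profiling scripts, so treat this as a generic sketch assuming `ncu` (Nsight Compute) and `nsys` (Nsight Systems) are on your PATH.

```bash
# Kernel-level metrics with Nsight Compute (full set is slow; limit profiled launches)
ncu --set full --launch-count 10 ./build/examples/gemm_benchmark

# Application timeline with Nsight Systems
nsys profile -o gemm_report ./build/examples/gemm_benchmark
```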
Python Bindings
```bash
cmake -S . -B build -DBUILD_PYTHON_BINDINGS=ON
cmake --build build

export PYTHONPATH="$(pwd)/build/python:$PYTHONPATH"
python examples/python/basic_usage.py
```
Documentation Structure
| Section | Description | Difficulty |
|---|---|---|
| Memory Optimization | Coalesced access, vectorization, shared memory | ⭐⭐ |
| Reduction Optimization | Warp shuffle, online softmax, LayerNorm | ⭐⭐⭐ |
| GEMM Optimization | 7-step matrix multiplication journey | ⭐⭐⭐⭐ |
| FlashAttention | IO-aware attention mechanism | ⭐⭐⭐⭐ |
| CUDA 13 Features | Hopper architecture: TMA, Clusters, FP8 | ⭐⭐⭐⭐⭐ |
API References
- API Reference - consolidated C++ / CUDA / Python reference notes
- Architecture Overview - Design patterns and module organization
Next Steps
Choose your learning path:
- Beginner → Start with Memory Optimization
- Intermediate → Jump to GEMM Optimization
- Advanced → Explore FlashAttention or CUDA 13 Features