Academy
The academy section teaches readers how to read the repository as a system, not just how to compile it.
What you learn here
- how the repository layers kernels, memory primitives, and hardware capability checks
- how the optimization path progresses from naive kernels to Tensor Core aware variants
- how to evaluate benchmark claims with the right methodology in mind
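As a rough illustration of what a hardware capability check can look like, the sketch below asks the CUDA runtime for a device's compute capability and reports whether WMMA Tensor Core kernels could run on it. The helper names `has_tensor_cores` and `device_supports_wmma` are hypothetical and not taken from the repository. The 7.0 threshold reflects the first GPU generation that exposes Tensor Cores through WMMA.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: Tensor Cores reachable through WMMA first appear at
// compute capability 7.0, so only the major version matters here.
bool has_tensor_cores(int major, int /*minor*/) {
    return major >= 7;
}

// Hypothetical helper: query device `dev` and apply the check above.
// Returns false if the query itself fails (for example, an invalid device index).
bool device_supports_wmma(int dev) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
        return false;
    }
    return has_tensor_cores(prop.major, prop.minor);
}
```

A dispatcher could call `device_supports_wmma(0)` once at startup and fall back to a shared-memory kernel when it returns false.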
Suggested reading order
- Optimization path
Optimization Path
Progressive optimization from the naive kernel to the Tensor Core implementation
1. Naive
Direct triple-loop implementation
- Global memory access
- No parallelism optimization
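A minimal sketch of what a naive stage typically looks like, assuming square row-major `float` matrices. The two outer loops of the triple loop are mapped onto the thread grid and the inner loop stays inside each thread, so every operand is read straight from global memory. The names `gemm_naive` and `run_gemm_naive` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread computes one element of C = A * B (n x n, row-major).
__global__ void gemm_naive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < n; ++k) {
        // Both operands come from global memory on every iteration.
        acc += A[row * n + k] * B[k * n + col];
    }
    C[row * n + col] = acc;
}

// Hypothetical host helper: copies inputs to the device, launches the kernel,
// and returns the result. CUDA error checks are left out to keep the sketch short.
std::vector<float> run_gemm_naive(const std::vector<float>& a,
                                  const std::vector<float>& b, int n) {
    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    gemm_naive<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> c(size_t(n) * n);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

Each thread performs 2n global loads for n multiply-adds, which is the traffic the later stages try to reduce.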
2. Tiled
Shared memory blocking
- Block-level tiling
- Shared memory reuse
- Coalesced access
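The three ideas above can be sketched in one kernel, assuming a 16 x 16 tile and square row-major matrices. Each block stages one tile of A and one tile of B in shared memory, every staged value is reused 16 times, and threads that are adjacent in `threadIdx.x` read adjacent global addresses. `TILE`, `gemm_tiled`, and `run_gemm_tiled` are hypothetical names, and the tile size is an assumption rather than the repository's choice.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // assumed tile width; launch with TILE x TILE threads

__global__ void gemm_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Coalesced loads: consecutive threadIdx.x values touch consecutive addresses.
        // Out-of-range elements are padded with zero so partial tiles stay correct.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // everyone done reading before the tile is overwritten
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper; CUDA error checks omitted for brevity.
std::vector<float> run_gemm_tiled(const std::vector<float>& a,
                                  const std::vector<float>& b, int n) {
    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    gemm_tiled<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> c(size_t(n) * n);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

Compared with a kernel that reads every operand from global memory, each thread here issues two global loads per tile instead of two per multiply-add.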
3. Double Buffer
Pipeline memory access
- Prefetching
- Latency hiding
- Warp synchronization
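One common way to realize this stage is sketched below, under the assumption of a 16 x 16 tile and two shared-memory buffers per operand. While the block computes on the current buffer, each thread has already issued the global loads for the next tile into registers and writes them to the other buffer afterwards, so a single barrier per iteration is enough. The sketch synchronizes with the block-wide `__syncthreads()`, and whether the loads actually overlap with the arithmetic depends on compiler scheduling. All names are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // assumed tile width; launch with TILE x TILE threads

__global__ void gemm_double_buffer(const float* A, const float* B, float* C, int n) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    int numTiles = (n + TILE - 1) / TILE;

    // Prologue: stage tile 0 in buffer 0 (zero padding outside the matrix).
    As[0][ty][tx] = (row < n && tx < n) ? A[row * n + tx] : 0.0f;
    Bs[0][ty][tx] = (ty < n && col < n) ? B[ty * n + col] : 0.0f;
    __syncthreads();

    float acc = 0.0f;
    for (int t = 0; t < numTiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        bool hasNext = (t + 1 < numTiles);

        // Prefetch: issue global loads for tile t+1 into registers first.
        float aNext = 0.0f, bNext = 0.0f;
        if (hasNext) {
            int aCol = (t + 1) * TILE + tx;
            int bRow = (t + 1) * TILE + ty;
            if (row < n && aCol < n) aNext = A[row * n + aCol];
            if (bRow < n && col < n) bNext = B[bRow * n + col];
        }

        // Compute on the current buffer while the loads are in flight.
        for (int k = 0; k < TILE; ++k) {
            acc += As[cur][ty][k] * Bs[cur][k][tx];
        }

        // The other buffer was last read before the previous barrier, so it is free.
        if (hasNext) {
            As[nxt][ty][tx] = aNext;
            Bs[nxt][ty][tx] = bNext;
        }
        __syncthreads();  // next tile visible to all; current tile no longer needed
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper; CUDA error checks omitted for brevity.
std::vector<float> run_gemm_double_buffer(const std::vector<float>& a,
                                          const std::vector<float>& b, int n) {
    size_t bytes = size_t(n) * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    gemm_double_buffer<<<grid, block>>>(dA, dB, dC, n);
    std::vector<float> c(size_t(n) * n);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

The second buffer is what removes the barrier between "finish computing" and "start loading": a single-buffer kernel has to wait for every thread to stop reading before the tile can be overwritten.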
4. Tensor Core
WMMA hardware acceleration
- WMMA instructions
- Mixed precision
- Maximum throughput
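A minimal WMMA sketch, assuming half-precision inputs with `float` accumulation (the mixed-precision part), row-major matrices, and n a multiple of 16. One warp owns one 16 x 16 output tile. It has to be compiled for compute capability 7.0 or newer (for example `nvcc -arch=sm_70`). The names `gemm_wmma` and `run_gemm_wmma` are hypothetical, and the one-warp-per-block launch is chosen for clarity, not throughput.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
#include <vector>

using namespace nvcuda;

// One warp (32 threads) computes one 16x16 tile of C. Assumes n % 16 == 0.
__global__ void gemm_wmma(const half* A, const half* B, float* C, int n) {
    int tileRow = blockIdx.y;
    int tileCol = blockIdx.x;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    for (int k = 0; k < n; k += 16) {
        // Leading dimension is n for both row-major operands.
        wmma::load_matrix_sync(aFrag, A + tileRow * 16 * n + k, n);
        wmma::load_matrix_sync(bFrag, B + k * n + tileCol * 16, n);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // cFrag += aFrag * bFrag
    }
    wmma::store_matrix_sync(C + tileRow * 16 * n + tileCol * 16, cFrag, n,
                            wmma::mem_row_major);
}

// Hypothetical host helper: converts float inputs to half, runs the kernel,
// and returns float results. CUDA error checks omitted for brevity.
std::vector<float> run_gemm_wmma(const std::vector<float>& a,
                                 const std::vector<float>& b, int n) {
    size_t count = size_t(n) * n;
    std::vector<half> ha(count), hb(count);
    for (size_t i = 0; i < count; ++i) {
        ha[i] = __float2half(a[i]);
        hb[i] = __float2half(b[i]);
    }

    half *dA, *dB;
    float* dC;
    cudaMalloc(&dA, count * sizeof(half));
    cudaMalloc(&dB, count * sizeof(half));
    cudaMalloc(&dC, count * sizeof(float));
    cudaMemcpy(dA, ha.data(), count * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hb.data(), count * sizeof(half), cudaMemcpyHostToDevice);

    dim3 block(32);              // exactly one warp per block
    dim3 grid(n / 16, n / 16);   // one block per 16x16 output tile
    gemm_wmma<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> c(count);
    cudaMemcpy(c.data(), dC, count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

Because the inputs are rounded to half precision, results can differ slightly from a pure `float` kernel, which matters when comparing outputs across stages.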