
Academy

The academy section teaches readers how to read the repository as a system, not just how to compile it.

What you learn here

  • how the repository layers kernels, memory primitives, and hardware capability checks
  • how the optimization path progresses from naive kernels to Tensor Core aware variants
  • how to evaluate benchmark claims with the right methodology in mind

Suggested reading order

  1. Quick start
  2. Architecture lessons
  3. Modern C++ and CUDA
  4. Benchmarking
  5. Examples

Optimization path


Progressive optimization from naive to Tensor Core implementation

  1. Naive
  2. Tiled
  3. Double Buffer
  4. Tensor Core

1. Naive

Direct triple loop implementation

  • Global memory access
  • No parallelism optimization

5%
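
As a point of reference for the naive stage, here is a minimal sketch of what a direct GEMM kernel might look like. It is not taken from the repository: the names `gemm_naive` and `run_gemm_naive`, the row-major layout, and the 16 x 16 thread block are all assumptions made for illustration. The two outer loops of the CPU triple loop are mapped onto the thread grid, and every operand is read straight from global memory.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Naive GEMM: C = A * B with row-major A (M x K), B (K x N), C (M x N).
// The two outer loops of the CPU triple loop become the thread grid;
// each thread runs the inner k loop for one output element.
__global__ void gemm_naive(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        // Both operands are fetched from global memory on every iteration.
        acc += A[row * K + k] * B[k * N + col];
    }
    C[row * N + col] = acc;
}

// Hypothetical host helper (not from the repository). Error checks are
// omitted for brevity; real code should test every CUDA return value.
std::vector<float> run_gemm_naive(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  int M, int N, int K) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> C(static_cast<std::size_t>(M) * N);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // assumed block shape
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    gemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);

    // The blocking copy on the default stream also waits for the kernel.
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Each thread re-reads a full row of A and a full column of B, so the same values are fetched from global memory many times across the block. That redundancy is what the later stages try to remove.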

2. Tiled

Shared memory blocking

  • Block-level tiling
  • Shared memory reuse
  • Coalesced access

45%
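
The three ideas listed for the tiled stage can be sketched in one kernel. This is an illustrative version under stated assumptions, not the repository's code: `TILE = 16`, the names `gemm_tiled` and `run_gemm_tiled`, and the row-major layout are all hypothetical. Each block stages one tile of A and one tile of B in shared memory, then every thread reuses those staged values `TILE` times.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile edge; blocks are TILE x TILE threads

// Tiled GEMM: C = A * B, row-major, arbitrary M, N, K (edges are zero-padded).
__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // threadIdx.x indexes the fastest-varying address in both loads,
        // so neighbouring threads touch neighbouring words (coalesced).
        As[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully staged before anyone reads it

        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // everyone done reading before the next overwrite
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host helper (not from the repository); error checks omitted.
std::vector<float> run_gemm_tiled(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  int M, int N, int K) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> C(static_cast<std::size_t>(M) * N);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    gemm_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);

    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

With this layout each value loaded from global memory is used by `TILE` threads instead of one, at the cost of two block-wide barriers per tile.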

3. Double Buffer

Pipeline memory access

  • Prefetching
  • Latency hiding
  • Warp synchronization

75%
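
One way to picture the double-buffer stage is a tiled kernel with two shared-memory buffers per operand: while the block computes on the current buffer, it issues the global loads for the next tile into the other one. The sketch below is an assumption-laden illustration, not the repository's kernel. The names `load_tiles`, `gemm_double_buffer`, and `run_gemm_double_buffer` are hypothetical, and it uses the block-level `__syncthreads()` barrier rather than any warp-level primitive the repository may rely on.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile edge; blocks are TILE x TILE threads

// Stage tile t of A and B into the given shared buffers (zero-padded edges).
__device__ void load_tiles(const float* A, const float* B,
                           float (*As)[TILE], float (*Bs)[TILE],
                           int t, int row, int col, int M, int N, int K) {
    int aCol = t * TILE + threadIdx.x;
    int bRow = t * TILE + threadIdx.y;
    As[threadIdx.y][threadIdx.x] =
        (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
    Bs[threadIdx.y][threadIdx.x] =
        (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
}

__global__ void gemm_double_buffer(const float* A, const float* B, float* C,
                                   int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int numTiles = (K + TILE - 1) / TILE;
    float acc = 0.0f;

    load_tiles(A, B, As[0], Bs[0], 0, row, col, M, N, K);  // prologue
    __syncthreads();

    for (int t = 0; t < numTiles; ++t) {
        int cur = t & 1;
        int nxt = cur ^ 1;
        // Prefetch: issue the loads for tile t+1 before computing on tile t,
        // giving the hardware a chance to overlap the two.
        if (t + 1 < numTiles) {
            load_tiles(A, B, As[nxt], Bs[nxt], t + 1, row, col, M, N, K);
        }
        for (int k = 0; k < TILE; ++k) {
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        }
        // Single barrier per tile: publishes buffer nxt and retires buffer cur.
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host helper (not from the repository); error checks omitted.
std::vector<float> run_gemm_double_buffer(const std::vector<float>& A,
                                          const std::vector<float>& B,
                                          int M, int N, int K) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> C(static_cast<std::size_t>(M) * N);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    gemm_double_buffer<<<grid, block>>>(dA, dB, dC, M, N, K);

    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Compared with a single-buffer tiled kernel, this version doubles the shared-memory footprint but needs only one barrier per tile, and the loads for tile t+1 no longer have to wait for the arithmetic on tile t to finish. How much latency is actually hidden depends on the compiler's scheduling and the GPU, so it is worth measuring rather than assuming.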

4. Tensor Core

WMMA hardware acceleration

  • WMMA instructions
  • Mixed precision
  • Maximum throughput

92%
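
For the Tensor Core stage, a minimal WMMA sketch may help show what "mixed precision" means in practice: FP16 inputs, FP32 accumulation. The code below is illustrative only and makes several assumptions that the page does not state: one warp per 16 x 16 output tile, M, N, and K all multiples of 16, row-major matrices, and a GPU of compute capability 7.0 or newer (compile with something like `nvcc -arch=sm_70`). The names `gemm_wmma` and `run_gemm_wmma` are hypothetical.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// One warp (32 threads) per block computes one 16x16 tile of C.
// Assumes M, N, K are multiples of 16 and all matrices are row-major.
__global__ void gemm_wmma(const half* A, const half* B, float* C,
                          int M, int N, int K) {
    int tileRow = blockIdx.y;
    int tileCol = blockIdx.x;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Leading dimensions are the row strides of the full matrices.
        wmma::load_matrix_sync(aFrag, A + tileRow * 16 * K + k, K);
        wmma::load_matrix_sync(bFrag, B + k * N + tileCol * 16, N);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // FP16 in, FP32 accumulate
    }
    wmma::store_matrix_sync(C + tileRow * 16 * N + tileCol * 16, cFrag, N,
                            wmma::mem_row_major);
}

// Hypothetical host helper (not from the repository); error checks omitted.
// Inputs are given as float and converted to half on the host.
std::vector<float> run_gemm_wmma(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 int M, int N, int K) {
    assert(M % 16 == 0 && N % 16 == 0 && K % 16 == 0);
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);

    std::vector<half> hA(A.size()), hB(B.size());
    for (std::size_t i = 0; i < A.size(); ++i) hA[i] = __float2half(A[i]);
    for (std::size_t i = 0; i < B.size(); ++i) hB[i] = __float2half(B[i]);
    std::vector<float> C(static_cast<std::size_t>(M) * N);

    half *dA = nullptr, *dB = nullptr;
    float* dC = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(half));
    cudaMalloc(&dB, hB.size() * sizeof(half));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(half), cudaMemcpyHostToDevice);

    dim3 block(32);              // exactly one warp
    dim3 grid(N / 16, M / 16);   // one block per output tile
    gemm_wmma<<<grid, block>>>(dA, dB, dC, M, N, K);

    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

This sketch deliberately leaves out the shared-memory tiling and double buffering from the earlier stages so that the WMMA calls stay visible. A kernel aiming for the throughput this stage advertises would normally combine all three, with several warps per block.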

Released under the MIT License.