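Academy
The goal of the Academy section is not only to teach you how to compile the code, but to teach you how to read this repository as a system.
What you will learn here
- How the repository organizes kernels, memory primitives, and hardware capability detection
- How the optimization path evolves from a naive version to a Tensor Core aware implementation
- How to apply sound methodology when evaluating benchmark conclusions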
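As one illustration of what sound benchmark methodology can mean in practice, here is a minimal timing sketch. It assumes the usual precautions: warm-up launches that are not timed, GPU-side timing with CUDA events, and averaging over several iterations. The helper names `time_kernel_ms` and `gemm_gflops` are hypothetical and are not taken from the repository.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: average time per launch in milliseconds.
// `launch` is any callable that enqueues the kernel on the default stream.
template <typename Launch>
float time_kernel_ms(Launch launch, int warmup = 3, int iters = 20) {
    // Warm-up launches absorb one-time costs (module load, clock ramp-up).
    for (int i = 0; i < warmup; ++i) launch();
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / iters;
}

// A GEMM of size M x N x K performs 2*M*N*K floating-point operations
// (one multiply and one add per inner-loop step).
double gemm_gflops(int M, int N, int K, double ms) {
    return 2.0 * M * N * K / (ms * 1e-3) / 1e9;
}
```

A call might look like `time_kernel_ms([&] { my_kernel<<<grid, block>>>(dA, dB, dC, M, N, K); })`, where `my_kernel` stands for whichever kernel is being measured. A throughput number is only meaningful once the kernel output has also been checked against a reference result.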
Recommended reading order
Optimization Path
Progressive optimization from naive to Tensor Core implementation
1. Naive
Direct triple-loop implementation
- Global memory access
- No parallelism optimization
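The naive stage can be sketched as follows. This is a minimal illustration, not the repository's kernel: it assumes row-major `float` matrices, and the names `sgemm_naive` and `run_sgemm_naive` are hypothetical. The two outer loops of the CPU triple loop become the thread grid, and every operand is read straight from global memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

// C = A * B with row-major A (M x K), B (K x N), C (M x N).
// Each thread computes one element of C and runs the inner k-loop itself.
__global__ void sgemm_naive(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        acc += A[row * K + k] * B[k * N + col];  // both reads hit global memory
    }
    C[row * N + col] = acc;
}

// Hypothetical host wrapper; error checking is omitted for brevity.
std::vector<float> run_sgemm_naive(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int M, int N, int K) {
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, sizeof(float) * M * K);
    cudaMalloc(&dB, sizeof(float) * K * N);
    cudaMalloc(&dC, sizeof(float) * M * N);
    cudaMemcpy(dA, A.data(), sizeof(float) * M * K, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), sizeof(float) * K * N, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    sgemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);

    std::vector<float> C(static_cast<size_t>(M) * N);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(C.data(), dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Each element of A and B is re-read from global memory many times, which is the cost the later stages try to reduce.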
2. Tiled
Shared memory blocking
- Block-level tiling
- Shared memory reuse
- Coalesced access
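A minimal sketch of the tiled stage, under the same assumptions as before (row-major `float` matrices). The tile size `TILE = 16` and the names `sgemm_tiled` and `run_sgemm_tiled` are illustrative choices, not values taken from the repository. Each block stages one tile of A and one tile of B in shared memory, so every loaded value is reused `TILE` times.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // assumed tile size; launch with TILE x TILE threads

__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Neighbouring threadIdx.x values read neighbouring addresses (coalesced).
        // Out-of-range elements are padded with zero.
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // tile fully consumed before it is overwritten
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host wrapper; error checking is omitted for brevity.
std::vector<float> run_sgemm_tiled(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int M, int N, int K) {
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, sizeof(float) * M * K);
    cudaMalloc(&dB, sizeof(float) * K * N);
    cudaMalloc(&dC, sizeof(float) * M * N);
    cudaMemcpy(dA, A.data(), sizeof(float) * M * K, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), sizeof(float) * K * N, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    sgemm_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);

    std::vector<float> C(static_cast<size_t>(M) * N);
    cudaMemcpy(C.data(), dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

The two barriers per tile are what the double-buffered variant later reduces to one.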
3. Double Buffer
Pipeline memory access
- Prefetching
- Latency hiding
- Warp synchronization
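One common way to express double buffering is sketched below: two shared-memory buffers per operand, with the next tile prefetched into the idle buffer while the current one is consumed. This is an illustration under assumptions, not the repository's kernel. `TILE`, `load_tiles`, `sgemm_double_buffer`, and `run_sgemm_double_buffer` are hypothetical names, and synchronization here uses the block-level `__syncthreads()` barrier. How much load latency is actually hidden depends on the compiler and the hardware.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // assumed tile size; launch with TILE x TILE threads

// Load tile `t` of A and B into the given shared buffers, zero-padding edges.
__device__ void load_tiles(const float* A, const float* B,
                           float (*As)[TILE], float (*Bs)[TILE],
                           int t, int row, int col, int M, int N, int K) {
    int a_col = t * TILE + threadIdx.x;
    int b_row = t * TILE + threadIdx.y;
    As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
    Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
}

__global__ void sgemm_double_buffer(const float* A, const float* B, float* C,
                                    int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int num_tiles = (K + TILE - 1) / TILE;

    // Prologue: fill buffer 0 with the first tile.
    load_tiles(A, B, As[0], Bs[0], 0, row, col, M, N, K);
    __syncthreads();

    float acc = 0.0f;
    for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1;
        int nxt = cur ^ 1;
        // Issue the loads for the next tile into the idle buffer first,
        // so they can overlap with the arithmetic on the current tile.
        if (t + 1 < num_tiles) {
            load_tiles(A, B, As[nxt], Bs[nxt], t + 1, row, col, M, N, K);
        }
        for (int k = 0; k < TILE; ++k) {
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        }
        // One barrier per tile: the prefetched tile becomes visible and
        // the current buffer becomes free to overwrite.
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host wrapper; error checking is omitted for brevity.
std::vector<float> run_sgemm_double_buffer(const std::vector<float>& A,
                                           const std::vector<float>& B,
                                           int M, int N, int K) {
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, sizeof(float) * M * K);
    cudaMalloc(&dB, sizeof(float) * K * N);
    cudaMalloc(&dC, sizeof(float) * M * N);
    cudaMemcpy(dA, A.data(), sizeof(float) * M * K, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), sizeof(float) * K * N, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    sgemm_double_buffer<<<grid, block>>>(dA, dB, dC, M, N, K);

    std::vector<float> C(static_cast<size_t>(M) * N);
    cudaMemcpy(C.data(), dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

The trade-off is twice the shared memory per block in exchange for one barrier per tile instead of two.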
4. Tensor Core
WMMA hardware acceleration
- WMMA instructions
- Mixed precision
- Maximum throughput
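The Tensor Core stage can be sketched with the CUDA WMMA API (`nvcuda::wmma` from `<mma.h>`). The sketch below is an illustration under several assumptions rather than the repository's kernel: one warp per 16x16 output tile, `half` inputs with `float` accumulation (the mixed-precision part), row-major matrices, and M, N, K all multiples of 16. The names `wmma_gemm`, `tensor_cores_available`, and `run_wmma_gemm` are hypothetical. WMMA needs compute capability 7.0 or newer, so the code should be compiled with something like `-arch=sm_70` or later.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
#include <vector>

// One warp (32 threads) per block; each block computes one 16x16 tile of C.
__global__ void wmma_gemm(const half* A, const half* B, float* C,
                          int M, int N, int K) {
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 700
    using namespace nvcuda;
    int tile_m = blockIdx.y;
    int tile_n = blockIdx.x;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // The last argument is the leading dimension (row stride) in elements.
        wmma::load_matrix_sync(a_frag, A + tile_m * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_n * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c += a * b
    }
    wmma::store_matrix_sync(C + tile_m * 16 * N + tile_n * 16, c_frag, N,
                            wmma::mem_row_major);
#else
    __trap();  // built for an architecture without Tensor Cores
#endif
}

// Hypothetical capability check: Tensor Cores first appeared with compute capability 7.0.
bool tensor_cores_available() {
    int dev = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&dev) != cudaSuccess) return false;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return false;
    return prop.major >= 7;
}

// Hypothetical host wrapper; assumes M, N, K are multiples of 16. Error checks omitted.
std::vector<float> run_wmma_gemm(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 int M, int N, int K) {
    std::vector<half> hA(A.size()), hB(B.size());
    for (size_t i = 0; i < A.size(); ++i) hA[i] = __float2half(A[i]);
    for (size_t i = 0; i < B.size(); ++i) hB[i] = __float2half(B[i]);

    half *dA = nullptr, *dB = nullptr;
    float* dC = nullptr;
    cudaMalloc(&dA, sizeof(half) * M * K);
    cudaMalloc(&dB, sizeof(half) * K * N);
    cudaMalloc(&dC, sizeof(float) * M * N);
    cudaMemcpy(dA, hA.data(), sizeof(half) * M * K, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), sizeof(half) * K * N, cudaMemcpyHostToDevice);

    dim3 grid(N / 16, M / 16);
    wmma_gemm<<<grid, 32>>>(dA, dB, dC, M, N, K);

    std::vector<float> C(static_cast<size_t>(M) * N);
    cudaMemcpy(C.data(), dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Calling `tensor_cores_available()` before `run_wmma_gemm` lets a program fall back to one of the earlier kernels on older GPUs. Because inputs are rounded to `half`, results should be compared against a reference with a tolerance rather than for exact equality, except for values that `half` represents exactly.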