vs CUTLASS
This document compares Mini-Inference Engine with NVIDIA CUTLASS.
CUTLASS Overview
CUTLASS (CUDA Templates for Linear Algebra Subroutines) is NVIDIA's open-source GEMM template library.
Features
| Feature | Description |
|---|---|
| Templated | Highly configurable kernel design |
| Tensor Core | Complete Tensor Core support |
| Code Quality | Production-grade, extremely high learning value |
| Continuous Updates | Follows latest GPU architectures |
Code Structure
```
cutlass/
├── include/
│   └── cutlass/
│       ├── gemm/           # GEMM core implementation
│       │   ├── kernel/     # Kernel implementation
│       │   ├── thread/     # Thread-level operations
│       │   └── warp/       # Warp-level operations
│       ├── arch/           # Architecture-specific code
│       ├── transform/      # Data transformation
│       └── epilogue/       # Post-processing (fusion)
└── examples/               # Example code
```
Relationship with CUTLASS
Complexity Comparison
| Aspect | This Project | CUTLASS |
|---|---|---|
| Lines of Code | ~3,000 | 50,000+ |
| Template Usage | Minimal | Heavy |
| Config Options | ~10 | 100+ |
| Learning Curve | Gentle | Steep |
Feature Comparison
| Feature | This Project | CUTLASS |
|---|---|---|
| FP32 GEMM | ✅ | ✅ |
| FP16 GEMM | ✅ | ✅ |
| INT8 GEMM | ❌ | ✅ |
| Tensor Core | ❌ | ✅ |
| Batch GEMM | ✅ | ✅ |
| Operator Fusion | ✅ (Simple) | ✅ (Complete) |
| Multi-GPU | ❌ | ✅ |
CUTLASS Core Concepts
1. Layered Abstraction
CUTLASS decomposes GEMM into multiple layers:
```cpp
// Pseudocode showing the layer structure
namespace cutlass::gemm {
// Threadblock level: computes one tile of C
class GemmKernel {
  // Warp level: computes part of the threadblock tile
  using WarpIterators = ...;
  // Thread level: computes part of the warp tile
  using ThreadIterators = ...;
};
}
```
2. Template Parameters
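To connect the layered decomposition above with the template parameters discussed in this subsection, here is a minimal CPU-only sketch in plain C++. It is not CUTLASS code: the function name `tiled_gemm` and its four tile-size parameters are illustrative assumptions. The outer loops stand in for threadblock tiles and the inner loops for per-thread tiles; the warp level is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU analogue of a tiled GEMM (C = A * B, all row-major).
// BLOCK_M x BLOCK_N plays the role of a threadblock tile and
// THREAD_M x THREAD_N the role of a per-thread register tile.
template <int BLOCK_M, int BLOCK_N, int THREAD_M, int THREAD_N>
void tiled_gemm(int M, int N, int K,
                const std::vector<float>& A,  // M x K
                const std::vector<float>& B,  // K x N
                std::vector<float>& C) {      // M x N
  static_assert(BLOCK_M % THREAD_M == 0 && BLOCK_N % THREAD_N == 0,
                "thread tile must evenly divide the block tile");
  // Sketch only: no edge handling, so sizes must be multiples of the tile.
  assert(M % BLOCK_M == 0 && N % BLOCK_N == 0);
  assert(A.size() == static_cast<std::size_t>(M) * K);
  assert(B.size() == static_cast<std::size_t>(K) * N);
  assert(C.size() == static_cast<std::size_t>(M) * N);

  for (int bm = 0; bm < M; bm += BLOCK_M)      // "threadblock" tiles
    for (int bn = 0; bn < N; bn += BLOCK_N)
      for (int tm = bm; tm < bm + BLOCK_M; tm += THREAD_M)    // "thread" tiles
        for (int tn = bn; tn < bn + BLOCK_N; tn += THREAD_N) {
          float acc[THREAD_M][THREAD_N] = {};  // register-style accumulators
          for (int k = 0; k < K; ++k)
            for (int i = 0; i < THREAD_M; ++i)
              for (int j = 0; j < THREAD_N; ++j)
                acc[i][j] += A[(tm + i) * K + k] * B[k * N + (tn + j)];
          // Write the finished thread tile back to C.
          for (int i = 0; i < THREAD_M; ++i)
            for (int j = 0; j < THREAD_N; ++j)
              C[(tm + i) * N + (tn + j)] = acc[i][j];
        }
}
```

Instantiating `tiled_gemm<4, 4, 2, 2>` and `tiled_gemm<2, 2, 1, 1>` compiles two distinct specializations from the same source. Generating kernel variants from compile-time parameters in this way is the idea behind the CUTLASS declaration that follows.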
CUTLASS configures each kernel through a long list of template parameters:
```cpp
#include "cutlass/gemm/device/gemm.h"

using Gemm = cutlass::gemm::device::Gemm<
    float,                         // ElementA
    cutlass::layout::RowMajor,     // LayoutA
    float,                         // ElementB
    cutlass::layout::ColumnMajor,  // LayoutB
    float,                         // ElementC
    cutlass::layout::RowMajor,     // LayoutC
    float,                         // ElementAccumulator
    cutlass::arch::OpClassSimt,    // OperatorClass (SIMT CUDA cores, no Tensor Cores)
    cutlass::arch::Sm80            // ArchTag (target architecture)
>;

Gemm gemm_op;
```
3. Epilogue Fusion
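As a concept sketch in plain C++ (not CUTLASS code), epilogue fusion means applying the post-processing step to each accumulator as it is written out, instead of launching a second pass over the output matrix. The names `LinearCombinationSketch`, `LinearCombinationReluSketch`, and `gemm_with_epilogue` are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical epilogue functor: D = alpha * accumulator + beta * C.
struct LinearCombinationSketch {
  float alpha;
  float beta;
  float operator()(float acc, float c) const { return alpha * acc + beta * c; }
};

// Same linear combination followed by ReLU, fused into one step.
struct LinearCombinationReluSketch {
  float alpha;
  float beta;
  float operator()(float acc, float c) const {
    return std::max(0.0f, alpha * acc + beta * c);
  }
};

// Naive row-major GEMM that applies the epilogue while storing each element,
// so D is written exactly once and no separate post-processing pass is needed.
template <typename Epilogue>
void gemm_with_epilogue(int M, int N, int K,
                        const std::vector<float>& A,  // M x K
                        const std::vector<float>& B,  // K x N
                        const std::vector<float>& C,  // M x N (source)
                        std::vector<float>& D,        // M x N (output)
                        const Epilogue& epilogue) {
  assert(A.size() == static_cast<std::size_t>(M) * K);
  assert(B.size() == static_cast<std::size_t>(K) * N);
  assert(C.size() == static_cast<std::size_t>(M) * N);
  assert(D.size() == C.size());
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
      D[i * N + j] = epilogue(acc, C[i * N + j]);  // fused epilogue
    }
}
```

Swapping the functor type changes the fused operation without touching the main loop, which is the same separation of concerns the CUTLASS epilogue provides.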
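CUTLASS's epilogue mechanism supports operator fusion:
```cpp
#include "cutlass/epilogue/thread/linear_combination.h"

// Computes D = alpha * accumulator + beta * C as each output is stored
using Epilogue = cutlass::epilogue::thread::LinearCombination<
    float,  // ElementOutput: output type
    4,      // Count: elements per vectorized access
    float,  // ElementAccumulator: accumulator type
    float   // ElementCompute: type used for the alpha/beta arithmetic
>;
```
Design Patterns to Learn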
1. Layered Design
This project's four-layer architecture is modeled on CUTLASS's layered design:
```
Application Layer  → Benchmark / Tests
Engine Layer       → InferenceEngine / Tensor
Kernel Layer       → 7-Level GEMM
Infrastructure     → MemoryPool / StreamManager
```
2. Parameterized Design
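This project's AutoTuner follows CUTLASS's parameterization approach and describes each kernel variant with a small configuration struct:
```cpp
struct GemmConfig {
    int BLOCK_M;   // Block tile size along M
    int BLOCK_N;   // Block tile size along N
    int BLOCK_K;   // Block tile size along K
    int THREAD_M;  // Per-thread tile size along M
    int THREAD_N;  // Per-thread tile size along N
};
```
3. Performance Analysis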
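Performance analysis and the `GemmConfig` parameterization above meet when candidate configurations are filtered and timed. The following is a minimal host-side sketch under stated assumptions. The helpers `is_valid`, `gemm_gflops`, and `time_best_of` are hypothetical names, not part of this project or CUTLASS. The validity rule assumes each thread computes a `THREAD_M x THREAD_N` tile, and `std::chrono` stands in for CUDA event timing so that the block runs without a GPU.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Mirrors the GemmConfig fields shown earlier in this document.
struct GemmConfig {
  int BLOCK_M, BLOCK_N, BLOCK_K, THREAD_M, THREAD_N;
};

// Assumed validity rule: thread tiles must evenly divide the block tile, and
// the resulting thread count must fit in one block (1024 on current CUDA GPUs).
inline bool is_valid(const GemmConfig& c, int max_threads = 1024) {
  if (c.BLOCK_M <= 0 || c.BLOCK_N <= 0 || c.BLOCK_K <= 0 ||
      c.THREAD_M <= 0 || c.THREAD_N <= 0)
    return false;
  if (c.BLOCK_M % c.THREAD_M != 0 || c.BLOCK_N % c.THREAD_N != 0) return false;
  const int threads = (c.BLOCK_M / c.THREAD_M) * (c.BLOCK_N / c.THREAD_N);
  return threads <= max_threads;
}

// A GEMM performs 2 * M * N * K floating-point operations
// (one multiply and one add per inner-product term).
inline double gemm_gflops(int M, int N, int K, double seconds) {
  assert(seconds > 0.0);
  return 2.0 * M * N * K / seconds / 1e9;
}

// Runs `run` several times and returns the fastest wall-clock time in seconds.
// Taking the minimum reduces noise from warm-up and scheduling jitter.
template <typename F>
double time_best_of(F&& run, int repeats) {
  assert(repeats > 0);
  double best = std::numeric_limits<double>::infinity();
  for (int r = 0; r < repeats; ++r) {
    const auto start = std::chrono::steady_clock::now();
    run();
    const auto stop = std::chrono::steady_clock::now();
    const double s = std::chrono::duration<double>(stop - start).count();
    if (s < best) best = s;
  }
  return best;
}
```

An auto-tuning loop would skip every candidate for which `is_valid` returns false, time the remaining kernels with `time_best_of`, and keep the configuration with the highest `gemm_gflops` value.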
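Learn how CUTLASS profiles kernels. CUTLASS ships a standalone command-line profiler, `cutlass_profiler`, which is run against the built library rather than instantiated as a C++ class in user code:
```bash
# Run from the CUTLASS build directory after building the profiler
./tools/profiler/cutlass_profiler --operation=Gemm --m=1024 --n=1024 --k=1024
```
Learning Path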
Recommended Order
```
Week 1-2: This Project
│
│  Understand GEMM optimization basics
│  Master shared memory, register blocking
│
▼
Week 3-4: CUTLASS Examples
│
│  Read basic examples
│  Understand template parameters
│
▼
Week 5-6: CUTLASS Source
│
│  Deep dive into kernel implementation
│  Learn warp-level operations
│
▼
Week 7+: CUTLASS Advanced Features
│
│  Tensor Core
│  Epilogue fusion
│  Multi-GPU parallelism
```
CUTLASS Learning Resources
Official Resources
Recommended Examples
| Example | Learning Focus |
|---|---|
| 00_basic_gemm | Basic usage |
| 10_planar_complex | Planar complex layouts |
| 15_gemm_universal | Universal interface |
| 24_gemm_grouped | Grouped GEMM |
| 27_gemm_with_epilogue | Operator fusion |
Summary
| Aspect | This Project | CUTLASS |
|---|---|---|
| Positioning | Introductory teaching | Production library |
| Complexity | Low | High |
| Learning Curve | Gentle | Steep |
| Production Ready | No (educational use) | Yes |
Best Practice: Use this project as prerequisite learning material for CUTLASS. It builds the GEMM optimization background needed to understand CUTLASS's design philosophy.