vs CUTLASS

This document compares Mini-Inference Engine with NVIDIA CUTLASS.


CUTLASS Overview

CUTLASS (CUDA Templates for Linear Algebra Subroutines) is NVIDIA's open-source GEMM template library.

Features

| Feature | Description |
| --- | --- |
| Templated | Highly configurable kernel design |
| Tensor Core | Complete Tensor Core support |
| Code Quality | Production-grade, extremely high learning value |
| Continuous Updates | Follows latest GPU architectures |

Code Structure

cutlass/
├── include/
│   └── cutlass/
│       ├── gemm/           # GEMM core implementation
│       │   ├── device/     # Host-side GEMM entry points
│       │   ├── kernel/     # Kernel implementation
│       │   ├── threadblock/# Threadblock-level operations
│       │   ├── warp/       # Warp-level operations
│       │   └── thread/     # Thread-level operations
│       ├── arch/           # Architecture related
│       ├── transform/      # Data transformation
│       └── epilogue/       # Post-processing (fusion)
└── examples/               # Example code

Relationship with CUTLASS

Complexity Comparison

| Aspect | This Project | CUTLASS |
| --- | --- | --- |
| Lines of Code | ~3000 | ~50000+ |
| Template Usage | Minimal | Heavy |
| Config Options | ~10 | ~100+ |
| Learning Curve | Gentle | Steep |

Feature Comparison

| Feature | This Project | CUTLASS |
| --- | --- | --- |
| FP32 GEMM | | |
| FP16 GEMM | | |
| INT8 GEMM | | |
| Tensor Core | | |
| Batch GEMM | | |
| Operator Fusion | ✅ (Simple) | ✅ (Complete) |
| Multi-GPU | | |

CUTLASS Core Concepts

1. Layered Abstraction

CUTLASS decomposes GEMM into multiple layers:

```cpp
// Pseudocode showing layer structure
namespace cutlass::gemm {

// Threadblock level: computes one tile of C
class GemmKernel {
    // Warp level: computes part of tile
    using WarpIterators = ...;

    // Thread level: computes part within warp
    using ThreadIterators = ...;
};

}
```
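The layering can be made concrete with a small host-side sketch. The code below is not CUTLASS code: it is a plain C++ loop nest, with tile sizes chosen arbitrarily for illustration, that splits the output matrix into "threadblock" tiles, then "warp" tiles, then "thread" tiles. On a GPU each loop level would map to parallel hardware; here the loops simply run one after another.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile sizes for illustration only; these are not CUTLASS defaults.
constexpr int kBlockM = 8, kBlockN = 8;    // "threadblock" tile of C
constexpr int kWarpM = 4, kWarpN = 4;      // "warp" tile inside a block tile
constexpr int kThreadM = 2, kThreadN = 2;  // "thread" tile inside a warp tile

// Row-major C[M x N] = A[M x K] * B[K x N], computed tile by tile.
std::vector<float> tiled_gemm(const std::vector<float>& A,
                              const std::vector<float>& B,
                              int M, int N, int K) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);

    // Threadblock level: one tile of C per iteration.
    for (int bm = 0; bm < M; bm += kBlockM) {
        for (int bn = 0; bn < N; bn += kBlockN) {
            // Warp level: part of the block tile.
            for (int wm = bm; wm < std::min(bm + kBlockM, M); wm += kWarpM) {
                for (int wn = bn; wn < std::min(bn + kBlockN, N); wn += kWarpN) {
                    // Thread level: part of the warp tile.
                    for (int tm = wm; tm < std::min(wm + kWarpM, M); tm += kThreadM) {
                        for (int tn = wn; tn < std::min(wn + kWarpN, N); tn += kThreadN) {
                            // Each "thread" computes its small sub-tile of C.
                            for (int i = tm; i < std::min(tm + kThreadM, M); ++i) {
                                for (int j = tn; j < std::min(tn + kThreadN, N); ++j) {
                                    float acc = 0.0f;
                                    for (int k = 0; k < K; ++k) {
                                        acc += A[i * K + k] * B[k * N + j];
                                    }
                                    C[i * N + j] = acc;
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    return C;
}
```

Because each smaller tile size divides the next larger one, every element of C is visited exactly once, and the std::min bounds handle matrix sizes that are not multiples of the tile sizes.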

2. Template Parameters

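CUTLASS configures a kernel almost entirely through template parameters, so element types, layouts, and the target architecture are all fixed at compile time:

```cpp
#include "cutlass/gemm/device/gemm.h"  // cutlass::gemm::device::Gemm

cutlass::gemm::device::Gemm<
    float,                          // ElementA
    cutlass::layout::RowMajor,      // LayoutA
    float,                          // ElementB
    cutlass::layout::ColumnMajor,   // LayoutB
    float,                          // ElementC
    cutlass::layout::RowMajor,      // LayoutC
    float,                          // ElementAccumulator
    cutlass::arch::OpClassSimt,     // OperatorClass (SIMT = CUDA cores, not Tensor Cores)
    cutlass::arch::Sm80             // ArchTag (target compute capability)
> gemm_op;
```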
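To see why layouts and element types are useful as template parameters, consider the following host-only sketch. Everything in the sketch namespace is hypothetical and written for this illustration; it mirrors the shape of the parameter list above but is not the CUTLASS API. The layout tags turn a (row, column) coordinate into a linear offset, and the compiler generates a specialized loop for each combination.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical names; this is not the CUTLASS API

// Layout tags map a (row, col) coordinate to a linear offset.
struct RowMajor {
    static std::size_t offset(int r, int c, int /*rows*/, int cols) {
        return static_cast<std::size_t>(r) * cols + c;
    }
};
struct ColumnMajor {
    static std::size_t offset(int r, int c, int rows, int /*cols*/) {
        return static_cast<std::size_t>(c) * rows + r;
    }
};

// Element types and layouts are fixed at compile time.
template <typename ElementA, typename LayoutA,
          typename ElementB, typename LayoutB,
          typename ElementC, typename LayoutC,
          typename ElementAccumulator>
struct Gemm {
    // Computes C[M x N] = A[M x K] * B[K x N].
    void operator()(int M, int N, int K,
                    const std::vector<ElementA>& A,
                    const std::vector<ElementB>& B,
                    std::vector<ElementC>& C) const {
        assert(A.size() == static_cast<std::size_t>(M) * K);
        assert(B.size() == static_cast<std::size_t>(K) * N);
        C.assign(static_cast<std::size_t>(M) * N, ElementC{});
        for (int i = 0; i < M; ++i) {
            for (int j = 0; j < N; ++j) {
                ElementAccumulator acc{};
                for (int k = 0; k < K; ++k) {
                    acc += static_cast<ElementAccumulator>(A[LayoutA::offset(i, k, M, K)]) *
                           static_cast<ElementAccumulator>(B[LayoutB::offset(k, j, K, N)]);
                }
                C[LayoutC::offset(i, j, M, N)] = static_cast<ElementC>(acc);
            }
        }
    }
};

}  // namespace sketch
```

Instantiating sketch::Gemm with RowMajor for A and ColumnMajor for B reads both operands along contiguous memory in the inner loop, which is one reason that layout pairing is common in GEMM code.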

3. Epilogue Fusion

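CUTLASS's epilogue mechanism supports operator fusion. The epilogue runs on the accumulators before they are written to global memory, so scaling, bias, or an activation can be applied without launching a separate kernel. LinearCombination computes D = alpha * accumulator + beta * C:

```cpp
using Epilogue = cutlass::epilogue::thread::LinearCombination<
    float,          // ElementOutput: output type
    4,              // Count: elements per vectorized access
    float,          // ElementAccumulator: accumulator type
    float           // ElementCompute: type used for the alpha/beta computation
>;
```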
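The idea behind a fused epilogue can be sketched on the host without any GPU code. The functors below are hypothetical stand-ins written for this illustration (they live in a sketch namespace and are not the CUTLASS classes): one applies the linear combination, the other applies the same scaling followed by ReLU in the same pass over the accumulators.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical names; this is not the CUTLASS API

// Epilogue functor: D = alpha * accumulator + beta * C, applied per element.
struct LinearCombination {
    float alpha;
    float beta;
    float operator()(float accumulator, float source) const {
        return alpha * accumulator + beta * source;
    }
};

// Same scaling followed by ReLU, so the activation needs no separate pass.
struct LinearCombinationRelu {
    float alpha;
    float beta;
    float operator()(float accumulator, float source) const {
        return std::max(0.0f, alpha * accumulator + beta * source);
    }
};

// Apply the epilogue to the accumulators right before they are written out.
template <typename Epilogue>
std::vector<float> apply_epilogue(const std::vector<float>& accumulators,
                                  const std::vector<float>& source,
                                  const Epilogue& epilogue) {
    assert(accumulators.size() == source.size());
    std::vector<float> out(accumulators.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = epilogue(accumulators[i], source[i]);
    }
    return out;
}

}  // namespace sketch
```

Swapping the functor type changes what is fused into the write-back while the main GEMM loop stays untouched, which is the property that makes the epilogue a convenient extension point.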

Design Patterns to Learn

1. Layered Design

This project's four-layer architecture borrows the layered approach from CUTLASS:

Application Layer  →  Benchmark / Tests
Engine Layer       →  InferenceEngine / Tensor
Kernel Layer       →  7-Level GEMM
Infrastructure     →  MemoryPool / StreamManager

2. Parameterized Design

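This project's AutoTuner follows CUTLASS's parameterization approach, but with a handful of runtime values instead of template parameters:

```cpp
struct GemmConfig {
    int BLOCK_M;   // Rows of the C tile computed by one thread block
    int BLOCK_N;   // Columns of the C tile computed by one thread block
    int BLOCK_K;   // Depth of the K slice processed per main-loop step
    int THREAD_M;  // Rows of the sub-tile computed by one thread
    int THREAD_N;  // Columns of the sub-tile computed by one thread
};
```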
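A tuner that searches over such a configuration usually has to discard combinations that cannot launch. The sketch below shows one way this might look; it is not the project's AutoTuner. It assumes each thread computes a THREAD_M x THREAD_N sub-tile, that one FP32 tile of A and one of B sit in shared memory, and it uses 1024 threads per block and 48 KiB of shared memory as default limits. The helper names and the candidate values are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct GemmConfig {
    int BLOCK_M, BLOCK_N, BLOCK_K, THREAD_M, THREAD_N;
};

// Assumes each thread computes a THREAD_M x THREAD_N sub-tile of the block tile.
inline int threads_per_block(const GemmConfig& c) {
    return (c.BLOCK_M / c.THREAD_M) * (c.BLOCK_N / c.THREAD_N);
}

// Assumes one FP32 tile of A (BLOCK_M x BLOCK_K) and one of B (BLOCK_K x BLOCK_N)
// are staged in shared memory at a time.
inline std::size_t shared_memory_bytes(const GemmConfig& c) {
    return sizeof(float) * (static_cast<std::size_t>(c.BLOCK_M) * c.BLOCK_K +
                            static_cast<std::size_t>(c.BLOCK_K) * c.BLOCK_N);
}

// Rejects configurations that do not tile evenly or exceed the assumed limits.
inline bool is_valid(const GemmConfig& c,
                     int max_threads = 1024,
                     std::size_t max_smem_bytes = 48 * 1024) {
    if (c.BLOCK_M <= 0 || c.BLOCK_N <= 0 || c.BLOCK_K <= 0 ||
        c.THREAD_M <= 0 || c.THREAD_N <= 0) {
        return false;
    }
    if (c.BLOCK_M % c.THREAD_M != 0 || c.BLOCK_N % c.THREAD_N != 0) {
        return false;
    }
    return threads_per_block(c) <= max_threads &&
           shared_memory_bytes(c) <= max_smem_bytes;
}

// Enumerates a small hypothetical search space and keeps the valid candidates.
inline std::vector<GemmConfig> candidate_configs() {
    std::vector<GemmConfig> out;
    for (int bm : {32, 64, 128})
        for (int bn : {32, 64, 128})
            for (int bk : {8, 16})
                for (int tm : {2, 4, 8})
                    for (int tn : {2, 4, 8}) {
                        GemmConfig c{bm, bn, bk, tm, tn};
                        if (is_valid(c)) out.push_back(c);
                    }
    return out;
}
```

Each surviving candidate would then be benchmarked on the target GPU, and the fastest one kept for that problem size.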

3. Performance Analysis

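Learn from CUTLASS's profiling workflow. CUTLASS ships a command-line profiler, cutlass_profiler (built from tools/profiler), which runs the compiled kernels for a given problem size and reports runtime and throughput:

```shell
# Run from the CUTLASS build directory after building the profiler
./tools/profiler/cutlass_profiler --operation=Gemm --m=1024 --n=1024 --k=1024
```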
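GEMM performance is usually reported as achieved GFLOP/s, which follows from the problem size and the measured runtime: an M x N x K GEMM performs 2*M*N*K floating-point operations. The helpers below are a minimal host-side sketch with hypothetical names. The timer uses std::chrono and is only meaningful for synchronous host code; for a GPU kernel the measurement would need CUDA events or a device synchronization before the clock is stopped.

```cpp
#include <cassert>
#include <chrono>

// An M x N x K GEMM performs 2*M*N*K floating-point operations
// (one multiply and one add per inner-product term).
inline double gemm_gflops(int M, int N, int K, double seconds) {
    assert(seconds > 0.0);
    return 2.0 * M * N * K / seconds / 1e9;
}

// Times a callable on the host: one warm-up call, then the average
// wall-clock time in seconds over `iterations` calls.
template <typename F>
double time_seconds(F&& fn, int iterations = 10) {
    assert(iterations > 0);
    fn();  // warm-up run, excluded from the measurement
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / iterations;
}
```

For example, a 1000 x 1000 x 1000 GEMM that takes 2 ms corresponds to 2e9 operations in 0.002 s, or about 1000 GFLOP/s.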

Learning Path

Week 1-2: This Project

    │  Understand GEMM optimization basics
    │  Master shared memory, register blocking


Week 3-4: CUTLASS Examples

    │  Read basic examples
    │  Understand template parameters


Week 5-6: CUTLASS Source

    │  Deep dive into kernel implementation
    │  Learn warp-level operations


Week 7+:  CUTLASS Advanced Features

    │  Tensor Core
    │  Epilogue fusion
    │  Multi-GPU parallelism

CUTLASS Learning Resources

Official Resources

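| Example | Learning Focus |
| --- | --- |
| 00_basic_gemm | Basic usage |
| 10_planar_complex | Planar complex GEMM (separate real and imaginary planes) |
| 15_gemm_universal | Universal interface |
| 24_gemm_grouped | Grouped GEMM |
| 12_gemm_bias_relu | Operator fusion (bias and ReLU in the epilogue) |

Check the examples/ directory of your CUTLASS checkout for the exact directory names.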

Summary

| Aspect | This Project | CUTLASS |
| --- | --- | --- |
| Positioning | Introductory teaching | Production library |
| Complexity | Low | High |
| Learning Curve | Gentle | Steep |
| Production Ready | No (educational use) | Yes |

Best Practice: Work through this project first as prerequisite material for CUTLASS; it makes the CUTLASS design philosophy easier to understand.

MIT License | CUDA GEMM optimization tutorial