vs CUTLASS

This document compares Mini-Inference Engine with NVIDIA CUTLASS.

CUTLASS Overview

CUTLASS (CUDA Templates for Linear Algebra Subroutines) is NVIDIA's open-source GEMM template library.

Features

Feature	Description
Templated	Highly configurable kernel design
Tensor Core	Complete Tensor Core support
Code Quality	Production-grade, extremely high learning value
Continuous Updates	Follows latest GPU architectures

Code Structure

cutlass/
├── include/
│   └── cutlass/
│       ├── gemm/            # GEMM core implementation
│       │   ├── device/      # Host-side launch interface
│       │   ├── kernel/      # Kernel implementation
│       │   ├── threadblock/ # Threadblock-level operations
│       │   ├── warp/        # Warp-level operations
│       │   └── thread/      # Thread-level operations
│       ├── arch/            # Architecture-specific features
│       ├── transform/       # Data transformation
│       └── epilogue/        # Post-processing (fusion)
└── examples/                # Example code

Relationship with CUTLASS

Complexity Comparison

Aspect	This Project	CUTLASS
Lines of Code	~3,000	50,000+
Template Usage	Minimal	Heavy
Config Options	~10	100+
Learning Curve	Gentle	Steep

Feature Comparison

Feature	This Project	CUTLASS
FP32 GEMM	✅	✅
FP16 GEMM	✅	✅
INT8 GEMM	❌	✅
Tensor Core	❌	✅
Batch GEMM	✅	✅
Operator Fusion	✅ (Simple)	✅ (Complete)
Multi-GPU	❌	✅

CUTLASS Core Concepts

1. Layered Abstraction

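CUTLASS decomposes GEMM into a hierarchy of levels, each computing a smaller tile of the output. Every level is described by a tile shape:

cpp

#include "cutlass/gemm/gemm.h"  // cutlass::gemm::GemmShape

// Tile shapes (M x N x K) for each level; the sizes are illustrative.

// Threadblock level: each thread block computes one 128x128 tile of C
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 8>;

// Warp level: each warp computes a 32x64 part of that tile
using WarpShape = cutlass::gemm::GemmShape<32, 64, 8>;

// Thread level: each thread issues 1x1x1 multiply-adds (SIMT) within the warp tile
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 1>;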
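
The tiling hierarchy can be made concrete by counting how an output matrix is carved up at each level. The following is a minimal sketch in plain C++ with no CUTLASS dependency; `TileShape` and the helper functions are hypothetical names introduced for illustration, and the tile sizes in the usage note are example values rather than a recommended configuration.

```cpp
#include <cassert>

// Hypothetical helper type for illustration; not a CUTLASS class.
struct TileShape {
    int m;  // tile rows
    int n;  // tile columns
};

// Number of tiles of size `tile` needed to cover `extent` (rounds up).
constexpr int ceil_div(int extent, int tile) {
    return (extent + tile - 1) / tile;
}

// Threadblock level: thread blocks needed to cover an M x N output matrix.
constexpr int threadblocks_for(int M, int N, TileShape block) {
    return ceil_div(M, block.m) * ceil_div(N, block.n);
}

// Warp level: warps that cooperate on one threadblock tile.
inline int warps_per_block(TileShape block, TileShape warp) {
    assert(block.m % warp.m == 0 && block.n % warp.n == 0);
    return (block.m / warp.m) * (block.n / warp.n);
}

// Thread level: output elements accumulated by each of the 32 threads in a warp.
constexpr int elements_per_thread(TileShape warp) {
    return (warp.m * warp.n) / 32;
}
```

For a 1024 x 1024 output with 128 x 128 threadblock tiles and 32 x 64 warp tiles, these helpers give 64 thread blocks, 8 warps per block, and 64 accumulators per thread.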

2. Template Parameters

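CUTLASS configures each kernel through template parameters. The leading parameters of the device-level GEMM are:

cpp

#include "cutlass/gemm/device/gemm.h"

cutlass::gemm::device::Gemm<
    float,                          // ElementA: element type of A
    cutlass::layout::RowMajor,      // LayoutA
    float,                          // ElementB: element type of B
    cutlass::layout::ColumnMajor,   // LayoutB
    float,                          // ElementC: element type of C and D
    cutlass::layout::RowMajor,      // LayoutC
    float,                          // ElementAccumulator
    cutlass::arch::OpClassSimt,     // OperatorClass: CUDA cores (SIMT), not Tensor Cores
    cutlass::arch::Sm80             // ArchTag: target architecture (Ampere)
> gemm_op;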
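
To see why element types and layouts are passed as template parameters, it can help to write the same pattern on the CPU. The sketch below is a toy under stated assumptions: `ToyGemm`, `RowMajor`, and `ColumnMajor` are hypothetical stand-ins defined here, not CUTLASS types, and the triple loop is a reference implementation with no tiling. The point is only that the compiler resolves the layout indexing and the accumulator type at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy layout tags playing the role of cutlass::layout::RowMajor / ColumnMajor.
struct RowMajor {
    static std::size_t index(int row, int col, int /*rows*/, int cols) {
        return static_cast<std::size_t>(row) * cols + col;
    }
};

struct ColumnMajor {
    static std::size_t index(int row, int col, int rows, int /*cols*/) {
        return static_cast<std::size_t>(col) * rows + row;
    }
};

// Toy CPU GEMM configured entirely by template parameters (hypothetical).
template <typename ElementA, typename LayoutA,
          typename ElementB, typename LayoutB,
          typename ElementC, typename LayoutC,
          typename ElementAccumulator>
struct ToyGemm {
    // Computes C = A * B, where A is M x K and B is K x N.
    void operator()(int M, int N, int K,
                    const std::vector<ElementA>& A,
                    const std::vector<ElementB>& B,
                    std::vector<ElementC>& C) const {
        assert(A.size() == static_cast<std::size_t>(M) * K);
        assert(B.size() == static_cast<std::size_t>(K) * N);
        assert(C.size() == static_cast<std::size_t>(M) * N);
        for (int m = 0; m < M; ++m) {
            for (int n = 0; n < N; ++n) {
                ElementAccumulator acc = ElementAccumulator(0);
                for (int k = 0; k < K; ++k) {
                    acc += static_cast<ElementAccumulator>(A[LayoutA::index(m, k, M, K)]) *
                           static_cast<ElementAccumulator>(B[LayoutB::index(k, n, K, N)]);
                }
                C[LayoutC::index(m, n, M, N)] = static_cast<ElementC>(acc);
            }
        }
    }
};
```

Changing `ColumnMajor` to `RowMajor` for B changes only the type; the loop body stays the same, which is the property CUTLASS exploits to generate many kernel variants from one source.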

3. Epilogue Fusion

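CUTLASS's epilogue stage applies element-wise operations to the accumulators before they are written to global memory, which is how operator fusion is implemented. The basic functor computes D = alpha * accumulator + beta * C:

cpp

#include "cutlass/epilogue/thread/linear_combination.h"

using Epilogue = cutlass::epilogue::thread::LinearCombination<
    float,          // ElementOutput: output type
    4,              // Count: elements per vectorized access
    float,          // ElementAccumulator: accumulator type
    float           // ElementCompute: type used for the alpha/beta scaling
>;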
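
The idea behind epilogue fusion can be sketched on the CPU: a small functor is applied to each accumulator right before the single store, so scaling, bias, or activation costs no extra pass over memory. The names below (`ToyLinearCombination`, `ToyLinearCombinationRelu`, `apply_epilogue`) are hypothetical and only mirror the shape of the CUTLASS functor; this is not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy epilogue functor (hypothetical): D = alpha * accumulator + beta * C.
template <typename ElementOutput, typename ElementAccumulator, typename ElementCompute>
struct ToyLinearCombination {
    ElementCompute alpha;
    ElementCompute beta;

    ElementOutput operator()(ElementAccumulator acc, ElementOutput source) const {
        return static_cast<ElementOutput>(alpha * static_cast<ElementCompute>(acc) +
                                          beta * static_cast<ElementCompute>(source));
    }
};

// Same linear combination with a ReLU fused in: D = max(0, alpha * acc + beta * C).
template <typename ElementOutput, typename ElementAccumulator, typename ElementCompute>
struct ToyLinearCombinationRelu {
    ElementCompute alpha;
    ElementCompute beta;

    ElementOutput operator()(ElementAccumulator acc, ElementOutput source) const {
        ElementCompute value = alpha * static_cast<ElementCompute>(acc) +
                               beta * static_cast<ElementCompute>(source);
        return static_cast<ElementOutput>(value > ElementCompute(0) ? value : ElementCompute(0));
    }
};

// Applies the epilogue to every accumulator, writing each output element once.
template <typename Epilogue, typename ElementAccumulator, typename ElementOutput>
void apply_epilogue(const Epilogue& epilogue,
                    const std::vector<ElementAccumulator>& accumulators,
                    const std::vector<ElementOutput>& C,
                    std::vector<ElementOutput>& D) {
    assert(accumulators.size() == C.size() && C.size() == D.size());
    for (std::size_t i = 0; i < D.size(); ++i) {
        D[i] = epilogue(accumulators[i], C[i]);
    }
}
```

Swapping the functor type is enough to change what is fused; the GEMM main loop that produced the accumulators does not need to change.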

Design Patterns to Learn

1. Layered Design

This project's four-layer architecture is modeled on CUTLASS's layered design:

Application Layer  →  Benchmark / Tests
Engine Layer       →  InferenceEngine / Tensor
Kernel Layer       →  7-Level GEMM
Infrastructure     →  MemoryPool / StreamManager

2. Parameterized Design

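This project's AutoTuner follows CUTLASS's parameterization approach, describing each kernel variant by its tile sizes:

cpp

struct GemmConfig {
    int BLOCK_M;   // rows of the C tile computed by one thread block
    int BLOCK_N;   // columns of the C tile computed by one thread block
    int BLOCK_K;   // depth of the K slice processed per main-loop step
    int THREAD_M;  // rows of the register tile computed by one thread
    int THREAD_N;  // columns of the register tile computed by one thread
};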
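
A tuner usually discards configurations that cannot launch before it benchmarks the rest. The sketch below shows one way such a pre-filter could look. It assumes the conventional meaning of the fields (a thread block computes a BLOCK_M x BLOCK_N tile, each thread a THREAD_M x THREAD_N piece of it, and one float tile each of A and B sits in shared memory); the helper names and that interpretation are assumptions, not taken from the project's AutoTuner. The default limits are the usual CUDA values of 1024 threads per block and 48 KB of static shared memory per block.

```cpp
#include <cassert>
#include <cstddef>

struct GemmConfig {
    int BLOCK_M;
    int BLOCK_N;
    int BLOCK_K;
    int THREAD_M;
    int THREAD_N;
};

// Threads per block when each thread owns a THREAD_M x THREAD_N piece of the block tile.
inline int threads_per_block(const GemmConfig& c) {
    assert(c.BLOCK_M % c.THREAD_M == 0 && c.BLOCK_N % c.THREAD_N == 0);
    return (c.BLOCK_M / c.THREAD_M) * (c.BLOCK_N / c.THREAD_N);
}

// Shared memory for one A tile (BLOCK_M x BLOCK_K) plus one B tile (BLOCK_K x BLOCK_N) of floats.
inline std::size_t shared_mem_bytes(const GemmConfig& c) {
    return static_cast<std::size_t>(c.BLOCK_M * c.BLOCK_K + c.BLOCK_K * c.BLOCK_N) * sizeof(float);
}

// Hypothetical pre-filter: rejects configurations that violate basic launch limits.
inline bool is_launchable(const GemmConfig& c,
                          int max_threads = 1024,
                          std::size_t max_smem = 48 * 1024) {
    if (c.BLOCK_M <= 0 || c.BLOCK_N <= 0 || c.BLOCK_K <= 0 ||
        c.THREAD_M <= 0 || c.THREAD_N <= 0) {
        return false;
    }
    if (c.BLOCK_M % c.THREAD_M != 0 || c.BLOCK_N % c.THREAD_N != 0) {
        return false;  // the block tile must divide evenly among threads
    }
    return threads_per_block(c) <= max_threads && shared_mem_bytes(c) <= max_smem;
}
```

With `{128, 128, 8, 8, 8}` this yields 256 threads and 8192 bytes of shared memory, which passes; `{128, 128, 8, 2, 2}` would need 4096 threads per block and is rejected.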

3. Performance Analysis

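CUTLASS ships a command-line profiler, cutlass_profiler, that benchmarks the kernels compiled into it for a given problem size. It is a good model for how to measure GEMM performance:

bash

# Run from the CUTLASS build directory after building the profiler target
./tools/profiler/cutlass_profiler --operation=Gemm --m=1024 --n=1024 --k=1024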
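
Whichever tool reports the runtime, GEMM throughput is conventionally quoted in GFLOP/s, counting 2 * M * N * K floating-point operations (one multiply and one add per inner-product term). A minimal sketch of that arithmetic follows; the function names are hypothetical, and the elapsed time is assumed to come from your own measurement, for example CUDA events.

```cpp
#include <cassert>
#include <cmath>

// Floating-point operations for C = A * B with A of size M x K and B of size K x N.
inline double gemm_flops(long long M, long long N, long long K) {
    return 2.0 * static_cast<double>(M) * static_cast<double>(N) * static_cast<double>(K);
}

// Achieved throughput in GFLOP/s for a kernel that took `elapsed_ms` milliseconds.
inline double gflops(long long M, long long N, long long K, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return gemm_flops(M, N, K) / (elapsed_ms * 1e-3) / 1e9;
}

// Fraction of the hardware peak reached, with both values in GFLOP/s.
inline double efficiency(double achieved_gflops, double peak_gflops) {
    assert(peak_gflops > 0.0);
    return achieved_gflops / peak_gflops;
}
```

For example, a 1024 x 1024 x 1024 GEMM that runs in 1 ms achieves about 2147 GFLOP/s. Comparing that number against the device's theoretical peak shows how much headroom a kernel has left.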

Learning Path

Recommended Order

Week 1-2: This Project
    │
    │  Understand GEMM optimization basics
    │  Master shared memory, register blocking
    │
    ▼
Week 3-4: CUTLASS Examples
    │
    │  Read basic examples
    │  Understand template parameters
    │
    ▼
Week 5-6: CUTLASS Source
    │
    │  Deep dive into kernel implementation
    │  Learn warp-level operations
    │
    ▼
Week 7+:  CUTLASS Advanced Features
    │
    │  Tensor Core
    │  Epilogue fusion
    │  Multi-GPU parallelism

CUTLASS Learning Resources

Official Resources

Recommended Examples

Example	Learning Focus
`00_basic_gemm`	Basic usage
`10_planar_complex`	Complex-valued GEMM (planar layout)
`47_ampere_gemm_universal_streamk`	Universal interface
`24_gemm_grouped`	Grouped GEMM
`12_gemm_bias_relu`	Operator fusion (bias + ReLU epilogue)

Summary

Aspect	This Project	CUTLASS
Positioning	Introductory teaching	Production library
Complexity	Low	High
Learning Curve	Gentle	Steep
Production Ready	No (educational use)	Yes

Best Practice: Work through this project before moving on to CUTLASS. The fundamentals it covers (GEMM optimization basics, shared memory, register blocking) make CUTLASS's design philosophy much easier to follow.
