Getting Started

Welcome to Tiny-LLM! This guide will help you get up and running quickly.

What is Tiny-LLM?

Tiny-LLM is a minimal, educational LLM inference engine implemented in CUDA C++. It provides:

W8A16 Quantization: INT8 weight quantization with 16-bit activations
CUDA Kernels: Optimized kernels for attention, FFN, and more
KV Cache: Efficient autoregressive generation with pre-allocated cache
No-Exception Design: Result\<T\> based error propagation

Prerequisites

Before you begin, ensure you have:

CUDA Toolkit 11.0 or higher
CMake 3.18 or higher
C++17 compatible compiler (GCC 9+, Clang 10+, MSVC 2019+)
NVIDIA GPU with Compute Capability 7.0+ (Volta or newer)

Quick Start

bash

# Clone the repository
git clone https://github.com/AICL-Lab/tiny-llm.git
cd tiny-llm

# Configure and build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run tests
ctest --test-dir build --output-on-failure

Next Steps

Installation Guide - Detailed installation instructions
Quick Start Tutorial - Build your first inference
Architecture Overview - Understand the internals

Getting Started ​

What is Tiny-LLM? ​

Prerequisites ​

Quick Start ​

Next Steps ​

Getting Started

What is Tiny-LLM?

Prerequisites

Quick Start

Next Steps