References

Academic references and related projects for Tiny-LLM.

Quantization

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Authors: Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

Paper: arXiv:2208.07339

Summary: Introduces INT8 matrix multiplication for large language models, using a mixed-precision decomposition that detects outlier feature dimensions and keeps them in 16-bit precision. Our W8A16 approach builds on these foundations.

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Authors: Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

Paper: arXiv:2210.17323

Summary: One-shot post-training weight quantization method based on approximate second-order information, reaching 3- to 4-bit weights with minimal accuracy loss.

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Authors: Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Song Han

Paper: arXiv:2306.00978

Summary: Weight-only quantization that uses activation statistics to identify the most salient weight channels and protects them through per-channel scaling, improving accuracy at low bit widths.

KV Cache & Memory Management

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

Authors: Woosuk Kwon, Zhuohan Li, et al.

Paper: arXiv:2309.06180

Summary: PagedAttention algorithm for efficient KV cache management, which stores the cache in fixed-size blocks that need not be contiguous in memory. It inspired our sequence management approach.

Efficient Memory Management for Large Language Model Serving with PagedAttention

Conference: SOSP 2023

Summary: Detailed explanation of the PagedAttention memory management strategy.

CUDA Optimization

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Authors: Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

Paper: arXiv:2205.14135

Summary: IO-aware exact attention algorithm that uses tiling to reduce memory reads and writes between GPU high-bandwidth memory and on-chip SRAM. FlashAttention-2 further improves upon this.

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Authors: Tri Dao

Paper: arXiv:2307.08691

Summary: Successor to FlashAttention that reduces non-matmul FLOPs, parallelizes across more thread blocks, and partitions work between warps more efficiently.

CUTLASS: CUDA Templates for Linear Algebra Subroutines

Repository: NVIDIA/cutlass

Summary: CUDA C++ template library for matrix multiplication. Our W8A16 kernel design follows CUTLASS patterns.

Transformer Architecture

LLaMA: Open and Efficient Foundation Language Models

Authors: Hugo Touvron, Thibaut Lavril, et al.

Paper: arXiv:2302.13971

Summary: Foundation for our model architecture, including RMSNorm, SwiGLU, and RoPE.

RoFormer: Enhanced Transformer with Rotary Position Embedding

Authors: Jianlin Su, Yu Lu, et al.

Paper: arXiv:2104.09864

Summary: Introduces Rotary Position Embedding (RoPE), the scheme used for our position encoding.

Related Projects

llama.cpp

Repository: ggerganov/llama.cpp

Summary: Inference of LLaMA models in pure C/C++. Great reference for CPU optimization techniques.

TensorRT-LLM

Repository: NVIDIA/TensorRT-LLM

Summary: NVIDIA's optimized inference library for LLMs. Reference for production-grade CUDA kernels.

xFormers

Repository: facebookresearch/xformers

Summary: Facebook's library of composable transformer building blocks.

MLC-LLM

Repository: mlc-ai/mlc-llm

Summary: Universal LLM deployment engine built on an Apache TVM backend.

CUDA Programming Resources

NVIDIA CUDA Programming Guide

URL: CUDA C++ Programming Guide

Summary: Official CUDA programming documentation.

NVIDIA CUDA Best Practices Guide

URL: CUDA C++ Best Practices Guide

Summary: Optimization guidelines for CUDA applications.

Programming Massively Parallel Processors

Authors: David B. Kirk, Wen-mei W. Hwu

Summary: Comprehensive textbook on GPU programming fundamentals.

Performance Analysis Tools

NVIDIA Nsight Compute

URL: Nsight Compute

Summary: Kernel-level profiling and analysis tool.

NVIDIA Nsight Systems

URL: Nsight Systems

Summary: System-wide profiling and tracing tool.
