Performance

Performance overview and benchmarks for Tiny-LLM.

Key Results

All benchmarks were run with LLaMA-7B on an RTX 4090 using INT8 weights.

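W8A16 quantization (INT8 weights, FP16 activations) cuts weight memory by 48% and total memory by 43%. The KV cache and activations stay in FP16, so their footprint is unchanged: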

Component	FP16	INT8 (W8A16)	Savings
Model Weights	13.5 GB	7.0 GB	48%
KV Cache (2K tokens)	1.0 GB	1.0 GB	—
Activations	0.5 GB	0.5 GB	—
Total	15.0 GB	8.5 GB	43%
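
The figures in this table can be roughly reproduced from first principles. The sketch below assumes LLaMA-7B's published shape (about 6.74 billion parameters, 32 layers, hidden size 4096), an FP16 KV cache, and decimal gigabytes. The function names are illustrative and not part of Tiny-LLM. It ignores quantization scales and other metadata, which may account for part of the gap between the raw INT8 estimate and the 7.0 GB reported above.

```python
# Back-of-the-envelope memory estimate for LLaMA-7B (assumed shape, not measured).
N_PARAMS = 6.74e9   # assumption: approximate LLaMA-7B parameter count
N_LAYERS = 32       # assumption: LLaMA-7B layer count
HIDDEN = 4096       # assumption: LLaMA-7B hidden size
GB = 1e9            # assumption: decimal gigabytes


def weight_gb(bytes_per_param: float) -> float:
    """Weight storage in GB for a given per-parameter width."""
    return N_PARAMS * bytes_per_param / GB


def kv_cache_gb(seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: one K and one V vector per layer per token."""
    return 2 * N_LAYERS * HIDDEN * seq_len * bytes_per_elem / GB


def savings_pct(before: float, after: float) -> float:
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before


fp16_weights = weight_gb(2)   # 2 bytes per FP16 weight
int8_weights = weight_gb(1)   # 1 byte per INT8 weight, before any scales
kv_2k = kv_cache_gb(2048)     # 2K-token context, FP16 cache

print(f"FP16 weights:  {fp16_weights:.1f} GB")
print(f"INT8 weights:  {int8_weights:.1f} GB")
print(f"KV cache (2K): {kv_2k:.1f} GB")

# Savings recomputed from the totals reported in the table.
print(f"Weight savings: {savings_pct(13.5, 7.0):.0f}%")
print(f"Total savings:  {savings_pct(15.0, 8.5):.0f}%")
```

The estimate lands close to the table: about 13.5 GB of FP16 weights and roughly 1 GB of KV cache at 2K tokens. Small differences are expected because of rounding and the choice between decimal and binary gigabytes.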

The optimized CUDA kernels reach 78% to 92% Tensor Core utilization and 82% to 100% occupancy:

Kernel	Tensor Core Utilization	Memory Bandwidth	Occupancy
w8a16_matmul	92%	580 GB/s	87%
attn_decode	78%	420 GB/s	95%
attn_prefill	85%	480 GB/s	82%
rmsnorm	—	380 GB/s	100%
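
To put the memory bandwidth column in context, the achieved figures can be compared against the GPU's theoretical peak. The sketch below assumes the RTX 4090's spec-sheet peak of 1008 GB/s, which is a vendor figure and not a number from these benchmarks. The dictionary simply restates the table, and `bw_fraction` is a hypothetical helper.

```python
# Achieved memory bandwidth as a fraction of theoretical peak.
PEAK_BW_GBPS = 1008.0  # assumption: RTX 4090 spec-sheet peak, in GB/s

# Achieved bandwidth per kernel in GB/s, copied from the table above.
achieved = {
    "w8a16_matmul": 580.0,
    "attn_decode": 420.0,
    "attn_prefill": 480.0,
    "rmsnorm": 380.0,
}


def bw_fraction(gbps: float, peak: float = PEAK_BW_GBPS) -> float:
    """Return achieved bandwidth divided by peak bandwidth."""
    return gbps / peak


fractions = {name: bw_fraction(bw) for name, bw in achieved.items()}
for name, frac in fractions.items():
    print(f"{name:14s} {frac:6.1%} of peak")
```

Under that assumption the kernels sit between roughly 38% and 58% of peak bandwidth. A kernel can show high Tensor Core utilization and occupancy while staying well below peak bandwidth, so the three columns are best read together.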

Performance is driven by architectural decisions: