Performance

Performance overview and benchmarks for Tiny-LLM.

Key Results

| Metric   | Value    | vs FP16 |
|----------|----------|---------|
| Memory   | 7.8 GB   | 50% ↓   |
| Decode   | 85 tok/s | 9% ↑    |
| Accuracy | 9.12 ppl | 0.4% Δ  |

Benchmarks on LLaMA-7B, RTX 4090, INT8 weights


Memory Efficiency

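W8A16 quantization (INT8 weights, FP16 activations) roughly halves weight memory. The KV cache and activations stay at their FP16 size, so the total saving is smaller: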

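| Component     | FP16    | INT8 (W8A16) | Savings |
|---------------|---------|--------------|---------|
| Model Weights | 13.5 GB | 7.0 GB       | 48%     |
| KV Cache (2K) | 1.0 GB  | 1.0 GB       | -       |
| Activations   | 0.5 GB  | 0.5 GB       | -       |
| Total         | 15.0 GB | 8.5 GB       | 43%     |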
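
The savings figures can be checked with a little arithmetic. The sketch below is illustrative only: the parameter count (6.74e9), the group size of 128, and the FP16 per-group scales are assumptions, not values stated on this page, and the helper names are hypothetical.

```python
# Rough memory arithmetic for W8A16 (INT8 weights, FP16 activations).
# ASSUMPTIONS (not from the benchmark page): parameter count, group size,
# and FP16 per-group scales. Helper names are hypothetical.

GB = 1e9  # decimal gigabytes


def weight_bytes(n_params, bytes_per_weight, group_size=None, scale_bytes=2):
    """Bytes for the weights, plus one scale per group when quantized."""
    total = n_params * bytes_per_weight
    if group_size is not None:
        total += (n_params / group_size) * scale_bytes
    return total


def savings_pct(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


N_PARAMS = 6.74e9  # assumed LLaMA-7B parameter count

fp16_weights_gb = weight_bytes(N_PARAMS, 2) / GB
int8_weights_gb = weight_bytes(N_PARAMS, 1, group_size=128) / GB

# Totals taken directly from the table above.
fp16_total_gb = 13.5 + 1.0 + 0.5
int8_total_gb = 7.0 + 1.0 + 0.5

print(f"estimated FP16 weights: {fp16_weights_gb:.2f} GB")
print(f"estimated INT8 weights: {int8_weights_gb:.2f} GB")
print(f"weight savings (table): {savings_pct(13.5, 7.0):.0f}%")
print(f"total savings (table):  {savings_pct(fp16_total_gb, int8_total_gb):.0f}%")
```

Under these assumptions the INT8 estimate comes out slightly below the 7.0 GB in the table. The page does not say what accounts for the difference, so treat the estimate as a ballpark.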

Throughput

Decode Phase (Token Generation)

Prefill Phase (Prompt Processing)


Kernel Performance

Optimized CUDA kernels achieve high utilization:

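| Kernel       | Tensor Core | Memory BW | Occupancy |
|--------------|-------------|-----------|-----------|
| w8a16_matmul | 92%         | 580 GB/s  | 87%       |
| attn_decode  | 78%         | 420 GB/s  | 95%       |
| attn_prefill | 85%         | 480 GB/s  | 82%       |
| rmsnorm      | -           | 380 GB/s  | 100%      |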
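
To relate the bandwidth column to decode throughput, a back-of-envelope model helps. It assumes single-stream decoding reads every weight once per generated token. The model and the function name are illustrative assumptions, and the 1008 GB/s peak is the published RTX 4090 spec, not a number from this page.

```python
# Back-of-envelope decode estimate under a simple bandwidth-bound model.
# ASSUMPTION: each generated token streams all model weights once; KV-cache
# reads, non-matmul kernels, and cache effects are ignored.


def decode_tok_per_s(weight_gb, bandwidth_gb_s):
    """Tokens per second if weight traffic alone limits decoding."""
    return bandwidth_gb_s / weight_gb


INT8_WEIGHTS_GB = 7.0      # from the Memory Efficiency table
MATMUL_BW_GB_S = 580.0     # w8a16_matmul row in the table above
PEAK_BW_GB_S = 1008.0      # assumed: published RTX 4090 memory bandwidth

estimate = decode_tok_per_s(INT8_WEIGHTS_GB, MATMUL_BW_GB_S)
ceiling = decode_tok_per_s(INT8_WEIGHTS_GB, PEAK_BW_GB_S)
bw_utilization = MATMUL_BW_GB_S / PEAK_BW_GB_S

print(f"estimate at measured bandwidth: {estimate:.1f} tok/s")
print(f"ceiling at peak bandwidth:      {ceiling:.1f} tok/s")
print(f"matmul bandwidth utilization:   {bw_utilization:.1%}")
```

The estimate at the measured matmul bandwidth lands close to the 85 tok/s reported under Key Results, which is what you would expect if decoding is dominated by weight reads. It is a rough model rather than a strict bound.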

Architecture Impact

Performance is driven by architectural decisions.

Released under the MIT License.