Performance
This guide covers performance optimization techniques for Tiny-LLM.
Benchmarking
Running Benchmarks
```bash
./build/bin/tinyllm-bench --model model.bin --prompt "Hello, world!"
```
Key Metrics
| Metric | Description |
|---|---|
| Tokens/sec | Generation throughput |
| Time to First Token | Latency for first token |
| Memory Usage | GPU memory consumption |
| Utilization | GPU compute utilization |
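As a rough illustration of how the first two metrics relate to raw timings, the sketch below derives time to first token and decode throughput from per-token timestamps. The names `GenerationMetrics` and `computeMetrics` are hypothetical helpers for this example, not part of the Tiny-LLM API, and it assumes you can record a timestamp when each token is emitted.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper types for illustration; not part of the Tiny-LLM API.
using Clock = std::chrono::steady_clock;

struct GenerationMetrics {
    double time_to_first_token_ms;  // latency from request start to first token
    double tokens_per_sec;          // decode throughput after the first token
};

// Derives both metrics from the request start time and the time at which
// each generated token was emitted (in emission order).
GenerationMetrics computeMetrics(Clock::time_point start,
                                 const std::vector<Clock::time_point>& token_times) {
    GenerationMetrics m{0.0, 0.0};
    if (token_times.empty()) return m;
    assert(token_times.front() >= start);

    using Millis = std::chrono::duration<double, std::milli>;
    m.time_to_first_token_ms = Millis(token_times.front() - start).count();

    // Throughput over the decode phase: tokens after the first / elapsed seconds.
    if (token_times.size() > 1) {
        std::chrono::duration<double> decode = token_times.back() - token_times.front();
        if (decode.count() > 0.0) {
            m.tokens_per_sec =
                static_cast<double>(token_times.size() - 1) / decode.count();
        }
    }
    return m;
}
```

Measuring throughput only after the first token keeps prefill latency out of the tokens/sec figure, which is one common convention; a benchmark tool may define it differently.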
Optimization Techniques
1. KV Cache Tuning
```cpp
KVCacheConfig config;
config.max_batch_size = 1;       // Single sequence
config.max_seq_len = 2048;       // Match your needs
config.enable_swapping = false;  // Disable for single GPU
```
2. Batch Size
For throughput-critical applications:
```cpp
// Batch multiple sequences
engine.setBatchSize(8);
```
3. Flash Attention
Enable Flash Attention for faster inference:
```cpp
config.enable_flash_attention = true;
```
4. CUDA Graphs
Reduce kernel launch overhead:
```cpp
config.enable_cuda_graphs = true;
```
Memory Optimization
Memory Breakdown
| Component | Memory (LLaMA-7B) |
|---|---|
| Model Weights (INT8) | ~3.5 GB |
| KV Cache (2048 ctx) | ~1.0 GB |
| Activations | ~0.5 GB |
| Total | ~5.0 GB |
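To check whether a breakdown like this fits on a given card, a small budget helper can be useful. The sketch below is illustrative only: `MemoryBudget` and `fitsOnGpu` are hypothetical names, and the 10% headroom reserved for the CUDA context and allocator fragmentation is an assumption, not a measured Tiny-LLM figure.

```cpp
#include <cassert>

// Hypothetical struct for illustration; values are in GB and would come
// from a breakdown like the table above.
struct MemoryBudget {
    double weights_gb;
    double kv_cache_gb;
    double activations_gb;
};

// Sum of all components in GB.
double totalGb(const MemoryBudget& b) {
    return b.weights_gb + b.kv_cache_gb + b.activations_gb;
}

// Returns true if the estimate fits on a GPU with `gpu_gb` of memory while
// leaving `headroom_fraction` of it free (assumed default: 10%).
bool fitsOnGpu(const MemoryBudget& b, double gpu_gb, double headroom_fraction = 0.1) {
    assert(gpu_gb > 0.0 && headroom_fraction >= 0.0 && headroom_fraction < 1.0);
    return totalGb(b) <= gpu_gb * (1.0 - headroom_fraction);
}
```

With the table's figures (`MemoryBudget{3.5, 1.0, 0.5}`), the total is 5.0 GB, which fits a 12 GB card under this headroom assumption.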
Reducing Memory Usage
Reduce context length:
```cpp
config.max_seq_len = 1024; // Halves KV cache
```
Enable KV cache offloading:
```cpp
config.enable_swapping = true;
```
Use smaller batch size:
```cpp
config.max_batch_size = 1;
```
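The first option can be automated with a simple loop that halves the context length until the KV cache fits a budget. This is a minimal sketch under stated assumptions: the `KVCacheConfig` struct here is a local stand-in that mirrors only the fields used in this guide (the real type may differ), `shrinkContextToFit` is a hypothetical helper, and the 0.5 MB per token constant is read off the LLaMA-7B figure in the Memory Breakdown table (2048 tokens for about 1.0 GB), assuming linear scaling with batch size.

```cpp
#include <cassert>

// Local stand-in mirroring the fields used in this guide; the real
// KVCacheConfig in Tiny-LLM may have a different layout.
struct KVCacheConfig {
    int max_batch_size = 1;
    int max_seq_len = 2048;
    bool enable_swapping = false;
};

// Assumed KV cache cost per token per sequence for LLaMA-7B, in MB.
constexpr double kKvMbPerToken = 0.5;

// Estimated KV cache size in MB, assuming linear scaling in both dimensions.
double kvCacheMb(const KVCacheConfig& c) {
    return kKvMbPerToken * c.max_seq_len * c.max_batch_size;
}

// Halves max_seq_len until the estimate fits in `budget_mb` or the next
// halving would drop below `min_seq_len`. Returns true if the result fits.
bool shrinkContextToFit(KVCacheConfig& c, double budget_mb, int min_seq_len = 256) {
    assert(budget_mb > 0.0 && min_seq_len > 0);
    while (kvCacheMb(c) > budget_mb && c.max_seq_len / 2 >= min_seq_len) {
        c.max_seq_len /= 2;
    }
    return kvCacheMb(c) <= budget_mb;
}
```

If the function returns false, the context floor was reached without fitting, and the remaining options are a smaller batch size or enabling swapping.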
Profiling
Using Nsight Systems
```bash
nsys profile -o profile ./build/bin/tinyllm-bench --model model.bin
```
Using Nsight Compute
```bash
ncu --set full -o kernel_profile ./build/bin/tinyllm-bench --model model.bin
```
CUDA Profiling Tools
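```bash
# Enable CUDA profiling (legacy command-line profiler switch; recent CUDA
# toolkits may ignore it, in which case use nsys or ncu as shown above)
export CUDA_PROFILE=1
./build/bin/tinyllm-bench --model model.bin
```
Performance Guidelines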
GPU Selection
| GPU | Recommended For |
|---|---|
| RTX 3060 (12GB) | Small models (7B) |
| RTX 4090 (24GB) | Medium models (13B-30B) |
| A100 (40GB) | Large models (65B+) |
| H100 (80GB) | Largest models |
Software Configuration
- Use CUDA 12+ for best performance
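- Raise the power limit so the GPU can sustain its maximum clocks (P-State 0). The value is in watts and must be within the range your card supports:
  ```bash
  sudo nvidia-smi -i 0 -pl 300 # Set power limit
  ```
- Disable ECC for slightly more memory (ECC-capable GPUs only; takes effect after a reboot):
  ```bash
  sudo nvidia-smi -e 0
  ```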
Benchmarks
LLaMA-7B (INT8) on RTX 4090
| Batch Size | Prefill (tokens/sec) | Decode (tokens/sec) |
|---|---|---|
| 1 | 850 | 65 |
| 4 | 2100 | 180 |
| 8 | 3400 | 290 |
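One way to read this table is to separate aggregate throughput from what each individual sequence experiences. The sketch below assumes the tokens/sec figures are totals across the whole batch (the table does not say so explicitly); `BenchRow`, `perSequenceDecode`, and `decodeScalingEfficiency` are hypothetical names used only for this example.

```cpp
#include <cassert>

// One row of a benchmark table; throughput is assumed to be the aggregate
// across all sequences in the batch.
struct BenchRow {
    int batch_size;
    double prefill_tps;
    double decode_tps;
};

// Decode tokens/sec seen by each individual sequence in the batch.
double perSequenceDecode(const BenchRow& r) {
    assert(r.batch_size > 0);
    return r.decode_tps / r.batch_size;
}

// Per-sequence decode rate relative to a baseline row.
// 1.0 means throughput scaled perfectly linearly with batch size.
double decodeScalingEfficiency(const BenchRow& baseline, const BenchRow& r) {
    assert(perSequenceDecode(baseline) > 0.0);
    return perSequenceDecode(r) / perSequenceDecode(baseline);
}
```

Under that assumption, batch size 8 delivers 290 / 8 = 36.25 tokens/sec per sequence versus 65 at batch size 1: total throughput rises, but each request decodes more slowly, which matters for latency-sensitive use.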
Memory Scaling
| Context Length | KV Cache Memory |
|---|---|
| 512 | 256 MB |
| 1024 | 512 MB |
| 2048 | 1.0 GB |
| 4096 | 2.0 GB |
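These figures are consistent with the usual KV cache size estimate: two tensors (keys and values) per layer, each holding one hidden-size vector per token. The sketch below reproduces the table for LLaMA-7B (32 layers, hidden size 4096), assuming an FP16 cache (2 bytes per element) and a single sequence. `ModelDims` and `kvCacheBytes` are hypothetical names, and the FP16 assumption is inferred from the numbers rather than stated by the source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of the model dimensions that determine KV cache size.
struct ModelDims {
    std::uint64_t num_layers;
    std::uint64_t hidden_size;     // num_heads * head_dim
    std::uint64_t bytes_per_elem;  // 2 for an FP16 cache (assumed)
};

// KV cache bytes = 2 (K and V) * layers * hidden_size * bytes_per_elem
//                  * seq_len * batch_size
std::uint64_t kvCacheBytes(const ModelDims& m,
                           std::uint64_t seq_len,
                           std::uint64_t batch_size = 1) {
    assert(m.num_layers > 0 && m.hidden_size > 0 && m.bytes_per_elem > 0);
    return 2 * m.num_layers * m.hidden_size * m.bytes_per_elem * seq_len * batch_size;
}
```

For `ModelDims{32, 4096, 2}` this works out to 512 KiB per token, so a 512-token context needs 256 MiB and each doubling of context length doubles the cache, matching the table.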
Troubleshooting Performance
Low Token Generation Rate
- Check GPU utilization: `nvidia-smi dmon`
- Verify CUDA version compatibility
- Ensure model is loaded to GPU
- Check for CPU bottlenecks
Memory Errors
- Reduce context length
- Reduce batch size
- Enable KV cache swapping
Next Steps
- Troubleshooting Guide - Common issues
- API Reference - Complete API documentation