Performance

This guide covers performance optimization techniques for Tiny-LLM.

Benchmarking

Running Benchmarks

bash
./build/bin/tinyllm-bench --model model.bin --prompt "Hello, world!"

Key Metrics

| Metric | Description |
| --- | --- |
| Tokens/sec | Generation throughput |
| Time to First Token | Latency for first token |
| Memory Usage | GPU memory consumption |
| Utilization | GPU compute utilization |
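As a rough illustration of how the first two metrics relate, the sketch below derives them from three measurements. The `GenerationTiming` struct and both helpers are hypothetical and not part of the Tiny-LLM API; the sketch also assumes that the decode rate excludes the first token, which `tinyllm-bench` may or may not do.

```cpp
// Hypothetical helper types, not part of the Tiny-LLM API.
// Timestamps are seconds measured from the moment the request was submitted.
struct GenerationTiming {
    double first_token_s;  // arrival time of the first output token
    double last_token_s;   // arrival time of the final output token
    int generated_tokens;  // number of output tokens produced
};

// Time to First Token is the arrival time of the first token.
constexpr double time_to_first_token(const GenerationTiming& t) {
    return t.first_token_s;
}

// Decode throughput: tokens after the first one, divided by the time spent
// producing them. Returns 0 when there is nothing to measure.
constexpr double decode_tokens_per_sec(const GenerationTiming& t) {
    return (t.generated_tokens > 1 && t.last_token_s > t.first_token_s)
               ? (t.generated_tokens - 1) / (t.last_token_s - t.first_token_s)
               : 0.0;
}
```

For example, a run whose first token arrives at 0.5 s and whose 131st token arrives at 2.5 s decodes at 130 / 2.0 = 65 tokens/sec.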

Optimization Techniques

1. KV Cache Tuning

cpp
KVCacheConfig config;
config.max_batch_size = 1;     // Single sequence
config.max_seq_len = 2048;      // Match your needs
config.enable_swapping = false; // Disable for single GPU

2. Batch Size

For throughput-critical applications:

cpp
// Batch multiple sequences
engine.setBatchSize(8);

3. Flash Attention

Enable Flash Attention for faster inference:

cpp
config.enable_flash_attention = true;

4. CUDA Graphs

Reduce kernel launch overhead:

cpp
config.enable_cuda_graphs = true;

Memory Optimization

Memory Breakdown

| Component | Memory (LLaMA-7B) |
| --- | --- |
| Model Weights (INT8) | ~3.5 GB |
| KV Cache (2048 ctx) | ~1.0 GB |
| Activations | ~0.5 GB |
| Total | ~5.0 GB |

Reducing Memory Usage

  1. Reduce context length:

    cpp
    config.max_seq_len = 1024;  // Halves the KV cache relative to max_seq_len = 2048
  2. Enable KV cache offloading:

    cpp
    config.enable_swapping = true;
  3. Use smaller batch size:

    cpp
    config.max_batch_size = 1;
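To get a feel for what these settings buy, the sketch below re-adds the figures from the Memory Breakdown table for a given context length and batch size. It is a back-of-the-envelope estimate only: it assumes the KV cache scales linearly with both settings, that weights and activations stay fixed, and that swapping is off. `estimated_total_gb` is a hypothetical helper, not a Tiny-LLM function.

```cpp
// Figures taken from the Memory Breakdown table (LLaMA-7B, batch size 1).
constexpr double kWeightsGb = 3.5;      // model weights
constexpr double kKvGbAt2048 = 1.0;     // KV cache at a 2048-token context
constexpr double kActivationsGb = 0.5;  // assumed constant in this sketch

// Hypothetical estimate: weights + activations + linearly scaled KV cache.
constexpr double estimated_total_gb(int max_seq_len, int max_batch_size) {
    return kWeightsGb + kActivationsGb +
           kKvGbAt2048 * (max_seq_len / 2048.0) * max_batch_size;
}
```

Under these assumptions, dropping `max_seq_len` from 2048 to 1024 at batch size 1 lowers the estimate from 5.0 GB to 4.5 GB, while a batch size of 8 at 2048 tokens raises it to 12.0 GB.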

Profiling

Using Nsight Systems

bash
nsys profile -o profile ./build/bin/tinyllm-bench --model model.bin

Using Nsight Compute

bash
ncu --set full -o kernel_profile ./build/bin/tinyllm-bench --model model.bin

CUDA Profiling Tools

bash
# Enable the legacy CUDA command-line profiler (honored only by older toolkits)
export CUDA_PROFILE=1
./build/bin/tinyllm-bench --model model.bin

Performance Guidelines

GPU Selection

| GPU | Recommended For |
| --- | --- |
| RTX 3060 (12GB) | Small models (7B) |
| RTX 4090 (24GB) | Medium models (13B-30B) |
| A100 (40GB) | Large models (65B+) |
| H100 (80GB) | Largest models |

Software Configuration

  1. Use CUDA 12+ for best performance
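  2. Raise the power limit so the GPU can sustain its top clocks (P-State 0) under load. The value must be within the range the GPU supports:
    bash
    sudo nvidia-smi -i 0 -pl 300  # Set the power limit of GPU 0 to 300 W
  3. Disable ECC for slightly more memory (ECC-capable GPUs only; takes effect after a reboot):
    bash
    sudo nvidia-smi -e 0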

Benchmarks

LLaMA-7B (INT8) on RTX 4090

| Batch Size | Prefill (tokens/sec) | Decode (tokens/sec) |
| --- | --- | --- |
| 1 | 850 | 65 |
| 4 | 2100 | 180 |
| 8 | 3400 | 290 |
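One way to read this table is to ask how much each sequence gets and how close batching comes to ideal linear scaling. The sketch below does that arithmetic; it assumes the decode figures are aggregate throughput across the whole batch, which the table does not state explicitly. Both helper names are hypothetical.

```cpp
// Per-sequence rate, assuming the aggregate throughput is shared evenly.
constexpr double per_sequence_tokens_per_sec(double aggregate_tps, int batch_size) {
    return aggregate_tps / batch_size;
}

// Fraction of ideal linear scaling achieved, relative to the batch-size-1 rate.
constexpr double scaling_efficiency(double aggregate_tps, int batch_size,
                                    double single_tps) {
    return aggregate_tps / (batch_size * single_tps);
}
```

With the decode column above, batch size 8 gives 290 / 8 = 36.25 tokens/sec per sequence, roughly 56% of ideal scaling from the 65 tokens/sec measured at batch size 1. Total throughput rises, but each individual sequence generates more slowly.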

Memory Scaling

| Context Length | KV Cache Memory |
| --- | --- |
| 512 | 256 MB |
| 1024 | 512 MB |
| 2048 | 1.0 GB |
| 4096 | 2.0 GB |
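These figures are consistent with the usual KV cache size formula: two tensors (keys and values) per layer, each holding `batch_size x seq_len x hidden_size` elements. The sketch below reproduces the table under the assumption of the standard LLaMA-7B shape (32 layers, hidden size 4096), an FP16 cache (2 bytes per element), and batch size 1. `kv_cache_bytes` is a hypothetical helper, not a Tiny-LLM function, and the engine's actual allocation may differ.

```cpp
#include <cstdint>

// KV cache size in bytes: K and V tensors for every layer.
constexpr std::uint64_t kv_cache_bytes(std::uint64_t num_layers,
                                       std::uint64_t hidden_size,
                                       std::uint64_t seq_len,
                                       std::uint64_t batch_size,
                                       std::uint64_t bytes_per_element) {
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_element;
}

constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// Assumed LLaMA-7B shape with an FP16 cache; compile-time checks against the table.
static_assert(kv_cache_bytes(32, 4096, 512, 1, 2) == 256 * kMiB, "512 ctx");
static_assert(kv_cache_bytes(32, 4096, 2048, 1, 2) == 1024 * kMiB, "2048 ctx");
```

Because every factor enters linearly, doubling either the context length or the batch size doubles the cache.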

Troubleshooting Performance

Low Token Generation Rate

  1. Check GPU utilization: nvidia-smi dmon
  2. Verify CUDA version compatibility
  3. Ensure model is loaded to GPU
  4. Check for CPU bottlenecks

Memory Errors

  1. Reduce context length
  2. Reduce batch size
  3. Enable KV cache swapping
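When choosing how far to reduce the first two settings, it can help to work backwards from the memory left for the KV cache. The sketch below is a hypothetical helper that uses the per-token cost implied by the Memory Scaling table (256 MB for 512 tokens, read here as 512 KiB per token per sequence). That constant applies to LLaMA-7B only and is an estimate, not a value reported by Tiny-LLM.

```cpp
#include <cstdint>

// Implied by the Memory Scaling table: 256 MB / 512 tokens per sequence.
constexpr std::uint64_t kKvBytesPerToken = 512ull * 1024ull;

// Largest context length whose KV cache fits in budget_bytes when
// max_batch_size sequences are cached at once. Returns 0 for an empty batch.
constexpr std::uint64_t max_context_tokens(std::uint64_t budget_bytes,
                                           std::uint64_t max_batch_size) {
    return max_batch_size == 0
               ? 0
               : budget_bytes / (kKvBytesPerToken * max_batch_size);
}
```

With a 1 GiB budget this gives 2048 tokens at batch size 1 and 512 tokens at batch size 4, so `max_seq_len` and `max_batch_size` trade off directly against each other.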

Released under the MIT License.