
Benchmarks

Performance benchmarks and profiling data for Tiny-LLM.

System Configuration

Reference benchmarking system:

| Component | Specification |
|-----------|---------------|
| GPU | NVIDIA RTX A6000 (Ampere, 48 GB) |
| CPU | AMD EPYC 7763 64-Core |
| RAM | 256 GB DDR4 |
| CUDA | 12.2 |
| Driver | 535.104 |

End-to-End Benchmarks

Throughput (tokens/second)

Model: 7B parameters, 4096 hidden, 32 layers, 32 heads

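| Batch Size | Sequence Length | Prefill (tok/s) | Decode (tok/s) | Memory (GB) |
|------------|-----------------|-----------------|----------------|-------------|
| 1 | 128 | 12,800 | 85 | 4.2 |
| 1 | 512 | 10,240 | 82 | 5.8 |
| 1 | 2048 | 6,400 | 76 | 11.2 |
| 4 | 128 | 24,000 | 280 | 11.8 |
| 4 | 512 | 18,432 | 270 | 16.4 |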

Note: Batch sizes above 1 need enough free GPU memory to hold a separate KV cache for each sequence in the batch.

W8A16 vs FP16 Comparison

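| Metric | W8A16 | FP16 | Difference |
|--------|-------|------|------------|
| Weight Memory | 7.5 GB | 15 GB | 50% smaller |
| Activation Memory | Same | Same | - |
| Throughput | 85 tok/s | 78 tok/s | 9% faster |
| Accuracy (perplexity) | 9.12 | 9.08 | 0.4% higher (slightly worse) |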

Kernel Benchmarks

W8A16 Matrix Multiplication

Configuration: M=1, K=4096, N=4096

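| GPU | Time (μs) | Throughput (TFLOPS) | Tensor Core % |
|-----|-----------|---------------------|---------------|
| RTX A6000 | 42 | 0.80 | 78% |
| A100 | 35 | 0.96 | 82% |
| RTX 4090 | 28 | 1.20 | 85% |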

Attention Decode

Configuration: batch=1, heads=32, head_dim=128, varying seq_len

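| Seq Len | Time (μs) | Memory Bandwidth (GB/s) |
|---------|-----------|-------------------------|
| 128 | 24 | 420 |
| 512 | 52 | 780 |
| 2048 | 180 | 920 |
| 8192 | 680 | 980 |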

Note: Decode is memory bandwidth bound due to KV cache reads.
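One way to see why decode is bandwidth bound is to compare the arithmetic it performs with the bytes it reads. The sketch below is an illustration under simplifying assumptions, not a model of the actual Tiny-LLM kernel: it covers one attention layer, assumes FP16 K and V with every cached element read once, and ignores softmax and the query and output traffic. `DecodeCost` and `attentionDecodeCost` are hypothetical names.

```cpp
// FLOPs and KV bytes for one decode step of one attention layer (hypothetical helper).
struct DecodeCost {
    double flops;  // each multiply-add counted as 2 FLOPs
    double bytes;  // K and V cache bytes read
};

// Assumes FP16 K/V, each cached element read once, softmax and Q/O traffic ignored.
DecodeCost attentionDecodeCost(int batch, int heads, int headDim, int seqLen) {
    double elems = static_cast<double>(batch) * heads * headDim * seqLen;
    DecodeCost c;
    c.flops = 2.0 * elems    // query . K^T scores
            + 2.0 * elems;   // attention weights . V
    c.bytes = 2.0 * elems * 2.0;  // K and V tensors, 2 bytes per element
    return c;
}
```

Under these assumptions the ratio is 1 FLOP per byte at every sequence length. GPUs of this class generally need tens of FLOPs per byte before compute becomes the limit, so the time spent streaming the KV cache sets the pace.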

RMSNorm

| Hidden Dim | Time (μs) | Bandwidth (TB/s) |
|------------|-----------|------------------|
| 4096 | 1.2 | 2.7 |
| 8192 | 2.1 | 3.1 |

Memory Usage

Model Weights (7B Model)

| Component | W8A16 Size | FP16 Size |
|-----------|------------|-----------|
| Embeddings | 250 MB | 250 MB |
| 32 × Attention Layers | 4.0 GB | 8.0 GB |
| 32 × FFN Layers | 3.5 GB | 7.0 GB |
| Output Norm + LM Head | ~0 | ~0 |
| Total Weights | ~7.8 GB | ~15.3 GB |

Runtime Memory

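| Configuration | Weights | KV Cache | Activations | Total |
|---------------|---------|----------|-------------|-------|
| Batch=1, Seq=2048 | 7.8 GB | 0.5 GB | 0.1 GB | 8.4 GB |
| Batch=4, Seq=2048 | 7.8 GB | 2.0 GB | 0.4 GB | 10.2 GB |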

KV Cache Formula (bytes): 2 × batch × num_layers × seq_len × num_kv_heads × head_dim × sizeof(half)

For the 7B model (32 layers, 32 KV heads, head_dim = 128):

  • Per token, per layer: 2 × 32 × 128 × 2 bytes = 16,384 bytes (16 KiB, ≈16.4 KB)
  • 2048 tokens: 32 MiB (≈33.6 MB) per layer → 1 GiB (≈1.07 GB) across all 32 layers, per sequence
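The formula above translates directly into a small sizing helper. This is a sketch for estimating memory ahead of time; `kvCacheBytes` is a hypothetical name and does not refer to an existing Tiny-LLM function.

```cpp
#include <cstdint>

// KV cache size in bytes, following the formula above (hypothetical helper name).
// The leading 2 covers the separate K and V tensors; bytesPerElem = 2 is sizeof(half).
std::uint64_t kvCacheBytes(std::uint64_t batch, std::uint64_t numLayers,
                           std::uint64_t seqLen, std::uint64_t numKvHeads,
                           std::uint64_t headDim, std::uint64_t bytesPerElem = 2) {
    return 2 * batch * numLayers * seqLen * numKvHeads * headDim * bytesPerElem;
}
```

For example, `kvCacheBytes(1, 32, 2048, 32, 128)` evaluates to 1,073,741,824 bytes (1 GiB), and a batch of 4 needs four times that. 64-bit arithmetic is used because the product overflows 32 bits at larger batch sizes and sequence lengths.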

Profiling Guide

Nsight Compute

Profile individual kernels:

```bash
# Profile a specific kernel
ncu --kernel-name attention_decode \
    --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed \
    ./test_attention

# Full report: collect the full section set and write report.ncu-rep
ncu --set full -o report ./benchmark
ncu-ui report.ncu-rep  # Open in GUI
```

Nsight Systems

Trace full application:

```bash
nsys profile -o profile --stats=true ./tiny_llm_demo
# Recent Nsight Systems releases write .nsys-rep; older ones wrote .qdrep
nsys-ui profile.nsys-rep
```

Custom Timers

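```cpp
#include <chrono>
#include <iostream>

// Host-side wall-clock timer. Timing starts at construction.
class Timer {
    // steady_clock is monotonic, so intervals are unaffected by system clock adjustments
    using Clock = std::chrono::steady_clock;
    Clock::time_point start_;
public:
    Timer() : start_(Clock::now()) {}

    // Milliseconds elapsed since construction
    float elapsedMs() const {
        auto end = Clock::now();
        return std::chrono::duration<float, std::milli>(end - start_).count();
    }
};

// Usage (engine, prompt, and config come from the surrounding application).
// If generate() can return before the GPU work finishes, synchronize the device
// before reading the timer; otherwise only launch overhead is measured.
Timer t;
engine->generate(prompt, config);
std::cout << "Generation took: " << t.elapsedMs() << " ms" << std::endl;
```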
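A single timed run is sensitive to warm-up effects and scheduling noise, so benchmark figures are usually taken from repeated runs. The following sketch shows one common approach, which is to discard a few warm-up iterations and report the median of the rest. `medianRunMs` is a hypothetical helper and is not necessarily how the numbers on this page were collected.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

// Runs `fn` a few times untimed, then returns the median of `iters` timed runs in ms.
// Hypothetical helper, not part of Tiny-LLM. `fn` must block until its work is done.
double medianRunMs(const std::function<void()>& fn, int warmup = 3, int iters = 10) {
    using Clock = std::chrono::steady_clock;
    if (iters < 1) iters = 1;               // need at least one sample
    for (int i = 0; i < warmup; ++i) fn();  // warm caches, allocators, lazy initialization

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iters));
    for (int i = 0; i < iters; ++i) {
        auto start = Clock::now();
        fn();
        auto end = Clock::now();
        samples.push_back(std::chrono::duration<double, std::milli>(end - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];  // upper median when the count is even
}
```

A call such as `medianRunMs([&] { engine->generate(prompt, config); })` would time the same generation call used above. Dividing the number of generated tokens by the result in seconds gives a tokens-per-second figure comparable to the decode column.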

Released under the MIT License.