Skip to content

Memory Model

Memory layout and management in Tiny-LLM.

Overview

Tiny-LLM uses a carefully designed memory model to minimize GPU memory usage while maintaining high throughput.


Weight Memory

Quantized Weight Layout

Weights are stored in W8A16 format:

Memory Calculation

For a 7B model with hidden_dim=4096, intermediate_dim=11008, num_layers=32:

ComponentShapeINT8 SizeFP16 SizeSavings
Attention QKV[3, 4096, 4096]150 MB300 MB50%
Attention Out[4096, 4096]16 MB32 MB50%
FFN (gate+up+down)[11008×2+4096, 4096]264 MB528 MB50%
Per Layer430 MB860 MB50%
Total (32 layers)13.8 GB27.5 GB50%

KV Cache Memory

Cache Structure

Cache Memory Calculation

For LLaMA-7B (32 layers, 32 heads, 128 head_dim):

Context LengthBatch=1Batch=4Batch=8
512256 MB1.0 GB2.0 GB
1024512 MB2.0 GB4.0 GB
20481.0 GB4.0 GB8.0 GB
40962.0 GB8.0 GB16.0 GB

Activation Buffers

Buffer Types

cpp
struct ActivationBuffers {
    // Hidden states
    float* hidden;      // [batch, seq, hidden_dim]

    // Attention intermediate
    float* qkv;         // [batch, seq, 3, hidden_dim]
    float* attn_out;    // [batch, seq, hidden_dim]

    // FFN intermediate
    float* ffn_inter;   // [batch, seq, intermediate_dim]

    // Output
    float* logits;      // [batch, seq, vocab_size]
};

Buffer Memory (7B Model, B=1, S=2048)

BufferShapeSize
Hidden States[1, 2048, 4096]16 MB
QKV[1, 2048, 12288]48 MB
FFN Intermediate[1, 2048, 11008]44 MB
Logits[1, 2048, 32000]250 MB
Total~360 MB

Memory Management

RAII Pattern

All GPU memory uses RAII for automatic cleanup:

cpp
template<typename T>
class CudaBuffer {
    T* ptr_ = nullptr;
    size_t size_ = 0;

public:
    CudaBuffer(size_t size) : size_(size) {
        cudaMalloc(&ptr_, size * sizeof(T));
    }

    ~CudaBuffer() {
        if (ptr_) cudaFree(ptr_);
    }

    // Move-only (no copying)
    CudaBuffer(const CudaBuffer&) = delete;
    CudaBuffer& operator=(const CudaBuffer&) = delete;
    CudaBuffer(CudaBuffer&&) noexcept;
    CudaBuffer& operator=(CudaBuffer&&) noexcept;
};

Stream-Ordered Allocation

cpp
// Use CUDA streams for async allocation
cudaStream_t stream;
cudaStreamCreate(&stream);

// Stream-ordered malloc
cudaMallocAsync(&ptr, size, stream);

// Stream-ordered free
cudaFreeAsync(ptr, stream);

Memory Optimization Strategies

1. Weight Quantization

cpp
// Load INT8 weights instead of FP16
ModelConfig config;
config.quantization = QuantizationType::W8A16;
config.group_size = 128;  // Per-group scales

2. KV Cache Optimization

cpp
// Limit cache size
KVCacheConfig cache_config;
cache_config.max_batch_size = 1;
cache_config.max_seq_len = 2048;  // Match your needs

3. Activation Recomputation

Trade compute for memory by recomputing activations:

cpp
config.recompute_activations = true;
// Reduces activation memory by ~50%

4. Gradient Checkpointing

For training scenarios:

cpp
config.gradient_checkpointing = true;
// Trade compute for memory during backward pass

Memory Debugging

Tracking Allocations

cpp
// Enable memory tracking
#define CUDA_MEMORY_TRACKING 1

// Get allocation info
size_t free, total;
cudaMemGetInfo(&free, &total);
std::cout << "GPU Memory: " << (total - free) / 1024 / 1024
          << " MB / " << total / 1024 / 1024 << " MB" << std::endl;

Memory Leaks

bash
# Check for leaks
cuda-memcheck --tool memcheck ./build/bin/tinyllm-bench

# Or use compute sanitizer
compute-sanitizer --tool memcheck ./build/bin/tinyllm-bench

Next Steps

Released under the MIT License.