Architecture Overview
Tiny-LLM centers on a narrow CUDA/C++ runtime: a loader for its supported binary model format, explicit KV cache management, W8A16 quantized layers (8-bit weights, 16-bit activations), and host-side `Result<T>` error propagation.
System Sketch
Core Components
| Component | Responsibility |
|---|---|
| `InferenceEngine` | Public runtime entry point for loading supported binary models and generating token IDs |
| `ModelLoader` | Host-side runtime weight loading for the supported binary format |
| `GGUFParser` | GGUF parsing, metadata extraction, and tensor inspection |
| `TransformerLayer` | W8A16 attention + FFN execution |
| `KVCacheManager` | Pre-allocated cache slots and sequence lifecycle |
| `Result<T>` | Fallible host-side API boundary |
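The table lists `Result<T>` as the fallible host-side API boundary but does not show its definition. The block below is a minimal sketch of what such a type could look like, not the project's actual implementation: the `Error` struct, the `ok`/`err` factories, the accessor names, and the `validate_max_new_tokens` helper are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Hypothetical error payload; the real type may carry error codes or more context.
struct Error {
    std::string message;
};

// Sketch of a Result<T>: holds either a T (index 0) or an Error (index 1).
template <typename T>
class Result {
public:
    static Result ok(T value) {
        return Result(Storage(std::in_place_index<0>, std::move(value)));
    }
    static Result err(std::string message) {
        return Result(Storage(std::in_place_index<1>, Error{std::move(message)}));
    }

    bool is_ok() const { return data_.index() == 0; }

    // Precondition: is_ok().
    const T& value() const {
        assert(is_ok());
        return std::get<0>(data_);
    }

    // Precondition: !is_ok().
    const Error& error() const {
        assert(!is_ok());
        return std::get<1>(data_);
    }

private:
    using Storage = std::variant<T, Error>;
    explicit Result(Storage data) : data_(std::move(data)) {}
    Storage data_;
};

// Hypothetical host-side check that reports failure through Result<int>
// instead of throwing.
Result<int> validate_max_new_tokens(int requested) {
    if (requested <= 0) {
        return Result<int>::err("max_new_tokens must be positive");
    }
    return Result<int>::ok(requested);
}
```

With a type like this, callers are expected to test `is_ok()` before reading `value()`, which keeps failures explicit at the host boundary rather than relying on exceptions.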
Loading Boundary
- Runtime loading: `InferenceEngine::load("model.bin", config)`
- GGUF tooling: `GGUFParser` for parse/inspect/validate workflows
- Not supported: direct `.gguf` runtime loading through `InferenceEngine::load()`
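One way to picture this boundary is a small routing check in front of the runtime loader. The sketch below is an assumption, not the engine's code: `LoadRoute`, `lower_extension`, and `classify_model_path` are hypothetical names, and the real loader may well validate file contents (for example a header or magic value) rather than the file extension.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical classification of a model path at the loading boundary.
enum class LoadRoute {
    RuntimeBinary,    // candidate for the runtime load path
    GgufToolingOnly   // parse/inspect/validate only, never runtime-loaded
};

// Returns the lower-cased extension of the final path component,
// including the dot, or an empty string if there is none.
std::string lower_extension(const std::string& path) {
    const auto slash = path.find_last_of("/\\");
    const auto dot = path.find_last_of('.');
    if (dot == std::string::npos) return "";
    // A dot that belongs to a directory name is not a file extension.
    if (slash != std::string::npos && dot < slash) return "";
    std::string ext = path.substr(dot);
    std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return ext;
}

// Assumption: routing by extension alone. Anything that is not .gguf is
// treated as a runtime candidate and would still need format validation.
LoadRoute classify_model_path(const std::string& path) {
    return lower_extension(path) == ".gguf" ? LoadRoute::GgufToolingOnly
                                            : LoadRoute::RuntimeBinary;
}
```

A caller could use the result to return an error for `.gguf` inputs before any weights are read, which keeps the unsupported path from failing later and less clearly.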
Generation Flow
- Load weights through the supported binary runtime path.
- Feed prompt token IDs into `generate()`.
- Run prefill across the Transformer stack.
- Reuse KV cache state during decode.
- Sample the next token ID from logits until EOS or `max_new_tokens`.
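The prefill and decode steps above can be sketched as a host-side loop. This is an illustration under stated assumptions rather than the engine's `generate()`: `StepFn`, `argmax`, and `generate_greedy` are hypothetical names, the step callback stands in for a full pass through the Transformer stack that also appends to the KV cache, and greedy argmax is used as the sampling rule for simplicity (the real sampler may differ).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

// Hypothetical forward step: consumes one token at a sequence position and
// returns logits over the vocabulary. It is assumed to store K/V for that
// position in a cache, so each decode call only processes the new token.
using StepFn = std::function<std::vector<float>(int token, std::size_t position)>;

// Greedy sampling: index of the largest logit.
int argmax(const std::vector<float>& logits) {
    assert(!logits.empty());
    return static_cast<int>(std::distance(
        logits.begin(), std::max_element(logits.begin(), logits.end())));
}

// Returns only the newly generated token IDs (the prompt is not echoed).
std::vector<int> generate_greedy(const std::vector<int>& prompt,
                                 std::size_t max_new_tokens,
                                 int eos_id,
                                 const StepFn& step) {
    assert(!prompt.empty());
    std::vector<int> generated;
    std::vector<float> logits;
    std::size_t position = 0;

    // Prefill: run every prompt token; only the last logits are needed.
    for (int token : prompt) {
        logits = step(token, position++);
    }

    // Decode: one token per iteration until EOS or max_new_tokens.
    for (std::size_t i = 0; i < max_new_tokens; ++i) {
        const int next = argmax(logits);
        if (next == eos_id) break;
        generated.push_back(next);
        // Skip the forward pass when no further token will be sampled.
        if (i + 1 < max_new_tokens) {
            logits = step(next, position++);
        }
    }
    return generated;
}
```

The sketch works on token IDs only, in line with the public examples, and leaves tokenization outside the loop.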
Notes
- Public examples should use token IDs, not an in-repo tokenizer API.
- Public docs should describe the binary runtime path and GGUF parser separately.