Architecture Overview

Tiny-LLM centers on a narrow CUDA/C++ runtime: a loader for the supported binary model format, explicit KV cache management, W8A16 quantized layers (8-bit weights, 16-bit activations), and host-side Result<T> error propagation.

Core Components

Component	Responsibility
`InferenceEngine`	Public runtime entry point for loading supported binary models and generating token IDs
`ModelLoader`	Host-side runtime weight loading for the supported binary format
`GGUFParser`	GGUF parsing, metadata extraction, and tensor inspection
`TransformerLayer`	W8A16 attention + FFN execution
`KVCacheManager`	Pre-allocated cache slots and sequence lifecycle
`Result<T>`	Fallible host-side API boundary
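
The sketch below shows how the last two rows could fit together: a pool of pre-allocated slots whose operations report failure through a value-or-error type instead of throwing. It is a minimal illustration, not the project's code. The shape of `Result<T>` shown here, the `SlotPool` class, its `acquire()` and `release()` methods, and the error strings are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-in for a Result<T>: holds either a value or an error message.
template <typename T>
class Result {
public:
    static Result ok(T value) { return Result(std::move(value)); }
    static Result err(std::string message) { return Result(Error{std::move(message)}); }

    bool is_ok() const { return std::holds_alternative<T>(data_); }

    // Callers must check is_ok() before reading the value or the error.
    const T& value() const { assert(is_ok()); return std::get<T>(data_); }
    const std::string& error() const { assert(!is_ok()); return std::get<Error>(data_).message; }

private:
    struct Error { std::string message; };
    explicit Result(T value) : data_(std::move(value)) {}
    explicit Result(Error error) : data_(std::move(error)) {}
    std::variant<T, Error> data_;
};

// Hypothetical fixed-capacity pool: every slot exists up front, and
// sequences borrow a slot when they start and return it when they finish.
class SlotPool {
public:
    explicit SlotPool(std::size_t num_slots) : in_use_(num_slots, false) {}

    // Claims the lowest-numbered free slot, or reports exhaustion as an error.
    Result<std::size_t> acquire() {
        for (std::size_t i = 0; i < in_use_.size(); ++i) {
            if (!in_use_[i]) {
                in_use_[i] = true;
                return Result<std::size_t>::ok(i);
            }
        }
        return Result<std::size_t>::err("no free KV cache slot");
    }

    // Returns a slot to the pool and echoes its index; releasing an
    // out-of-range or idle slot is reported as an error.
    Result<std::size_t> release(std::size_t slot) {
        if (slot >= in_use_.size() || !in_use_[slot]) {
            return Result<std::size_t>::err("slot is not in use");
        }
        in_use_[slot] = false;
        return Result<std::size_t>::ok(slot);
    }

private:
    std::vector<bool> in_use_;
};
```

Because capacity is fixed at construction, running out of slots is an ordinary outcome that the caller handles by checking `is_ok()`, which is the kind of host-side boundary the `Result<T>` row describes.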

Loading Boundary

Runtime loading: InferenceEngine::load("model.bin", config)
GGUF tooling: GGUFParser for parse/inspect/validate workflows
Not supported: direct .gguf runtime loading through InferenceEngine::load()
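
One way to picture this boundary is a small routing check that decides which path a file belongs to before any weights are read. The sketch below is illustrative only: `LoadRoute`, `has_gguf_extension`, and `route_for` are hypothetical names, and deciding by file extension is an assumption of this example rather than a description of how `InferenceEngine::load()` validates its input.

```cpp
#include <string_view>

// Hypothetical labels for the two paths described above.
enum class LoadRoute { RuntimeBinary, GgufToolingOnly };

// True when the path ends with ".gguf" (case-sensitive, for simplicity).
constexpr bool has_gguf_extension(std::string_view path) {
    constexpr std::string_view ext = ".gguf";
    return path.size() >= ext.size() &&
           path.substr(path.size() - ext.size()) == ext;
}

// GGUF files go to the parser tooling; everything else is treated as a
// candidate for the binary runtime loader.
constexpr LoadRoute route_for(std::string_view path) {
    return has_gguf_extension(path) ? LoadRoute::GgufToolingOnly
                                    : LoadRoute::RuntimeBinary;
}
```

Under these assumptions, `route_for("model.bin")` yields `LoadRoute::RuntimeBinary` and `route_for("model.gguf")` yields `LoadRoute::GgufToolingOnly`, so a caller can reject a `.gguf` path with a clear error before the runtime loader is invoked.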

Generation Flow

Load weights through the supported binary runtime path.
Feed prompt token IDs into generate().
Run prefill across the Transformer stack.
During decode, reuse the cached KV state so that each step processes only the newest token.
Sample the next token ID from the logits, repeating until EOS is produced or max_new_tokens is reached.
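
The prefill and decode steps above can be sketched as a host-side loop. This is a simplified illustration under stated assumptions, not the engine's implementation: `ForwardFn` is a hypothetical callable that consumes tokens, extends its own cached state, and returns logits for the last position; sampling is greedy (argmax); and the EOS token is not appended to the output. The names `generate_sketch` and `argmax` are invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

using TokenId = int;
using Logits = std::vector<float>;

// Hypothetical model interface: processes the given tokens on top of whatever
// it has already cached and returns logits for the final position.
using ForwardFn = std::function<Logits(const std::vector<TokenId>&)>;

// Greedy sampling: the index of the largest logit is the next token ID.
TokenId argmax(const Logits& logits) {
    assert(!logits.empty());
    return static_cast<TokenId>(
        std::distance(logits.begin(),
                      std::max_element(logits.begin(), logits.end())));
}

std::vector<TokenId> generate_sketch(const ForwardFn& forward,
                                     const std::vector<TokenId>& prompt,
                                     TokenId eos,
                                     std::size_t max_new_tokens) {
    std::vector<TokenId> out;
    if (prompt.empty() || max_new_tokens == 0) return out;

    // Prefill: a single pass over the whole prompt fills the cache.
    Logits logits = forward(prompt);

    while (out.size() < max_new_tokens) {
        const TokenId next = argmax(logits);
        if (next == eos) break;                    // stop without emitting EOS
        out.push_back(next);
        if (out.size() == max_new_tokens) break;   // budget reached, skip extra pass

        // Decode: feed only the new token; earlier positions come from the cache.
        logits = forward(std::vector<TokenId>{next});
    }
    return out;
}
```

The point of the split is visible in the call pattern: the prompt is passed once in full, and every later call carries a single token. Input and output are plain token IDs, so no tokenizer is involved.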

Notes

Public examples should use token IDs, not an in-repo tokenizer API.
Public docs should describe the binary runtime path and GGUF parser separately.

Next Steps