Tiny-LLM

CUDA C++ Inference, Kept Small

Focused Transformer inference engine with W8A16 kernels, explicit KV cache management, and a deliberately small repository surface.

⚡

W8A16 Runtime

INT8 weights with FP16 activations for a compact CUDA inference path.
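As a rough illustration of what W8A16 means numerically, the sketch below is a host-side reference for a matrix-vector product with INT8 weights. It is not the engine's kernel: the `QuantizedMatrix` layout, the per-row scale, and the function name are assumptions made for this example, and `float` stands in for FP16 so the code compiles without CUDA headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: row-major INT8 weights with one scale per output row.
struct QuantizedMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<std::int8_t> weights;  // rows * cols entries
    std::vector<float> scales;         // rows entries
};

// Computes y[r] = scales[r] * sum_c(weights[r][c] * x[c]).
// float stands in for the FP16 activation type in this CPU reference.
// Returns an empty vector when the shapes do not match.
std::vector<float> w8a16_matvec_reference(const QuantizedMatrix& m,
                                          const std::vector<float>& x) {
    if (x.size() != m.cols ||
        m.weights.size() != m.rows * m.cols ||
        m.scales.size() != m.rows) {
        return {};
    }
    std::vector<float> y(m.rows, 0.0f);
    for (std::size_t r = 0; r < m.rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < m.cols; ++c) {
            // Widen the INT8 weight to the activation type before multiplying.
            acc += static_cast<float>(m.weights[r * m.cols + c]) * x[c];
        }
        y[r] = acc * m.scales[r];  // apply the dequantization scale once per row
    }
    return y;
}
```

Applying the scale once per row, after accumulation, keeps the inner loop free of extra multiplications. A CPU reference like this can also serve as a comparison baseline when checking a GPU kernel.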

📦

Runtime loading stays on the supported binary path, while GGUF parsing remains available for inspection and validation.
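For a sense of what inspection-level GGUF parsing starts with, the sketch below checks the four-byte `GGUF` magic and reads the little-endian 32-bit version that follows it. The function name and the in-memory byte buffer are assumptions for this example, not the project's parser.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: validates the GGUF magic and extracts the version.
// Returns false if the buffer is too short or does not start with "GGUF".
bool read_gguf_version(const std::vector<std::uint8_t>& bytes,
                       std::uint32_t& version_out) {
    if (bytes.size() < 8) {
        return false;  // need 4 magic bytes plus a 4-byte version
    }
    if (bytes[0] != 'G' || bytes[1] != 'G' || bytes[2] != 'U' || bytes[3] != 'F') {
        return false;
    }
    // Assemble the version byte by byte so the result does not depend on host endianness.
    version_out = static_cast<std::uint32_t>(bytes[4]) |
                  (static_cast<std::uint32_t>(bytes[5]) << 8) |
                  (static_cast<std::uint32_t>(bytes[6]) << 16) |
                  (static_cast<std::uint32_t>(bytes[7]) << 24);
    return true;
}
```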

🧠

Pre-allocated sequence slots keep autoregressive decoding predictable.
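The idea of pre-allocated sequence slots can be sketched with a small bookkeeping class. Everything here is hypothetical (the `SlotPool` name, its methods, and the fixed per-slot token budget) and it tracks only slot ownership and lengths; a real cache would also reserve the key and value buffers once, up front, sized for every slot.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical slot pool: a fixed number of sequence slots, each with a fixed
// token capacity, so nothing is allocated while decoding.
class SlotPool {
public:
    SlotPool(std::size_t num_slots, std::size_t max_tokens)
        : max_tokens_(max_tokens), lengths_(num_slots, 0), in_use_(num_slots, false) {}

    // Claims the lowest free slot. Returns false when every slot is taken.
    bool acquire(std::size_t& slot_out) {
        for (std::size_t i = 0; i < in_use_.size(); ++i) {
            if (!in_use_[i]) {
                in_use_[i] = true;
                lengths_[i] = 0;
                slot_out = i;
                return true;
            }
        }
        return false;
    }

    // Records one more cached token. Fails if the slot is invalid or full.
    bool append_token(std::size_t slot) {
        if (slot >= in_use_.size() || !in_use_[slot] || lengths_[slot] >= max_tokens_) {
            return false;
        }
        ++lengths_[slot];
        return true;
    }

    // Frees the slot so a later sequence can reuse it.
    void release(std::size_t slot) {
        if (slot < in_use_.size()) {
            in_use_[slot] = false;
            lengths_[slot] = 0;
        }
    }

    std::size_t length(std::size_t slot) const {
        return slot < lengths_.size() ? lengths_[slot] : 0;
    }

private:
    std::size_t max_tokens_;
    std::vector<std::size_t> lengths_;
    std::vector<bool> in_use_;
};
```

Because capacity is fixed at construction, running out of slots or tokens surfaces as an explicit `false` rather than as a mid-generation allocation.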

🛡️

Host-side fallible operations return explicit results instead of hiding failures.
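One common way to make host-side failures explicit is a small result type that carries either a value or an error message. The sketch below is a minimal illustration and not the project's actual API: `Result`, `parse_layer_count`, and the error strings are all invented for this example, and `T` is assumed to be default-constructible.

```cpp
#include <cctype>
#include <string>
#include <utility>

// Hypothetical result type: holds a value on success or a message on failure.
template <typename T>
class Result {
public:
    static Result ok(T value) {
        Result r;
        r.ok_ = true;
        r.value_ = std::move(value);
        return r;
    }
    static Result err(std::string message) {
        Result r;
        r.error_ = std::move(message);
        return r;
    }

    bool is_ok() const { return ok_; }
    const T& value() const { return value_; }            // meaningful only if is_ok()
    const std::string& error() const { return error_; }  // meaningful only if !is_ok()

private:
    Result() = default;
    bool ok_ = false;
    T value_{};
    std::string error_;
};

// Hypothetical fallible operation: parses a small non-negative integer and
// reports every failure through the return value instead of throwing.
Result<int> parse_layer_count(const std::string& text) {
    if (text.empty()) {
        return Result<int>::err("empty input");
    }
    if (text.size() > 6) {
        return Result<int>::err("value too large");
    }
    int value = 0;
    for (char ch : text) {
        if (!std::isdigit(static_cast<unsigned char>(ch))) {
            return Result<int>::err("not a number");
        }
        value = value * 10 + (ch - '0');
    }
    return Result<int>::ok(value);
}
```

Callers check `is_ok()` before reading `value()`, which keeps the failure path visible at every call site.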

🧪

GoogleTest and RapidCheck cover the core loader, cache, and generation paths.

📚

The Pages site documents the engine itself rather than duplicating process and changelog scaffolding.