Troubleshooting
Common issues and solutions for Tiny-LLM.
Build Issues
CUDA not found
Error: Could not find CUDA or nvcc not found
Solutions:
# Check CUDA installation
nvcc --version
# Set CUDA path explicitly
cmake .. -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-12.2
# Or add to PATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
CMake version too old
Error: CMake 3.18 or higher is required
Solutions:
# Using pip
pip install --upgrade cmake
# Using snap (Ubuntu)
sudo snap install cmake --classic
# Build from source
curl -L https://cmake.org/files/v3.28/cmake-3.28.0.tar.gz | tar xz
cd cmake-3.28.0 && ./bootstrap && make && sudo make install
C++17 not supported
Error: error: 'auto' in lambda parameter not supported
Solutions:
# Check compiler version
gcc --version # Should be 9+
clang --version # Should be 10+
# Specify compiler
cmake .. -DCMAKE_CXX_COMPILER=g++-11
# Or use environment variable
CC=gcc-11 CXX=g++-11 cmake ..
CUDA architecture mismatch
Error: No kernel image is available for execution on the device
Solutions:
# Check your GPU compute capability
nvidia-smi --query-gpu=compute_cap --format=csv
# Build for your specific architecture
cmake .. -DCUDA_ARCH="80" # For SM 8.0 (A100)
cmake .. -DCUDA_ARCH="86" # For SM 8.6 (RTX 3090)
cmake .. -DCUDA_ARCH="89" # For SM 8.9 (RTX 4090)
# Or use native detection
cmake .. -DCUDA_ARCH="native"
Runtime Issues
CUDA out of memory
Error: CUDA out of memory or cudaErrorMemoryAllocation
Solutions:
Reduce batch size
cache_config.max_batch_size = 1; // Reduce from 4
Reduce sequence length
config.max_seq_len = 1024; // Reduce from 2048
Monitor memory
size_t free, total;
cudaMemGetInfo(&free, &total);
std::cout << "Free: " << free / 1024 / 1024 << " MB" << std::endl;
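To see why lowering the batch size or sequence length relieves memory pressure, the following sketch estimates the KV cache footprint. It assumes a plain multi-head attention layout in which every layer stores one key row and one value row of `hidden_dim` elements per token, in FP16. The `KVCacheShape` struct, the helper names, and the example dimensions are hypothetical and are not taken from Tiny-LLM's headers.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical description of the cache shape; these field names are
// illustrative and are not Tiny-LLM's actual types.
struct KVCacheShape {
    std::uint64_t num_layers;
    std::uint64_t max_batch_size;
    std::uint64_t max_seq_len;
    std::uint64_t hidden_dim;      // assumes K and V rows are hidden_dim wide
    std::uint64_t bytes_per_elem;  // 2 for FP16
};

// Bytes for keys plus values across all layers (hence the factor of 2).
constexpr std::uint64_t kv_cache_bytes(const KVCacheShape& s) {
    return 2 * s.num_layers * s.max_batch_size * s.max_seq_len *
           s.hidden_dim * s.bytes_per_elem;
}

// Prints the estimate in MiB for a given shape.
inline void print_kv_cache_estimate(const char* label, const KVCacheShape& s) {
    std::cout << label << ": " << kv_cache_bytes(s) / (1024 * 1024) << " MiB\n";
}
```

For a hypothetical 32-layer model with `hidden_dim` 4096, going from batch size 4 and sequence length 2048 to batch size 1 and sequence length 1024 shrinks the estimate from 4096 MiB to 512 MiB, since the cache scales linearly in both settings.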
Illegal memory access
Error: an illegal memory access was encountered
Possible causes:
- Incorrect model file format
- Dimension mismatch between model and config
- Uninitialized memory
Solutions:
Enable debug mode
cmake .. -DCMAKE_BUILD_TYPE=Debug
CUDA_LAUNCH_BLOCKING=1 ./tiny_llm_demo
Run with cuda-memcheck or compute-sanitizer
cuda-memcheck ./tiny_llm_demo # Legacy tool, removed in CUDA 12
compute-sanitizer ./tiny_llm_demo # Replacement in current toolkits
Verify model dimensions
std::cout << "Config: " << config.hidden_dim << " x " << config.num_layers << std::endl;
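Printing the config helps, but comparing it against the dimensions stored in the model file catches a mismatch before any kernel indexes out of bounds. The sketch below shows one way to do that comparison. `ModelDims`, `dims_match`, and `print_dims` are hypothetical names; only the field names come from this guide, and how the file-side values are obtained depends on the model format.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical struct holding the dimensions named in this guide.
struct ModelDims {
    std::int64_t vocab_size;
    std::int64_t hidden_dim;
    std::int64_t intermediate_dim;
    std::int64_t num_layers;
};

// True only if every dimension is positive and identical on both sides.
constexpr bool dims_match(const ModelDims& config, const ModelDims& file) {
    return config.vocab_size > 0 && config.hidden_dim > 0 &&
           config.intermediate_dim > 0 && config.num_layers > 0 &&
           config.vocab_size == file.vocab_size &&
           config.hidden_dim == file.hidden_dim &&
           config.intermediate_dim == file.intermediate_dim &&
           config.num_layers == file.num_layers;
}

// Prints one side of the comparison so a mismatch is easy to spot in logs.
inline void print_dims(const char* label, const ModelDims& d,
                       std::ostream& os = std::cout) {
    os << label << ": vocab_size=" << d.vocab_size
       << " hidden_dim=" << d.hidden_dim
       << " intermediate_dim=" << d.intermediate_dim
       << " num_layers=" << d.num_layers << '\n';
}
```

Calling `dims_match` right after loading, and printing both sides with `print_dims` when it returns false, turns a hard-to-trace illegal memory access into an immediate, readable failure.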
Slow generation speed
Possible causes:
- Debug build
- Not using W8A16 quantization
- Incorrect CUDA architecture
Solutions:
Use Release build
cmake .. -DCMAKE_BUILD_TYPE=Release
Verify GPU utilization
watch -n 1 nvidia-smi
Profile the application
nsys profile -o profile ./tiny_llm_demo
nsys-ui profile.qdrep # Recent Nsight Systems releases write profile.nsys-rep instead
Performance Issues
Low GPU utilization
Symptom: GPU utilization < 50%
Solutions:
- Increase batch size
- Check for memory-bandwidth-bound operations
- Profile kernels with Nsight Compute
Memory bandwidth bottleneck
Symptom: Decode phase slower than expected
Cause: Attention decode is memory-bandwidth bound
Solutions:
- Use a GPU with higher memory bandwidth
- Reduce KV cache size (smaller batch/seq_len)
- Enable flash attention (if available)
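A rough way to judge whether decode is "slower than expected" is to compute the bandwidth floor: each decode step has to stream the weights and the KV cache from device memory at least once, so the step cannot finish faster than total bytes divided by memory bandwidth. The sketch below encodes that estimate. The function names and the example numbers are assumptions for illustration, and real kernels will land somewhat above this floor.

```cpp
// Lower bound on per-step decode latency in milliseconds, assuming each
// weight byte and each KV cache byte is read from device memory once per step.
constexpr double min_decode_ms(double weight_bytes, double kv_bytes,
                               double bandwidth_bytes_per_sec) {
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_sec * 1000.0;
}

// Upper bound on decode steps per second implied by the same assumption.
constexpr double max_steps_per_sec(double weight_bytes, double kv_bytes,
                                   double bandwidth_bytes_per_sec) {
    return bandwidth_bytes_per_sec / (weight_bytes + kv_bytes);
}
```

As a worked example with hypothetical figures: 7e9 bytes of 8-bit weights plus 0.5e9 bytes of KV cache on a device with 1e12 bytes/s of bandwidth gives a floor of about 7.5 ms per step, or roughly 133 steps per second. If measured latency is close to this floor, only a smaller KV cache, smaller weights, or more bandwidth will help.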
Model Loading Issues
Invalid model file
Error: Failed to load model: invalid format
Checklist:
- [ ] File exists and is readable
- [ ] Magic number matches (first 4 bytes)
- [ ] Version is supported
- [ ] Dimensions match config
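The first three checklist items can be automated with a small header probe. The sketch below assumes a layout of a 4-byte magic number followed by a 4-byte version, both little-endian, read on a little-endian host. Only the "first 4 bytes" position of the magic number comes from the checklist above; the version offset, the constant values `kExpectedMagic` and `kMaxSupportedVersion`, and all type and function names are placeholders to replace with the project's real ones.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical header layout: 4-byte magic, then 4-byte version.
struct FileHeader {
    std::uint32_t magic;
    std::uint32_t version;
};

constexpr std::uint32_t kExpectedMagic = 0x4D4C4C54;  // placeholder value
constexpr std::uint32_t kMaxSupportedVersion = 1;     // placeholder value

enum class HeaderStatus { Ok, BadMagic, UnsupportedVersion };

// Checks the magic number first, then the version range [1, max].
constexpr HeaderStatus check_header(const FileHeader& h) {
    return h.magic != kExpectedMagic
               ? HeaderStatus::BadMagic
               : ((h.version == 0 || h.version > kMaxSupportedVersion)
                      ? HeaderStatus::UnsupportedVersion
                      : HeaderStatus::Ok);
}

// Reads the first 8 bytes of the file. Returns false if the file cannot be
// opened or is shorter than the header.
inline bool read_header(const std::string& path, FileHeader& out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    in.read(reinterpret_cast<char*>(&out), sizeof(out));
    return in.gcount() == static_cast<std::streamsize>(sizeof(out));
}
```

A `false` from `read_header` corresponds to the "file exists and is readable" item, while `check_header` separates a wrong magic number from an unsupported version, which makes the resulting error message more specific than a generic "invalid format".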
Dimension mismatch
Error: Weight dimension mismatch
Solutions:
// Verify config
std::cout << "vocab_size: " << config.vocab_size << std::endl;
std::cout << "hidden_dim: " << config.hidden_dim << std::endl;
std::cout << "intermediate_dim: " << config.intermediate_dim << std::endl;
Getting Help
Debug Information to Include
When reporting issues, please provide:
System info
nvidia-smi
nvcc --version
cmake --version
Build output
cmake .. 2>&1 | tee cmake.log
make VERBOSE=1 2>&1 | tee build.log
Runtime error
CUDA_LAUNCH_BLOCKING=1 ./tiny_llm_demo 2>&1 | tee runtime.log