# Troubleshooting Guide
This guide covers common issues and solutions when building and using HPC-AI-Optimization-Lab.
## Table of Contents

- Build Issues
- Runtime Issues
- Performance Issues
- Python Binding Issues
- CUDA Errors
- Debugging Tips
- Getting Help
- FAQ
## Build Issues

### CMake Configuration Errors
#### "Could not find CUDA"

**Symptoms:**

```
CMake Error: Could not find CUDA
```

**Solutions:**

- Verify the CUDA Toolkit is installed:

  ```bash
  nvcc --version  # Should show CUDA 12.4+
  ```

- Set the CUDA path explicitly:

  ```bash
  export CUDA_HOME=/usr/local/cuda
  cmake -S . -B build -DCUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME
  ```

- Check that PATH includes CUDA:

  ```bash
  echo $PATH  # Should include /usr/local/cuda/bin
  ```
#### "CMake version too old"

**Symptoms:**

```
CMake Error at CMakeLists.txt (cmake_minimum_required):
  CMake 3.24 or higher is required
```

**Solutions:**

- Install CMake 3.24+:

  ```bash
  # Ubuntu/Debian
  wget https://github.com/Kitware/CMake/releases/download/v3.28.0/cmake-3.28.0-linux-x86_64.sh
  chmod +x cmake-*.sh
  sudo ./cmake-*.sh --prefix=/usr/local
  ```

- Or use pip:

  ```bash
  pip install cmake --upgrade
  ```
#### "Compiler doesn't support C++20"

**Symptoms:**

```
error: unrecognized command line option '-std=c++20'
```

**Solutions:**

- Upgrade GCC to 11+:

  ```bash
  # Ubuntu 22.04+
  sudo apt install g++-11
  export CXX=g++-11
  ```

- Or use Clang 14+:

  ```bash
  sudo apt install clang-14
  export CXX=clang++-14
  ```
### Compilation Errors
#### "Tensor Core requires SM 7.0+"

**Symptoms:**

```
error: identifier "wmma::load_matrix_sync" is undefined
```

**Solutions:**

- Specify the GPU architecture explicitly:

  ```bash
  cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES="80;90"
  ```

- Or check your GPU:

  ```bash
  nvidia-smi --query-gpu=compute_cap --format=csv
  ```
#### "Shared memory size exceeds limit"

**Symptoms:**

```
error: shared memory array size exceeds maximum
```

**Solutions:**

- Reduce the tile size in the kernel configuration
- Use dynamic shared memory, sized at launch time:

  ```cpp
  extern __shared__ float smem[];  // size is the third launch parameter: kernel<<<grid, block, smem_bytes>>>
  ```

- Check the GPU's shared memory limit, e.g. with the CUDA `deviceQuery` sample (`nvidia-smi` does not report it):

  ```bash
  ./deviceQuery | grep -i "shared memory"
  ```
## Runtime Issues

### "Invalid device ordinal"

**Symptoms:**

```
CUDA error: invalid device ordinal
```

**Solutions:**

- Check available GPUs:

  ```bash
  nvidia-smi -L
  ```

- Set visible devices:

  ```bash
  export CUDA_VISIBLE_DEVICES=0
  ```
### "Out of memory"

**Symptoms:**

```
CUDA error: out of memory
```

**Solutions:**

- Reduce batch/tensor size
- Check GPU memory usage:

  ```bash
  nvidia-smi
  ```

- Use a memory pool (CUDA 11.2+):

  ```cpp
  cudaMemPool_t pool;
  cudaDeviceGetDefaultMemPool(&pool, 0);
  // Then allocate from the pool with cudaMallocAsync(&ptr, bytes, stream)
  // and release with cudaFreeAsync(ptr, stream)
  ```
### Kernel Launch Failures

#### "Launch out of resources"

**Solutions:**

- Reduce the block size:

  ```cpp
  // Instead of 1024 threads
  dim3 block(256);  // Use a smaller block
  ```

- Check register usage:

  ```bash
  nvcc --ptxas-options=-v kernel.cu
  ```
## Performance Issues

### Low Performance on Tensor Core Kernels

**Symptoms:** TFLOPS much lower than expected

**Solutions:**

- Ensure dimensions are multiples of 16:

  ```cpp
  // Tensor Core requires M, N, K divisible by 16
  int M_padded = ((M + 15) / 16) * 16;  // Pad to a multiple of 16
  ```

- Verify FP16 input:

  ```cpp
  // Tensor Core requires __half input
  hpc::Tensor<__half> A(M * K);  // Not float
  ```

- Check occupancy with Nsight Compute (`nvprof` does not support metric collection on recent GPUs):

  ```bash
  ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./program
  ```
### Bank Conflicts

**Symptoms:** Unexpected slowdown in shared memory operations

**Solutions:**

- Add padding to shared memory arrays:

  ```cpp
  __shared__ float tile[32][33];  // +1 column for bank conflict avoidance
  ```

- Profile with Nsight Compute:

  ```bash
  ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld ./program
  ```
## Python Binding Issues

### "No module named 'hpc_ai_opt'"

**Solutions:**

- Build with Python bindings:

  ```bash
  cmake -S . -B build -DBUILD_PYTHON_BINDINGS=ON
  cmake --build build
  ```

- Set PYTHONPATH:

  ```bash
  export PYTHONPATH="$(pwd)/build/python:${PYTHONPATH}"
  ```

- Verify:

  ```python
  import hpc_ai_opt
  print(hpc_ai_opt.__doc__)
  ```
### "PyTorch tensors must be on CUDA"

**Symptoms:**

```
ValueError: Tensors must be on CUDA device
```

**Solutions:**

Move tensors to the GPU before calling into the library:

```python
# Wrong
x = torch.randn(1024)                 # CPU tensor

# Correct
x = torch.randn(1024, device="cuda")  # GPU tensor
```
### NaN or Incorrect Results

**Solutions:**

- Check that tensor dtypes match:

  ```python
  x = torch.randn(1024, device="cuda", dtype=torch.float32)
  y = torch.empty_like(x)  # Same dtype and device
  ```

- Verify dimensions:

  ```python
  # FlashAttention requires head_dim=64
  config = {
      'head_dim': 64,  # Must be 64
      # ...
  }
  ```
## CUDA Errors

### Error Code Reference
| Error Code | Description | Common Cause |
|---|---|---|
| 1 | Invalid value | Bad parameter |
| 2 | Out of memory | GPU memory exhausted |
| 8 | Invalid device ordinal | Wrong GPU ID |
| 9 | Invalid kernel image | Architecture mismatch |
| 30 | Unknown error | Usually driver issue |
### "CUDA driver version is insufficient"

**Solutions:**

- Check the driver version:

  ```bash
  nvidia-smi  # Look for "Driver Version"
  ```

- Update the driver:

  ```bash
  # Ubuntu
  sudo apt install nvidia-driver-535
  ```

- Match the CUDA version to the minimum driver:

  | CUDA | Min Driver |
  |---|---|
  | 12.4 | 550.54+ |
  | 12.3 | 545.23+ |
  | 12.2 | 535.54+ |
### "CUDA capability not supported"

**Solutions:**

- Check the GPU architecture:

  ```bash
  nvidia-smi --query-gpu=compute_cap --format=csv
  ```

- Build for the correct architecture:

  ```bash
  # For A100 (SM 8.0)
  cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES="80"

  # For H100 (SM 9.0)
  cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES="90"
  ```
## Debugging Tips

### Enable CUDA Error Checking

```cpp
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = call;                                   \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",          \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Use after kernel launch
kernel<<<grid, block>>>(args);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());
```
### Use Compute Sanitizer

```bash
# Check for memory errors (memcheck is the default tool)
compute-sanitizer ./build/tests/gemm/test_gemm

# Check for race conditions
compute-sanitizer --tool racecheck ./program

# Check for memory leaks
compute-sanitizer --tool memcheck --leak-check full ./program
```
### Use Nsight Compute for Profiling

```bash
# Detailed kernel analysis
ncu --set full -o profile ./program

# Focus on specific metrics
ncu --metrics gpu__time_duration.sum ./program

# Lighter preset for quick comparisons between kernels
ncu --set basic ./program
```
### Use Nsight Systems for Timeline

```bash
# System-wide profiling
nsys profile -o timeline ./program

# View results in the GUI
nsys-ui timeline.nsys-rep
```
## Getting Help

If your issue isn't covered here:

- Search existing issues: GitHub Issues
- Check the documentation: Documentation
- Ask in discussions: GitHub Discussions
- Report a bug: use the Bug Report Template

When reporting, please include:

- OS and version
- CUDA version (`nvcc --version`)
- GPU model and driver (`nvidia-smi`)
- CMake configuration output
- Full error message
- Minimal reproduction code
## FAQ

**Q: Can I use this without a GPU?**

A: No. This library requires an NVIDIA GPU with Compute Capability 7.0+. All kernels execute on the GPU.

**Q: Why is my kernel slower than expected?**

A: Common reasons:

- Wrong GPU architecture (compile for your GPU)
- Non-optimal dimensions (pad to multiples of 16 for Tensor Core)
- Low occupancy (reduce register usage)
- Bank conflicts (add padding)

**Q: Does this work on Windows?**

A: Yes, with Visual Studio 2022+ and CUDA 12.4+. Use the CMake GUI or a Developer Command Prompt.

**Q: Can I use this with PyTorch?**

A: Yes! Build the Python bindings and pass PyTorch CUDA tensors directly:
```python
import torch
import hpc_ai_opt

x = torch.randn(1024, device="cuda")
y = torch.empty_like(x)
hpc_ai_opt.elementwise.relu(x, y)
```
Still stuck? Open an issue and we'll help!