Memory Pool
HTS includes a high-performance GPU memory pool that eliminates the overhead of cudaMalloc and cudaFree calls.
The Problem with cudaMalloc/cudaFree
Standard CUDA memory allocation has significant overhead:
- cudaMalloc: ~50 μs per call
- cudaFree: ~25 μs per call
- Synchronization: Can block the entire GPU
For workloads with frequent small allocations, this overhead can dominate execution time.
HTS Memory Pool Solution
HTS uses a buddy-system allocator that:
- Allocates from a single pre-allocated memory pool, so no cudaMalloc/cudaFree calls occur on the hot path
- Performs allocation in O(log n) time
- Performs deallocation in O(log n) time, including buddy coalescing
- Defragments automatically
- Adds zero synchronization overhead
Architecture
┌────────────────────────────────────────────┐
│ Memory Pool (e.g., 4 GB) │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Block 0: 1 GB (allocated) │ │
│ ├──────────────────────────────────────┤ │
│ │ Block 1: 512 MB (free) │ │
│ ├──────────────────────────────────────┤ │
│ │ Block 2: 256 MB (allocated) │ │
│ ├──────────────────────────────────────┤ │
│ │ Block 3: 256 MB (free) │ │
│ ├──────────────────────────────────────┤ │
│ │ ... │ │
│ └──────────────────────────────────────┘ │
└────────────────────────────────────────────┘
Buddy System Explained
The buddy system divides memory into blocks that are powers of 2:
Allocation
- Find the smallest block that fits the request
- If block is too large, split it in half
- Repeat until block is the right size
- Return pointer to allocated block
Deallocation
- Mark block as free
- Check if buddy (adjacent block) is also free
- If yes, merge buddies into larger block
- Repeat to reduce fragmentation
Example
Request: 300 MB
Pool: 4 GB
Step 1: Target size = 512 MB (smallest power of 2 ≥ 300 MB)
Step 2: No free 512 MB block exists, so split down: 1 GB → 512 MB + 512 MB
Step 3: Allocate the first 512 MB block
Step 4: Return pointer
Request: 128 MB
Step 1: Use remaining 512 MB from split
Step 2: Split 512 MB → 256 MB + 256 MB
Step 3: Split 256 MB → 128 MB + 128 MB
Step 4: Allocate first 128 MB
Step 5: Return pointer
Usage
Basic Usage
Memory allocation is handled automatically by HTS when tasks request GPU memory:
auto gpu_task = graph.add_task(DeviceType::GPU, "GPU_Work");
gpu_task->set_gpu_function([](TaskContext& ctx, cudaStream_t stream) {
// Request memory from pool (automatic)
void* ptr = ctx.allocate_gpu(1024 * 1024); // 1 MB
// Use memory for GPU computation
my_kernel<<<blocks, threads, 0, stream>>>(ptr);
// Memory automatically returned to pool on task completion
});
Pool Configuration
Configure the memory pool during scheduler initialization:
#include <hts/memory_pool.hpp>
MemoryPoolConfig config;
config.pool_size_mb = 4096; // 4 GB pool
config.min_block_size_kb = 4; // Minimum 4 KB blocks
config.max_block_size_mb = 1024; // Maximum 1 GB blocks
config.enable_defragmentation = true; // Enable auto defrag
scheduler.configure_memory_pool(config);
Manual Allocation
For fine-grained control:
#include <hts/memory_pool.hpp>
// Get memory pool instance
auto& pool = scheduler.get_memory_pool();
// Allocate memory
void* ptr = pool.allocate(1024 * 1024); // 1 MB
// Use memory...
// Free memory (returns to pool, not to OS)
pool.free(ptr);
Operational Characteristics
The memory pool is designed to reduce repeated allocation churn and expose runtime statistics you can inspect in your own workload.
Fragmentation
HTS monitors and manages fragmentation:
auto stats = pool.get_stats();
std::cout << "Fragmentation: " << stats.fragmentation_ratio * 100 << "%" << std::endl;
std::cout << "Total allocated: " << stats.allocated_bytes / 1024 / 1024 << " MB" << std::endl;
std::cout << "Total free: " << stats.free_bytes / 1024 / 1024 << " MB" << std::endl;
std::cout << "Largest block: " << stats.largest_free_block / 1024 / 1024 << " MB" << std::endl;
Defragmentation
When fragmentation becomes high, HTS can defragment the pool:
Automatic Defragmentation
Enabled by default, runs periodically:
config.enable_defragmentation = true;
config.defrag_threshold = 0.3; // Trigger when 30% fragmented
Manual Defragmentation
// Trigger defragmentation
pool.defragment();
// Check if defragmentation is needed
if (pool.get_stats().fragmentation_ratio > 0.3) {
pool.defragment();
}
How Defragmentation Works
- Pause new allocations briefly
- Identify allocated blocks
- Move blocks to consolidate free space
- Update pointers (handled automatically)
- Resume allocations
Note: Defragmentation may briefly pause allocation, but is optimized to minimize impact.
Best Practices
1. Size the Pool Appropriately
// Good: Size based on workload
size_t estimated_memory = num_tasks * avg_memory_per_task;
config.pool_size_mb = estimated_memory * 1.2 / 1024 / 1024; // 20% headroom
2. Use Task-Level Memory Requests
Let HTS manage memory per task rather than manual allocation:
// Recommended: Automatic
task->set_memory_requirement(256 * 1024 * 1024); // 256 MB
// Avoid: Manual (unless you need fine control)
void* ptr = pool.allocate(256 * 1024 * 1024);
3. Monitor Fragmentation
// Periodically check fragmentation
auto check_fragmentation = [&]() {
auto stats = pool.get_stats();
if (stats.fragmentation_ratio > 0.25) {
std::cout << "High fragmentation detected: "
<< stats.fragmentation_ratio * 100 << "%" << std::endl;
}
};
4. Pre-allocate for Large Tasks
For tasks that need large contiguous blocks:
// Reserve memory before task execution
void* reserved = pool.reserve(512 * 1024 * 1024); // 512 MB
task->set_preallocated_ptr(reserved);
5. Avoid Allocation in Hot Path
Don't allocate memory in performance-critical loops:
// Bad: allocation inside the loop
for (int i = 0; i < 1000; i++) {
    void* ptr = pool.allocate(size); // allocator overhead on every iteration
    kernel<<<...>>>(ptr);
    pool.free(ptr);
}
// Good: allocate once, reuse slices of the block
void* ptr = pool.allocate(size * 1000);
for (int i = 0; i < 1000; i++) {
    kernel<<<...>>>(static_cast<char*>(ptr) + i * size);
}
Troubleshooting
Out of Memory Errors
If you see HTS_ERROR_OOM:
- Increase the pool size: config.pool_size_mb = 8192;
- Enable defragmentation: config.enable_defragmentation = true;
- Reduce task memory usage: optimize kernels to use less memory
- Check for memory leaks: ensure every allocation is freed
High Fragmentation
If fragmentation > 30%:
- Trigger defragmentation manually: pool.defragment();
- Increase the pool size: more headroom reduces fragmentation
- Use larger minimum block sizes: config.min_block_size_kb = 64;
- Batch allocations: allocate larger chunks upfront
Performance Issues
If memory allocation is slow:
- Check block size configuration: Larger min blocks can speed up searches
- Use thread-local pools: Avoid contention (advanced)
- Profile allocation hotspots: Use the profiler
- Consider arena allocation: For many small objects
Next Steps
- Error Handling — Handling failures
- Scheduling — Task scheduling policies
- API Reference — Complete API documentation
- Examples — See memory pool in action