Memory Manager API

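Complete reference for the MemoryManager class: pinned host and device allocation, asynchronous transfers, pool configuration, and allocator modes.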

Class Definition

cpp
class MemoryManager {
public:
    static MemoryManager& getInstance();

    // Pinned host memory
    void* allocatePinned(size_t size);
    void freePinned(void* ptr);

    // Device memory
    void* allocateDevice(size_t size);
    void freeDevice(void* ptr);

    // Asynchronous transfers
    cudaError_t copyToDeviceAsync(void* dst, const void* src, 
                                  size_t size, cudaStream_t stream);
    cudaError_t copyToHostAsync(void* dst, const void* src, 
                                size_t size, cudaStream_t stream);

    // Pool management
    void setPinnedPoolSize(size_t size);
    void setDevicePoolSize(size_t size);

    // Allocator mode (v2)
    void setDeviceAllocatorMode(DeviceAllocatorMode mode);
    DeviceAllocatorMode getRequestedDeviceAllocatorMode() const;
    DeviceAllocatorMode getEffectiveDeviceAllocatorMode() const;
    bool supportsAsyncDeviceAllocator() const;

    // Cleanup
    void shutdown();
};

Methods

Singleton Access

cpp
MemoryManager& mm = MemoryManager::getInstance();

Pinned Memory

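Pinned (page-locked) host memory cannot be paged out, so the GPU can access it directly via DMA. Transfers are faster than from pageable memory, and pinned buffers are needed for asynchronous copies to overlap with other work: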

cpp
void* h_data = mm.allocatePinned(size);
// ... use memory ...
mm.freePinned(h_data);
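
Because every allocatePinned call needs a matching freePinned, a scope guard can make the pairing automatic. The following is a minimal sketch only: PinnedBuffer is a hypothetical helper, not part of the documented API, and it is written as a template so that it assumes nothing about the manager beyond the two methods shown above.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper, not part of the documented MemoryManager API.
// Manager is any type exposing allocatePinned(size_t) and freePinned(void*).
template <class Manager>
class PinnedBuffer {
public:
    PinnedBuffer(Manager& mm, std::size_t size)
        : mm_(&mm), ptr_(mm.allocatePinned(size)), size_(size) {}

    // Return the block to the manager when the guard goes out of scope.
    ~PinnedBuffer() {
        if (ptr_ != nullptr) mm_->freePinned(ptr_);
    }

    // Non-copyable: two guards must never free the same block.
    PinnedBuffer(const PinnedBuffer&) = delete;
    PinnedBuffer& operator=(const PinnedBuffer&) = delete;

    // Movable: ownership transfers and the source is left empty.
    PinnedBuffer(PinnedBuffer&& other) noexcept
        : mm_(other.mm_),
          ptr_(std::exchange(other.ptr_, nullptr)),
          size_(std::exchange(other.size_, 0)) {}
    PinnedBuffer& operator=(PinnedBuffer&&) = delete;

    void* get() const { return ptr_; }
    std::size_t size() const { return size_; }

private:
    Manager* mm_;
    void* ptr_;
    std::size_t size_;
};
```

With this guard, `PinnedBuffer<MemoryManager> buf(mm, size);` would release the block on every exit path, including exceptions. The guard must not outlive a call to shutdown().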

Device Memory

cpp
void* d_data = mm.allocateDevice(size);
// ... use memory ...
mm.freeDevice(d_data);

Async Transfers

cpp
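cudaStream_t stream;
cudaStreamCreate(&stream);

// Host to device: the call returns before the copy has completed
cudaError_t err = mm.copyToDeviceAsync(d_data, h_data, size, stream);
if (err != cudaSuccess) { /* handle the error */ }

// Device to host: ordered after the upload because both use the same stream
err = mm.copyToHostAsync(h_data, d_data, size, stream);
if (err != cudaSuccess) { /* handle the error */ }

// Block until both copies finish before reading h_data on the host
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);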

Pool Configuration

cpp
MemoryManager& mm = MemoryManager::getInstance();

// Configure pool sizes before first allocation
mm.setPinnedPoolSize(128 * 1024 * 1024);  // 128MB
mm.setDevicePoolSize(512 * 1024 * 1024);  // 512MB
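
Size literals such as `512 * 1024 * 1024` are evaluated in `int`, which overflows once a pool reaches 2 GiB. A small set of helpers that do the arithmetic in `std::size_t` avoids this. The names KiB, MiB and GiB below are assumptions made for illustration and are not part of the MemoryManager API.

```cpp
#include <cstddef>

// Hypothetical helpers; not part of the MemoryManager API.
// The multiplication happens in std::size_t, so large pool sizes
// do not overflow a signed int.
constexpr std::size_t KiB(std::size_t n) { return n * 1024; }
constexpr std::size_t MiB(std::size_t n) { return KiB(n) * 1024; }
constexpr std::size_t GiB(std::size_t n) { return MiB(n) * 1024; }

// The helpers agree with the literals used in the example above.
static_assert(MiB(128) == 134217728u, "128 MiB");
static_assert(MiB(512) == 536870912u, "512 MiB");
```

The configuration calls would then read `mm.setPinnedPoolSize(MiB(128));` and `mm.setDevicePoolSize(MiB(512));`.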

Thread Safety

MemoryManager is thread-safe. All methods can be called concurrently from multiple threads.

Best-Fit Allocation

The memory manager uses a best-fit allocation strategy, which keeps fragmentation low over time:

  1. Searches the free list for the smallest block that fits the request
  2. Splits the chosen block when it is larger than the request
  3. Coalesces adjacent free blocks on deallocation
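
The steps above can be sketched as a small free list over a single contiguous pool. This is an illustration of the best-fit technique only: BestFitPool, kNoBlock and the offset-based bookkeeping are assumptions, and the actual MemoryManager internals may differ.

```cpp
#include <cstddef>
#include <iterator>
#include <map>

// Returned by allocate() when no free block is large enough.
constexpr std::size_t kNoBlock = static_cast<std::size_t>(-1);

// Illustrative best-fit free list. It tracks offsets into one pool
// rather than real pointers.
class BestFitPool {
public:
    explicit BestFitPool(std::size_t capacity) {
        if (capacity > 0) free_[0] = capacity;
    }

    // Returns the offset of the allocated block, or kNoBlock.
    std::size_t allocate(std::size_t size) {
        if (size == 0) return kNoBlock;
        // Find the smallest free block that fits the request.
        auto best = free_.end();
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->second >= size &&
                (best == free_.end() || it->second < best->second)) {
                best = it;
            }
        }
        if (best == free_.end()) return kNoBlock;

        const std::size_t offset = best->first;
        const std::size_t remaining = best->second - size;
        free_.erase(best);
        // Split: the unused tail goes back on the free list.
        if (remaining > 0) free_[offset + size] = remaining;
        used_[offset] = size;
        return offset;
    }

    // Releases a block and merges it with adjacent free neighbours.
    void deallocate(std::size_t offset) {
        auto u = used_.find(offset);
        if (u == used_.end()) return;  // unknown offset: ignore
        std::size_t size = u->second;
        used_.erase(u);

        auto next = free_.lower_bound(offset);
        // Merge with the free block that starts right after this one.
        if (next != free_.end() && offset + size == next->first) {
            size += next->second;
            next = free_.erase(next);
        }
        // Merge with the free block that ends right before this one.
        if (next != free_.begin()) {
            auto prev = std::prev(next);
            if (prev->first + prev->second == offset) {
                prev->second += size;
                return;
            }
        }
        free_[offset] = size;
    }

    std::size_t freeBlockCount() const { return free_.size(); }

private:
    std::map<std::size_t, std::size_t> free_;  // offset -> size, free blocks
    std::map<std::size_t, std::size_t> used_;  // offset -> size, live blocks
};
```

The linear scan keeps the sketch short. An index of free blocks ordered by size would make the search logarithmic, at the cost of keeping two structures in sync.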

Memory Pool Architecture

┌─────────────────────────────────────────────────────────────┐
│                    MemoryManager                             │
├─────────────────────────────────────────────────────────────┤
│  Pinned Memory Pool          │  Device Memory Pool          │
│  ┌─────────────────────┐     │  ┌─────────────────────┐     │
│  │ Block 1 (256KB)     │     │  │ Block 1 (1MB)       │     │
│  │ Block 2 (512KB)     │     │  │ Block 2 (2MB)       │     │
│  │ Block 3 (1MB)       │     │  │ Block 3 (4MB)       │     │
│  │ ...                 │     │  │ ...                 │     │
│  └─────────────────────┘     │  └─────────────────────┘     │
├─────────────────────────────────────────────────────────────┤
│  Benefits:                                                   │
│  • Reuse across pipeline executions                          │
│  • Reduced allocation overhead                               │
│  • Lower fragmentation                                       │
└─────────────────────────────────────────────────────────────┘

v2 Runtime Extensions

Stream-Ordered Allocation (CUDA 11.2+)

cpp
mm.setDeviceAllocatorMode(DeviceAllocatorMode::ASYNC_STREAM_ORDERED);

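// Stream-taking overloads (not listed in the class definition above):
// the allocation and the free are ordered with other work on `stream`.
void* d_data = mm.allocateDevice(size, stream);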
mm.freeDevice(d_data, stream);

Async Memory Pools

When the driver and device support them, the manager uses CUDA's native asynchronous memory pools:

cpp
// Uses a CUDA memory pool (cudaMemPool_t) if supported
mm.setDeviceAllocatorMode(DeviceAllocatorMode::CUDA_ASYNC_POOL);
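
The class exposes both a requested and an effective allocator mode, which suggests that a request can be downgraded when the async allocator is unavailable. The sketch below shows one plausible way such a fallback could be resolved. It is an assumption, not the library's documented behaviour: the Mode enum is a stand-in for DeviceAllocatorMode, and the Pooled enumerator and the effectiveMode function are hypothetical.

```cpp
// Hypothetical stand-in for DeviceAllocatorMode. Only the two async
// enumerators appear in this reference; Mode::Pooled is an assumption.
enum class Mode { Pooled, AsyncStreamOrdered, CudaAsyncPool };

// Resolves the mode actually used from the requested mode and a
// capability flag such as the one supportsAsyncDeviceAllocator() returns.
// Any async request falls back to Pooled when async is unsupported.
constexpr Mode effectiveMode(Mode requested, bool asyncSupported) {
    return (!asyncSupported && requested != Mode::Pooled) ? Mode::Pooled
                                                          : requested;
}
```

In practice, comparing getRequestedDeviceAllocatorMode() with getEffectiveDeviceAllocatorMode() after the call shows whether the requested mode was honoured.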

Released under the MIT License.