The glory work of GPU scheduling happens in frontier data centers with hundreds of thousands of GPUs, but a lot of research is done with single-GPU jobs on modest clusters, and the scheduling there leaves much to be desired.
I wish there were a clean way to preempt GPU tasks, so that long-running tasks could be transparently paused to give higher-priority tasks the minimum time-to-results. Manual checkpointing and cooperative multitasking are an option, but they complicate codebases and are fertile ground for bugs.
It feels like most of the pieces are already present: everything on the GPU goes through page tables, Nvidia UVM (Unified Virtual Memory) allows demand paging to host memory, and MPS (Multi-Process Service) could act as a CUDA shim to force everything through a different memory allocator. Memory page thrashing would be catastrophic for GPU tasks, but the idea is coarse task switching: pause the host process of the low-priority task, let the high-priority process force out only the pages it actually needs (or none at all, if memory pressure is low enough) while it runs, then resume the low-priority task on completion and let it page everything back in. Task switching at the level of tens of seconds, not milliseconds.
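As a rough sketch of the allocator half: an LD_PRELOAD interposer (a hypothetical shim.c, not anything MPS ships today) could redirect runtime-API cudaMalloc calls to cudaMallocManaged, so every allocation becomes UVM-pageable and the driver is free to evict it to host RAM under memory pressure:

```c
/* shim.c (hypothetical): make all runtime-API allocations UVM-managed
 * so the driver can page them out to host memory under pressure.
 * This only catches cudaMalloc; driver-API cuMemAlloc and the pool
 * allocator cudaMallocAsync would need the same treatment. */
#include <cuda_runtime.h>

cudaError_t cudaMalloc(void **devPtr, size_t size) {
    /* Managed allocations participate in demand paging, so the GPU can
     * oversubscribe and the driver migrates pages on fault. */
    return cudaMallocManaged(devPtr, size, cudaMemAttachGlobal);
}
```

Something like `gcc -shared -fPIC shim.c -o shim.so -lcudart` and launching the low-priority job under `LD_PRELOAD=./shim.so` would cover the common case; interposing underneath at the driver API, where MPS actually sits, is messier, hence the hand-waving.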
Even if it didn’t handle absolutely all memory (kernel allocations and such) and carried some overhead, it would still be quite useful.
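The coordinator side could then be nothing more exotic than job-control signals. A minimal sketch, assuming the coordinator forked both jobs (waitpid only reaps children) and with the caveat that SIGSTOP only pauses the host threads, so kernels already in flight run to completion:

```c
/* Hypothetical coordinator: pause the low-priority job's host process,
 * let the high-priority job's page faults evict whatever it needs via
 * UVM oversubscription, then resume and let pages fault back in. */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

void preempt_for(pid_t low_prio, pid_t high_prio) {
    kill(low_prio, SIGSTOP);            /* no new kernel launches from the low-prio job */
    int status;
    waitpid(high_prio, &status, 0);     /* high-prio job runs; evictions happen on demand */
    kill(low_prio, SIGCONT);            /* resume; the working set pages back in lazily */
}
```

The nice property is that if the high-priority job never creates memory pressure, nothing gets paged at all and resuming is effectively free.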
Of course, Nvidia would prefer you to Just Buy More GPUs!