MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

TL;DR: Efficiently serving multiple LLMs has emerged as a crucial and time-sensitive demand within the community, especially for LLM endpoint providers. In this blog post, we show that the dynamic popularity of LLMs and the unbalanced resource utilization of LLM inference can be leveraged to achieve high GPU utilization and reduce serving cost. We introduce MuxServe, a novel serving system that efficiently serves multiple LLMs with flexible spatial-temporal multiplexing. MuxServe outperforms the spatial partitioning and temporal multiplexing baselines by up to $1.
