TL;DR: Instant model hot swaps, fast cold starts, automatic model updates, predictive LLM scaling, secure access control, all on your infrastructure or private cloud.
Get started now at https://outerport.com!
Horizontal scaling of LLM inference is difficult. Preparing a server for LLM inference roughly involves the following steps:

1. Download the model weights from remote storage.
2. Load the weights into CPU memory.
3. Transfer the weights onto the GPU.
When implemented naively, just these 3 steps can take around 4 minutes for a small 7B-parameter LLM. To optimize this, you need to implement model chunking, parallel downloads, network streaming into memory, and local SSD caching. Even after all of this, model loading can still take upwards of 30 seconds, a long time to keep impatient customers waiting.
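To make the download step concrete, here is a minimal sketch of chunked, parallel downloading, one of the optimizations mentioned above. The `fetch_range` callable is a hypothetical stand-in for whatever fetches a byte range (e.g. an HTTP Range request against an object store); the names and chunk size are illustrative, not Outerport's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def plan_chunks(total_size: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split [0, total_size) into contiguous (start, end) byte ranges."""
    return [(s, min(s + chunk_size, total_size))
            for s in range(0, total_size, chunk_size)]

def parallel_fetch(fetch_range, total_size: int,
                   chunk_size: int = 64 * 1024 * 1024, workers: int = 8) -> bytes:
    """Fetch all chunks concurrently and reassemble them in order.

    `fetch_range(start, end)` must return the bytes for that half-open range,
    e.g. via an HTTP Range request. Chunking lets many connections saturate
    the network instead of one serial stream.
    """
    ranges = plan_chunks(total_size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: fetch_range(*r), ranges)
    return b"".join(parts)
```

A real loader would additionally stream chunks straight into pinned memory rather than concatenating them, but the concurrency pattern is the same.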
Outerport achieves a ~2 second model load time by keeping models warm in a pinned-memory cache daemon, with predictive orchestration to figure out where and when to keep models warm. We provide what many serverless providers have already built for container images, but specialized for model weights, which bring their own set of challenges.
Here's a live demo of model hot swapping:
https://www.youtube.com/watch?v=YoA2elVvo_o
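The warm-cache idea can be sketched with OS shared memory: a daemon holds the weight blobs, and a fresh inference process maps them instantly instead of re-downloading. This is a toy illustration under stated assumptions, not Outerport's implementation; real pinned (page-locked) GPU staging memory, eviction policy, and predictive placement are out of scope, and all names are hypothetical.

```python
from multiprocessing import shared_memory

class WarmModelCache:
    """Toy cache daemon: keeps model weight blobs resident in OS shared
    memory so a newly started inference worker can map them with zero copy
    rather than re-fetching from storage."""

    def __init__(self) -> None:
        self._segments: dict[str, shared_memory.SharedMemory] = {}

    def warm(self, model_id: str, weights: bytes) -> None:
        # Allocate a shared segment and copy the weights in once.
        shm = shared_memory.SharedMemory(create=True, size=len(weights))
        shm.buf[:len(weights)] = weights
        self._segments[model_id] = shm

    def attach(self, model_id: str) -> memoryview:
        # A worker would map this region directly, then stage it to the GPU.
        return self._segments[model_id].buf

    def evict(self, model_id: str) -> None:
        # Free the segment when the orchestrator decides the model goes cold.
        shm = self._segments.pop(model_id)
        shm.close()
        shm.unlink()
```

The interesting production problems sit on top of this: deciding *which* models to keep warm on *which* machines, which is where the predictive orchestration comes in.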
With Outerport, you can also get a model registry: `push` models to it and `pull` them back fast.

Overall system architecture:
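For a sense of how a push/pull model registry behaves, here is a hypothetical client sketch; Outerport's actual API may differ, and the in-memory dict stands in for remote storage. Content-addressing by digest (as container registries do) is an assumption on our part.

```python
import hashlib

class ModelRegistryClient:
    """Hypothetical push/pull registry client, illustrative only.

    Weights are stored content-addressed by SHA-256 digest, so pushing the
    same blob twice deduplicates, and tags are just pointers to digests.
    """

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}  # digest -> weights blob
        self._tags: dict[str, str] = {}     # "name:tag" -> digest

    def push(self, name: str, tag: str, weights: bytes) -> str:
        digest = hashlib.sha256(weights).hexdigest()
        self._store[digest] = weights          # dedup'd by content
        self._tags[f"{name}:{tag}"] = digest   # tag points at the digest
        return digest

    def pull(self, name: str, tag: str = "latest") -> bytes:
        return self._store[self._tags[f"{name}:{tag}"]]
```

Tag-to-digest indirection is what makes automatic model updates cheap: re-pointing a tag is a metadata write, not a data transfer.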
We (Towaki and Allen) bring experience in ML infrastructure and systems from NVIDIA, Tome, LinkedIn, and Meta. Allen shipped fine-tuned LLM inference features to tens of millions of customers at his previous startup, and Towaki wrote GPU code and optimized 3D foundation model training at NVIDIA.
Now we want to unlock this capability for everyone else. Ping us at founders@outerport.com or book a demo at https://outerport.com.
Our ask: If you are, or know, someone who fits any of the descriptions below, we'd love to talk! Please reach out to founders@outerport.com or book a demo at https://outerport.com.