The control plane
for LLM inference
Route, autoscale, and meter inference across vLLM, SGLang, and TensorRT-LLM, on any GPU, in any cloud. Run it shared, dedicated, or on-prem, or start free with the open-source gateway.
Try the interactive playgroundWorks with your engines
Inference gets messy after the second backend.
The moment you run more than one model, engine, or GPU tier, your apps inherit infrastructure they shouldn't own. A control plane takes it back.
- Routing & failover logic leaks into every app
- Blackbox shared APIs you can't tune or inspect
- GPUs over-provisioned for peak, idle the rest of the day
- Metrics split across engines, no unified view
- Model rollouts ride along with app deploys, manual and risky
- Per-tenant cost & usage is a mystery
- One OpenAI-compatible endpoint, apps stay unchanged
- Engine-agnostic control plane you fully own
- Autoscaling + scale-to-zero, heterogeneous GPU tiers
- Unified observability across the whole fleet
- Traffic shifts and rollouts by config, not redeploys
- Per-tenant token metering & billing, built in
Operate inference like infrastructure, not glue code.
Reliability that doesn't page you
Active and passive health checks pull failing backends out of rotation, then re-add them on recovery. Autoscaling absorbs the spikes so traffic just keeps flowing.
Cost-efficient by construction
Mix GPU tiers per workload, reuse KV cache across requests, and scale to zero when idle. No more over-provisioning every model for peak that rarely comes.
Portable, never locked in
One control plane over vLLM, SGLang, TensorRT-LLM and more. Run it shared, dedicated, or on-prem, and move between them without touching a line of app code.
* Figures are directional targets. Confirm with your own benchmarks before publishing.
One plane over your whole inference fleet.
Everything between your apps and your GPUs: routing, scaling, caching, orchestration, and metering, all operated from a single layer that sits above any engine.
LLM-aware gateway and routing
An OpenAI-compatible endpoint with token-aware, least-inflight, and EWMA routing. Streaming SSE passthrough and health-based failover come standard.
Autoscaling and heterogeneous serving
Scale replicas to real-time demand, scale to zero when idle, and mix GPU tiers to keep utilization high and hit your SLAs at the lowest possible cost.
Distributed KV cache
High-capacity, cross-engine KV reuse so repeated context never gets recomputed. Lower latency, higher throughput, same hardware.
Model and LoRA orchestration
Declarative model lifecycle and high-density LoRA management. Roll out and shift traffic between versions with a config edit.
Observability and metering
Prometheus metrics, structured audit logs, and per-tenant token usage: the foundation for real SLOs and billing.
GPU failure detection
Spot unhealthy hardware and drain it before it drags down the fleet, then bring capacity back automatically once it recovers.
We'd rather show you than tell you.
Independent, reproducible benchmarks against the alternatives are in the lab right now. Squint if you want a sneak peek, the real numbers drop soon.
Benchmarks are brewing
Full methodology, workloads, and head-to-head results, published the day they're ready.
Want an early look? Ask usOne control plane. Four ways to run it.
Start free and self-hosted, scale into fully-managed shared or dedicated cloud, or bring the whole control plane into your own environment. Same platform, your terms.
SLA-backed uptime
Guaranteed targets on managed tiers, not best-effort.
24/7 support
On-call engineers when production needs a human.
Higher GPU utilization
Pack more work per GPU with batching and scale-to-zero.
Lower GPU cost
Up to 40% less spend through heterogeneous scheduling.
Gateway
Open sourceThe open front door. Routing, failover, and observability you run yourself.
- Routing strategies
- Health-based failover
- OpenAI-compatible API
- Prometheus + JSONL logs
Shared
ManagedFully-managed inference on Tensormux's pooled GPUs. Spin up in minutes.
- Multi-tenant pool
- Per-token pricing
- Autoscaling included
- Usage dashboard
Dedicated
Most popularIsolated, reserved GPU capacity with SLOs and the full control plane.
- Single-tenant isolation
- Reserved capacity + SLOs
- Per-tenant metering
- Priority support
On-Prem
EnterpriseThe full control plane deployed in your own cloud VPC or datacenter.
- Runs in your VPC / DC
- Bring your own GPUs
- Data residency & RBAC
- Forward-deployed support
| Compare | Gateway | Shared | Dedicated | On-Prem |
|---|---|---|---|---|
| Runs where | Your machine | Tensormux cloud | Tensormux cloud | Your VPC / DC |
| Tenancy | You | Multi-tenant | Single-tenant | Your environment |
| GPU ownership | Yours | Ours (pooled) | Ours (reserved) | Yours |
| Autoscaling | ||||
| SLA-backed uptime | ||||
| Token metering & billing | ||||
| Data residency / RBAC | ||||
| 24/7 support | ||||
| Support | Community | Standard | Priority | Forward-deployed |
| Pricing | Free | Per-token | Reserved | Custom |
Three steps. No application rewrite.
Pick a serving model
Self-host the open gateway, or run shared, dedicated, or on-prem. Same control plane underneath.
Connect your models & engines
Register backends across vLLM, SGLang, or TensorRT-LLM. Declare GPU tiers, weights, and SLOs.
Route, scale & meter
Point your OpenAI SDK at one endpoint. Routing, autoscaling, and per-tenant metering start immediately.
Start with the open gateway.
Tensormux Gateway is the open-source front door to the platform: an OpenAI-compatible routing and failover layer you self-host in minutes. When you need scale, isolation, or metering, graduate to the managed control plane. Same mission, no rewrite.
# pull and run the gateway
$ docker pull ghcr.io/krxgu/tensormux:latest
$ docker run -p 8080:8080 tensormux
# point your OpenAI SDK here. nothing else changes
baseURL: "http://localhost:8080/v1"We build in the open.
The gateway is just the start. Some of our internal tools are open source too.
Tensormux Gateway
PythonOpenAI-compatible routing, health-based failover, and observability you self-host.
kernel-skills
TypeScriptA skill library that helps AI agents write, optimize, and debug CUDA and Triton kernels.
TensorPath
PythonInference-optimization control plane: pick the best GPU, backend, and quantization for a model.
Infrastructure as config, apps untouched.
Declare routing, autoscaling, and SLOs in the control plane. Your applications keep talking to one OpenAI-compatible endpoint. No rewrites, no per-service plumbing.
- 1
Deploy the control plane
One install into your own cluster, or use our managed cloud.
- 2
Declare routing, scaling & SLOs
Engines, GPU tiers, autoscaling and latency targets as config.
- 3
Apps keep one endpoint
Your OpenAI SDK changes one base URL. Nothing else is rewritten.
# deploy the control plane into your clusterkubectl apply -k github.com/tensormux/platform/config/default # register a model across engines & GPU tierskubectl apply -f model.yamlRun inference like infrastructure.
Start free with the open-source gateway, or talk to us about shared, dedicated, and on-prem control planes. No sales pitch, just an honest conversation about your stack.
Explore the playground