Tensormux is now part of the NVIDIA Inception Programv1.0 of the open-source gateway is live
Inference control plane · any engine, any GPU, anywhere

The control plane
for LLM inference

Route, autoscale, and meter inference across vLLM, SGLang, and TensorRT-LLM, on any GPU, in any cloud. Run it shared, dedicated, or on-prem, or start free with the open-source gateway.

Try the interactive playground

Works with your engines

vLLMSGLangTensorRT-LLMTritonOllama
0 routed
shared GPU poolYour appsOpenAI SDKTensormux control planeRouteScaleCacheMeterstrategy: Least inflightvLLMH100 ×342ms0 reqSGLangA100 ×288ms0 reqTRT-LLML40S ×261ms0 req
Healthy DegradedMulti-tenant pool · live routing demo
NVIDIA Inception Program member

Built by contributors to leading inference engines.

Routes acrossvLLMSGLangTensorRT-LLMTritonOllama+ more
The shift

Inference gets messy after the second backend.

The moment you run more than one model, engine, or GPU tier, your apps inherit infrastructure they shouldn't own. A control plane takes it back.

Without a control plane
  • Routing & failover logic leaks into every app
  • Blackbox shared APIs you can't tune or inspect
  • GPUs over-provisioned for peak, idle the rest of the day
  • Metrics split across engines, no unified view
  • Model rollouts ride along with app deploys, manual and risky
  • Per-tenant cost & usage is a mystery
On Tensormux
  • One OpenAI-compatible endpoint, apps stay unchanged
  • Engine-agnostic control plane you fully own
  • Autoscaling + scale-to-zero, heterogeneous GPU tiers
  • Unified observability across the whole fleet
  • Traffic shifts and rollouts by config, not redeploys
  • Per-tenant token metering & billing, built in
Why teams run it

Operate inference like infrastructure, not glue code.

99.9%+uptime target

Reliability that doesn't page you

Active and passive health checks pull failing backends out of rotation, then re-add them on recovery. Autoscaling absorbs the spikes so traffic just keeps flowing.

Up to 40%lower GPU spend

Cost-efficient by construction

Mix GPU tiers per workload, reuse KV cache across requests, and scale to zero when idle. No more over-provisioning every model for peak that rarely comes.

5+ enginesany GPU, any cloud

Portable, never locked in

One control plane over vLLM, SGLang, TensorRT-LLM and more. Run it shared, dedicated, or on-prem, and move between them without touching a line of app code.

* Figures are directional targets. Confirm with your own benchmarks before publishing.

The control plane

One plane over your whole inference fleet.

Everything between your apps and your GPUs: routing, scaling, caching, orchestration, and metering, all operated from a single layer that sits above any engine.

LLM-aware gateway and routing

An OpenAI-compatible endpoint with token-aware, least-inflight, and EWMA routing. Streaming SSE passthrough and health-based failover come standard.

Autoscaling and heterogeneous serving

Scale replicas to real-time demand, scale to zero when idle, and mix GPU tiers to keep utilization high and hit your SLAs at the lowest possible cost.

Distributed KV cache

High-capacity, cross-engine KV reuse so repeated context never gets recomputed. Lower latency, higher throughput, same hardware.

Model and LoRA orchestration

Declarative model lifecycle and high-density LoRA management. Roll out and shift traffic between versions with a config edit.

Observability and metering

Prometheus metrics, structured audit logs, and per-tenant token usage: the foundation for real SLOs and billing.

GPU failure detection

Spot unhealthy hardware and drain it before it drags down the fleet, then bring capacity back automatically once it recovers.

Benchmarks

We'd rather show you than tell you.

Independent, reproducible benchmarks against the alternatives are in the lab right now. Squint if you want a sneak peek, the real numbers drop soon.

Coming soon

Benchmarks are brewing

Full methodology, workloads, and head-to-head results, published the day they're ready.

Want an early look? Ask us
Serving models

One control plane. Four ways to run it.

Start free and self-hosted, scale into fully-managed shared or dedicated cloud, or bring the whole control plane into your own environment. Same platform, your terms.

SLA-backed uptime

Guaranteed targets on managed tiers, not best-effort.

24/7 support

On-call engineers when production needs a human.

Higher GPU utilization

Pack more work per GPU with batching and scale-to-zero.

Lower GPU cost

Up to 40% less spend through heterogeneous scheduling.

Gateway

Open source
Freeself-hosted

The open front door. Routing, failover, and observability you run yourself.

  • Routing strategies
  • Health-based failover
  • OpenAI-compatible API
  • Prometheus + JSONL logs
View on GitHub

Shared

Managed
Pay-per-tokenno commitment

Fully-managed inference on Tensormux's pooled GPUs. Spin up in minutes.

  • Multi-tenant pool
  • Per-token pricing
  • Autoscaling included
  • Usage dashboard
Book a call

Dedicated

Most popular
Reservedsingle-tenant

Isolated, reserved GPU capacity with SLOs and the full control plane.

  • Single-tenant isolation
  • Reserved capacity + SLOs
  • Per-tenant metering
  • Priority support
Book a call

On-Prem

Enterprise
Customyour environment

The full control plane deployed in your own cloud VPC or datacenter.

  • Runs in your VPC / DC
  • Bring your own GPUs
  • Data residency & RBAC
  • Forward-deployed support
Talk to us
CompareGatewaySharedDedicatedOn-Prem
Runs whereYour machineTensormux cloudTensormux cloudYour VPC / DC
TenancyYouMulti-tenantSingle-tenantYour environment
GPU ownershipYoursOurs (pooled)Ours (reserved)Yours
Autoscaling
SLA-backed uptime
Token metering & billing
Data residency / RBAC
24/7 support
SupportCommunityStandardPriorityForward-deployed
PricingFreePer-tokenReservedCustom

Not sure which fits? We'll help you map it to your stack.

Book a callOr start with open source
How it works

Three steps. No application rewrite.

1step

Pick a serving model

Self-host the open gateway, or run shared, dedicated, or on-prem. Same control plane underneath.

serving: dedicated
2step

Connect your models & engines

Register backends across vLLM, SGLang, or TensorRT-LLM. Declare GPU tiers, weights, and SLOs.

engine: vllm · gpu: h100
3step

Route, scale & meter

Point your OpenAI SDK at one endpoint. Routing, autoscaling, and per-tenant metering start immediately.

baseURL: "…/v1"
Open source

Start with the open gateway.

Tensormux Gateway is the open-source front door to the platform: an OpenAI-compatible routing and failover layer you self-host in minutes. When you need scale, isolation, or metering, graduate to the managed control plane. Same mission, no rewrite.

MIT-licensed · self-hosted · no vendor lock-in
quickstart
# pull and run the gateway
$ docker pull ghcr.io/krxgu/tensormux:latest
$ docker run -p 8080:8080 tensormux

# point your OpenAI SDK here. nothing else changes
baseURL: "http://localhost:8080/v1"
Operate it your way

Infrastructure as config, apps untouched.

Declare routing, autoscaling, and SLOs in the control plane. Your applications keep talking to one OpenAI-compatible endpoint. No rewrites, no per-service plumbing.

  1. 1

    Deploy the control plane

    One install into your own cluster, or use our managed cloud.

  2. 2

    Declare routing, scaling & SLOs

    Engines, GPU tiers, autoscaling and latency targets as config.

  3. 3

    Apps keep one endpoint

    Your OpenAI SDK changes one base URL. Nothing else is rewritten.

# deploy the control plane into your cluster
kubectl apply -k github.com/tensormux/platform/config/default
# register a model across engines & GPU tiers
kubectl apply -f model.yaml

Run inference like infrastructure.

Start free with the open-source gateway, or talk to us about shared, dedicated, and on-prem control planes. No sales pitch, just an honest conversation about your stack.

Explore the playground