Tensormux is now part of the NVIDIA Inception Programv1.0 is live — Star us on GitHub →
— OPEN SOURCE. SELF-HOSTED.

One OpenAI endpoint. Many inference backends.

Route across vLLM, SGLang, and TensorRT-LLM with health checks, automatic failover, and Prometheus observability. One config file. One endpoint. No app rewrite.

0 routed
Your AppOpenAI SDKbase_urlTensormuxGatewayLeast InflightroutesvLLMA100 GPU45ms0 reqSGLangT4 GPU120ms0 reqTRT-LLML4 GPU65ms0 req
Healthy Degraded UnhealthyClick strategy to cycleClick "Send request" to route manually
NVIDIA Inception Program member

Built by contributors to leading inference engines.

Works withvLLMSGLangTensorRT-LLMTritonOllama+ more
— THE PROBLEM

Inference gets messy after the second backend.

Once you run more than one model or one engine, your application picks up infrastructure responsibilities that don't belong there.

01

Routing logic leaks into application code

One app now talks to multiple inference engines, each with its own client, base URL, and quirks.

02

Failover is reinvented per service

Retry policies, circuit breakers, and health probes become bespoke code in every service that touches a model.

03

Metrics are split across engines

Latency, throughput, and error rates live in different formats, in different places, with no unified view.

04

Model rollouts become manual and risky

Shifting traffic between GPU tiers or new model versions requires app deploys instead of a config change.

— WHAT IT DOES

A small, opinionated routing layer between your app and your fleet.

ROUTING

Strategy by config, not code

Route by least inflight, EWMA latency, or weighted round-robin. Switch strategies without touching application code.

strategy
least_inflight
fallback
weighted
weights
vllm:3 trt:2 sglang:1
tag
cheap | premium
RELIABILITY

Health-aware failover

Active and passive checks detect unhealthy backends and remove them from routing automatically. They re-enter when they recover.

event
backend degraded
event
excluded from pool
event
recovered · re-added
API

OpenAI-compatible gateway

Point your SDK at Tensormux and keep your app unchanged. Streaming, model lists, and chat completions pass through.

POST
/v1/chat/completions
POST
/v1/completions
GET
/v1/models
GET
/tensormux/status
— HOW IT WORKS

Three steps. No application rewrite.

1step

Deploy in front of your backends

Run Tensormux as a single binary, container, or sidecar. Point it at your existing inference engines.

$ docker run tensormux:latest
2step

Update one config file

Declare backends, models, weights, and tags in YAML. Choose a routing strategy. Set health-check thresholds.

strategy: least_inflight
3step

Change the base_url

Repoint your existing OpenAI SDK at Tensormux. Routing, failover, and metrics begin immediately.

baseURL: "tensormux:8080/v1"
— INTEGRATION

Before Tensormux, after Tensormux.

Your application stops being responsible for inference infrastructure. One base URL, one config file, one control point.

BEFORE— today

App owns inference plumbing

app → vllm-fast.internal:8000
app → sglang-cheap.internal:8000
app → trt-llm-t4.internal:8000
  • Custom routing in app code
  • Manual failover, manual retries
  • Fragmented logs across engines
  • Model rollouts ship as app deploys
AFTER— this afternoon

One endpoint, declarative routing

app → tensormux:8080 →
├ vllm-fast
├ sglang-cheap
└ trt-llm-t4
  • One base URL for the SDK
  • Failover handled by gateway
  • Unified metrics and audit logs
  • Rollouts are config edits
gateway: host: 0.0.0.0 port: 8080 strategy: least_inflight backends: - name: vllm-fast url: http://vllm-fast:8000 engine: vllm model: llama-3.1-8b weight: 80 tags: ["fast", "gpu-a10"] health_endpoint: /v1/models - name: sglang-cheap url: http://sglang-cheap:8000 engine: sglang model: llama-3.1-8b weight: 20 tags: ["cheap", "gpu-t4"] health_endpoint: /v1/models health: interval_s: 5 timeout_s: 2 fail_threshold: 2 success_threshold: 1 logging: level: info jsonl_path: tensormux.jsonl
— OPEN SOURCE VS ENTERPRISE

Everything you need, open source.

Tensormux OSS is production-ready. No feature gates, no usage limits.

OPEN SOURCE— available now
  • Routing policies (least inflight, EWMA, weighted round-robin)
  • Health checking and automatic failover
  • OpenAI-compatible API passthrough
  • SSE streaming support
  • Prometheus metrics endpoint
  • Status and health endpoints
  • YAML-based configuration
  • Audit logging (JSONL)
MANAGED OFFERING— enterprise preview

Multi-tenant dashboards, RBAC, SLO monitoring, audit trails, and managed infrastructure. Currently in design — talk to us about your setup.

Preview the console

Interactive playground

— GET STARTED
OPEN SOURCE

Ready to simplify inference routing?

Deploy Tensormux in minutes. One config file, one endpoint, full control. Self-hosted, no vendor lock-in.

ENTERPRISE

Running inference at scale?

Let's talk about your inference setup — multi-model deployments, GPU tier routing, failover requirements, and what a managed control plane would look like for your team.

No sales pitch. Just an honest conversation about your stack.