One OpenAI endpoint. Many inference backends.
Route across vLLM, SGLang, and TensorRT-LLM with health checks, automatic failover, and Prometheus observability. One config file. One endpoint. No app rewrite.
Built by contributors to leading inference engines.
Inference gets messy fast.
Scaling inference across multiple backends means bespoke routing logic, manual failover, and fragmented metrics: complexity that slows teams down.
What Tensormux does
One gateway for routing, reliability, and observability across your inference fleet.
Smart routing
Choose least inflight, EWMA latency, or weighted round-robin. Tensormux routes each request to the best available backend under your chosen policy.
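As an illustration of how these policies behave, here is a minimal least-inflight selector with an EWMA latency tiebreaker. The class, field names, and numbers are hypothetical, not Tensormux internals.

# Illustrative only: least-inflight routing with an EWMA latency tiebreaker.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    inflight: int = 0             # requests currently being served
    ewma_latency_ms: float = 0.0  # exponentially weighted moving average of latency

    def observe(self, latency_ms: float, alpha: float = 0.2) -> None:
        # Fold the newest latency sample into the moving average.
        self.ewma_latency_ms = alpha * latency_ms + (1 - alpha) * self.ewma_latency_ms

def pick_backend(backends: list[Backend]) -> Backend:
    # Fewest requests in flight wins; ties go to the lowest smoothed latency.
    return min(backends, key=lambda b: (b.inflight, b.ewma_latency_ms))

pool = [
    Backend("vllm-fast", inflight=3, ewma_latency_ms=120.0),
    Backend("sglang-cheap", inflight=1, ewma_latency_ms=310.0),
]
print(pick_backend(pool).name)  # -> sglang-cheap (fewest requests in flight)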
Auto failover
Continuous health checks catch failing backends within seconds. Unhealthy backends are bypassed automatically and returned to rotation once they recover.
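The failover mechanic, roughly: eject a backend after consecutive failed probes, readmit it after consecutive successes. The sketch below mirrors the fail_threshold and success_threshold settings from the sample config further down; it is illustrative, not Tensormux's implementation.

# Illustrative sketch of threshold-based failover state tracking.
class HealthTracker:
    def __init__(self, fail_threshold: int = 2, success_threshold: int = 1):
        self.fail_threshold = fail_threshold
        self.success_threshold = success_threshold
        self.healthy = True
        self._fails = 0
        self._successes = 0

    def record(self, probe_ok: bool) -> None:
        if probe_ok:
            self._fails = 0
            self._successes += 1
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True   # bring the backend back into rotation
        else:
            self._successes = 0
            self._fails += 1
            if self.healthy and self._fails >= self.fail_threshold:
                self.healthy = False  # stop routing traffic to this backend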
Drop-in compatible
Full OpenAI API support with streaming. Just change your base_url — no code changes needed in your application.
Three steps to unified inference routing
Get started in minutes with Docker or as a standalone binary.
Deploy Tensormux in front of your backends
Run Tensormux as a Docker container or binary. Point it at your inference backends via a simple YAML config.
Point your app to Tensormux
Change the OpenAI SDK base_url to your Tensormux host. No code changes beyond that single line.
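For example, with the official OpenAI Python SDK the switch is one line. The host, port, /v1 prefix, placeholder API key, and model name below are assumptions drawn from the sample config on this page; substitute your own values.

# Point the standard OpenAI Python SDK at Tensormux instead of api.openai.com.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your Tensormux gateway (assumed host/port/path)
    api_key="not-used-by-tensormux",      # the SDK requires a value; placeholder assumption
)

# Streaming works as usual; Tensormux passes SSE chunks straight through.
stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)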
Tensormux routes and exposes metrics
Requests are routed according to your chosen policy. Prometheus metrics, health status, and JSONL audit logs are exposed out of the box.
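A quick sanity check after deploying is to scrape the gateway's observability endpoints. The /metrics and /status paths below are assumptions based on common Prometheus conventions; confirm the exact routes in the Tensormux docs.

# Probe the gateway's metrics and status endpoints (paths are assumptions).
import urllib.request

base = "http://localhost:8080"  # your Tensormux gateway

for path in ("/metrics", "/status"):
    with urllib.request.urlopen(base + path, timeout=5) as resp:
        body = resp.read().decode()
        print(f"{path}: HTTP {resp.status}, {len(body)} bytes")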
Integration
See how Tensormux simplifies your inference stack.
Without Tensormux
Bespoke routing logic, manual failover, fragmented metrics.
With Tensormux
One YAML config defines routing, health checks, failover, and logging.
gateway:
  host: 0.0.0.0
  port: 8080
  strategy: least_inflight
backends:
  - name: vllm-fast
    url: http://vllm-fast:8000
    engine: vllm
    model: llama-3.1-8b
    weight: 80
    tags: ["fast", "gpu-a10"]
    health_endpoint: /v1/models
  - name: sglang-cheap
    url: http://sglang-cheap:8000
    engine: sglang
    model: llama-3.1-8b
    weight: 20
    tags: ["cheap", "gpu-t4"]
    health_endpoint: /v1/models
health:
  interval_s: 5
  timeout_s: 2
  fail_threshold: 2
  success_threshold: 1
logging:
  level: info
  jsonl_path: tensormux.jsonl
Everything you need, open source
Tensormux OSS is production-ready. No feature gates, no usage limits.
Open Source
Available now
- Routing policies (least inflight, EWMA, weighted round-robin)
- Health checking and automatic failover
- OpenAI-compatible API passthrough
- SSE streaming support
- Prometheus metrics endpoint
- Status and health endpoints
- YAML-based configuration
- Audit logging (JSONL)
Managed offering
Multi-tenant dashboards, RBAC, SLO monitoring, audit trails, and managed infrastructure. Currently in design — talk to us about your setup.
Preview the console
Try the interactive playground
Talk to us about your inference setup.
Ready to simplify inference routing?
Deploy Tensormux in minutes. One config file, one endpoint, full control. Self-hosted, no vendor lock-in.
Running inference at scale?
Let's talk about your inference setup — multi-model deployments, GPU tier routing, failover requirements, and what a managed control plane would look like for your team.
No sales pitch. Just an honest conversation about your stack.