One OpenAI endpoint. Many inference backends.
Route across vLLM, SGLang, and TensorRT-LLM with health checks, automatic failover, and Prometheus observability. One config file. One endpoint. No app rewrite.
Inference gets messy after the second backend.
Once you run more than one model or one engine, your application picks up infrastructure responsibilities that don't belong there.
Routing logic leaks into application code
One app now talks to multiple inference engines, each with its own client, base URL, and quirks.
Failover is reinvented per service
Retry policies, circuit breakers, and health probes become bespoke code in every service that touches a model.
Metrics are split across engines
Latency, throughput, and error rates live in different formats, in different places, with no unified view.
Model rollouts become manual and risky
Shifting traffic between GPU tiers or new model versions requires app deploys instead of a config change.
A small, opinionated routing layer between your app and your fleet.
Strategy by config, not code
Route by least inflight, EWMA latency, or weighted round-robin. Switch strategies without touching application code.
Health-aware failover
Active and passive checks detect unhealthy backends and remove them from routing automatically. They re-enter when they recover.
OpenAI-compatible gateway
Point your SDK at Tensormux and keep your app unchanged. Streaming, model lists, and chat completions pass through.
Three steps. No application rewrite.
Deploy in front of your backends
Run Tensormux as a single binary, container, or sidecar. Point it at your existing inference engines.
Update one config file
Declare backends, models, weights, and tags in YAML. Choose a routing strategy. Set health-check thresholds.
Change the base_url
Repoint your existing OpenAI SDK at Tensormux. Routing, failover, and metrics begin immediately.
Before Tensormux, after Tensormux.
Your application stops being responsible for inference infrastructure. One base URL, one config file, one control point.
App owns inference plumbing
- Custom routing in app code
- Manual failover, manual retries
- Fragmented logs across engines
- Model rollouts ship as app deploys
One endpoint, declarative routing
- One base URL for the SDK
- Failover handled by gateway
- Unified metrics and audit logs
- Rollouts are config edits
gateway:
host: 0.0.0.0
port: 8080
strategy: least_inflight
backends:
- name: vllm-fast
url: http://vllm-fast:8000
engine: vllm
model: llama-3.1-8b
weight: 80
tags: ["fast", "gpu-a10"]
health_endpoint: /v1/models
- name: sglang-cheap
url: http://sglang-cheap:8000
engine: sglang
model: llama-3.1-8b
weight: 20
tags: ["cheap", "gpu-t4"]
health_endpoint: /v1/models
health:
interval_s: 5
timeout_s: 2
fail_threshold: 2
success_threshold: 1
logging:
level: info
jsonl_path: tensormux.jsonlEverything you need, open source.
Tensormux OSS is production-ready. No feature gates, no usage limits.
- Routing policies (least inflight, EWMA, weighted round-robin)
- Health checking and automatic failover
- OpenAI-compatible API passthrough
- SSE streaming support
- Prometheus metrics endpoint
- Status and health endpoints
- YAML-based configuration
- Audit logging (JSONL)
Multi-tenant dashboards, RBAC, SLO monitoring, audit trails, and managed infrastructure. Currently in design — talk to us about your setup.
Preview the console
Interactive playground
Ready to simplify inference routing?
Deploy Tensormux in minutes. One config file, one endpoint, full control. Self-hosted, no vendor lock-in.
Running inference at scale?
Let's talk about your inference setup — multi-model deployments, GPU tier routing, failover requirements, and what a managed control plane would look like for your team.
No sales pitch. Just an honest conversation about your stack.