Open source. Self-hosted. No vendor lock-in.

One OpenAI endpoint. Many inference backends.

Route across vLLM, SGLang, and TensorRT-LLM with health checks, failover, and observability. A drop-in replacement for your OpenAI base URL.

Interactive demo: Your App → OpenAI SDK (base_url) → Tensormux Gateway → routing strategy (e.g. Least Inflight) → vLLM (A100 GPU), SGLang (T4 GPU), TRT-LLM (L4 GPU), each with live health status (Healthy / Degraded / Unhealthy).

Inference gets messy fast.

Scaling inference across multiple backends introduces complexity that slows teams down.

Multiple models and inference engines to manage
Different GPU tiers and regions with varying performance
Backend failures and degraded performance with no automatic handling
Fragmented metrics and no single control plane

What Tensormux does

One gateway for routing, reliability, and observability across your inference fleet.

Routing policies
Least inflight, EWMA latency, weighted round-robin. Choose a strategy, or let Tensormux pick the fastest backend automatically (see the sketch after this list).
Health checks and failover
Backends are health-checked continuously. Unhealthy nodes are excluded automatically and recovered when they pass checks again.
OpenAI compatible
Chat completions, streaming via SSE, and standard error responses. Switch your base_url and everything works.
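For intuition, here is a minimal Python sketch of what a least-inflight policy with health-aware exclusion and an EWMA latency tiebreak can look like. The class and field names are illustrative, not Tensormux's internals.

from dataclasses import dataclass

EWMA_ALPHA = 0.2  # smoothing factor for the latency moving average

@dataclass
class Backend:
    # Hypothetical per-backend state a gateway might track.
    name: str
    healthy: bool = True
    inflight: int = 0            # requests currently being served
    ewma_latency_ms: float = 0.0

    def observe_latency(self, latency_ms: float) -> None:
        # Exponentially weighted moving average of observed latencies.
        self.ewma_latency_ms = (
            EWMA_ALPHA * latency_ms + (1 - EWMA_ALPHA) * self.ewma_latency_ms
        )

def pick_backend(backends: list[Backend]) -> Backend:
    # Least-inflight selection over healthy nodes; EWMA latency breaks ties.
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy backends available")
    return min(candidates, key=lambda b: (b.inflight, b.ewma_latency_ms))

# Example: the idle backend wins over the busier one.
fleet = [Backend("vllm-fast"), Backend("sglang-cheap", inflight=3)]
assert pick_backend(fleet).name == "vllm-fast"

Weighted round-robin follows the same shape, except selection is made proportional to each backend's configured weight.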

How it works

Three steps to unified inference routing.

1. Deploy Tensormux in front of your backends

Run Tensormux as a Docker container or binary. Point it at your inference backends via a simple YAML config.

2. Point your app to Tensormux

Change the OpenAI SDK base_url to your Tensormux host. No code changes beyond the URL.
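As a concrete sketch, assuming Tensormux is reachable at http://tensormux.internal:8080 and serves the standard /v1 routes (the host, port, API key, and model name below are placeholders for your deployment), the switch looks like this with the official openai Python SDK:

from openai import OpenAI

# Point the SDK at the Tensormux gateway instead of api.openai.com.
client = OpenAI(
    base_url="http://tensormux.internal:8080/v1",
    api_key="whatever-your-gateway-expects",
)

# Standard chat completion; Tensormux forwards it to a healthy backend.
resp = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello from behind the gateway"}],
)
print(resp.choices[0].message.content)

# Streaming works the same way (SSE under the hood).
stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Stream this reply"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")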

3. Tensormux routes and exposes metrics

Requests are routed according to your chosen policy. Prometheus metrics, health status, and audit logs are exposed out of the box.
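A quick smoke test of those endpoints might look like the sketch below, using the requests package; the /healthz and /metrics paths and the gateway host are assumptions about a typical layout rather than documented Tensormux routes.

import requests

GATEWAY = "http://tensormux.internal:8080"  # placeholder host and port

# Endpoint paths are assumptions; check your deployment's config for the real routes.
health = requests.get(f"{GATEWAY}/healthz", timeout=2)
print("gateway health:", health.status_code, health.text.strip())

# Prometheus-format metrics: plain text, one sample per line.
metrics = requests.get(f"{GATEWAY}/metrics", timeout=2).text
print("\n".join(metrics.splitlines()[:10]))

In production you would point a Prometheus scrape job at the same metrics endpoint rather than polling it by hand.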

Integration

See how Tensormux simplifies your inference stack.

Without Tensormux

App → vLLM endpoint (hardcoded URL #1)
App → SGLang endpoint (hardcoded URL #2)
App → TRT-LLM endpoint (hardcoded URL #3)

Bespoke routing logic, manual failover, fragmented metrics.

With Tensormux

App → Tensormux → a healthy backend, chosen by policy. One YAML config:
gateway:
  host: 0.0.0.0
  port: 8080
  strategy: least_inflight

backends:
  - name: vllm-fast
    url: http://vllm-fast:8000
    engine: vllm
    model: llama-3.1-8b
    weight: 80
    tags: ["fast", "gpu-a10"]

  - name: sglang-cheap
    url: http://sglang-cheap:8000
    engine: sglang
    model: llama-3.1-8b
    weight: 20
    tags: ["cheap", "gpu-t4"]

health_check:
  interval_s: 5
  timeout_s: 2
  fail_threshold: 2
  success_threshold: 1

logging:
  level: info
  file: tensormux.jsonl

OSS vs Paid

The open-source gateway covers production routing. A managed console is planned for teams that need more.

Open Source

Available now
  • Routing policies (least inflight, EWMA, weighted round-robin)
  • Health checking and automatic failover
  • OpenAI-compatible API passthrough
  • SSE streaming support
  • Prometheus metrics endpoint
  • Status and health endpoints
  • YAML-based configuration
  • Audit logging (JSONL)

Managed Console

Preview
  • Multi-tenant dashboards
  • SLO monitoring and alerting
  • Policy management UI
  • Audit trails with search
  • RBAC and team access controls
  • Config rollout workflows

Ready to simplify inference routing?

Deploy Tensormux in minutes. One config file, one endpoint, full control.