Tensormux v1.0 is live on GitHub. Open-source inference gateway, no strings attached. Star us →
Open source. Self-hosted. No vendor lock-in.

One OpenAI endpoint. Many inference backends.

Route across vLLM, SGLang, and TensorRT-LLM with health checks, automatic failover, and Prometheus observability. One config file. One endpoint. No app rewrite.

[Interactive demo: your app points the OpenAI SDK base_url at the Tensormux gateway, which routes requests to vLLM (A100 GPU), SGLang (T4 GPU), and TensorRT-LLM (L4 GPU) backends using the selected strategy, with live health status (healthy / degraded / unhealthy) per backend.]

Built by contributors to leading inference engines.

Works with vLLM, SGLang, TensorRT-LLM, Triton, Ollama, and more.
The Challenge

Inference gets messy fast.

Scaling inference across multiple backends introduces complexity that slows teams down.

Multiple models and inference engines to manage
Different GPU tiers and regions with varying performance
Backend failures and degraded performance with no automatic handling
Fragmented metrics and no single control plane
Core Features

What Tensormux does

One gateway for routing, reliability, and observability across your inference fleet.

01

Smart routing

Choose least inflight, EWMA latency, or weighted round-robin. Tensormux sends each request to the best available backend under the selected policy.
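
For intuition, here is a minimal sketch of what a least-inflight policy does. The types and backend names are illustrative, not Tensormux's internal API.

typescript
// Illustrative sketch of a least-inflight policy (not Tensormux's actual code).
// Each healthy backend tracks how many requests it is currently serving;
// the next request goes to the healthy backend with the fewest in-flight requests.
interface Backend {
  name: string;
  healthy: boolean;
  inflight: number; // requests currently being served
}

function pickLeastInflight(backends: Backend[]): Backend | undefined {
  return backends
    .filter((b) => b.healthy)
    .reduce<Backend | undefined>(
      (best, b) => (best === undefined || b.inflight < best.inflight ? b : best),
      undefined
    );
}

// Example: vllm-fast (2 in flight) wins over sglang-cheap (5 in flight).
const chosen = pickLeastInflight([
  { name: "vllm-fast", healthy: true, inflight: 2 },
  { name: "sglang-cheap", healthy: true, inflight: 5 },
]);
console.log(chosen?.name); // "vllm-fast"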

02

Auto failover

Continuous health checks catch failures within seconds. Unhealthy backends are taken out of rotation automatically and reinstated once they recover.
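
As a rough sketch of how threshold-based failover behaves (the threshold names mirror the health settings in the example config later on this page; the logic itself is illustrative, not Tensormux's implementation):

typescript
// Illustrative sketch of threshold-based health tracking (not Tensormux's actual code).
// A backend is marked unhealthy after `failThreshold` consecutive failed probes
// and healthy again after `successThreshold` consecutive successful probes.
interface HealthState {
  healthy: boolean;
  consecutiveFailures: number;
  consecutiveSuccesses: number;
}

function recordProbe(
  state: HealthState,
  probeOk: boolean,
  failThreshold = 2,    // mirrors health.fail_threshold in the example config
  successThreshold = 1  // mirrors health.success_threshold in the example config
): HealthState {
  const consecutiveFailures = probeOk ? 0 : state.consecutiveFailures + 1;
  const consecutiveSuccesses = probeOk ? state.consecutiveSuccesses + 1 : 0;
  let healthy = state.healthy;
  if (consecutiveFailures >= failThreshold) healthy = false;   // take out of rotation
  if (consecutiveSuccesses >= successThreshold) healthy = true; // bring back into rotation
  return { healthy, consecutiveFailures, consecutiveSuccesses };
}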

03

Drop-in compatible

Full OpenAI API support with streaming. Just change your base_url — no code changes needed in your application.
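
For example, with the official OpenAI Node SDK, pointing at the gateway is a one-line change. The host, port, and model name below are placeholders taken from the example config on this page; whether your backends expect an API key depends on your setup.

typescript
import OpenAI from "openai";

// Only the baseURL changes; the rest of your OpenAI client code stays the same.
const client = new OpenAI({
  baseURL: "http://tensormux:8080/v1", // your Tensormux gateway
  apiKey: process.env.OPENAI_API_KEY ?? "unused", // may be unused for self-hosted backends
});

const completion = await client.chat.completions.create({
  model: "llama-3.1-8b", // whatever model your backends serve
  messages: [{ role: "user", content: "Hello from behind the gateway" }],
});
console.log(completion.choices[0].message.content);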

How It Works

Three steps to unified inference routing

Get started in minutes with Docker or as a standalone binary.

1

Deploy Tensormux in front of your backends

Run Tensormux as a Docker container or binary. Point it at your inference backends via a simple YAML config.

docker compose up
2

Point your app to Tensormux

Change the OpenAI SDK base_url to your Tensormux host. No code changes beyond that single line.

baseURL: "http://tensormux:8080/v1"
3

Tensormux routes and exposes metrics

Requests are routed according to your policy. Prometheus metrics, health status, and JSONL audit logs are available out of the box.

GET /metrics GET /status
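
As a quick sanity check, you can hit both endpoints from a script. The response shape of /status is an assumption here, so treat this as a sketch; /metrics serves the usual Prometheus text exposition format.

typescript
// Sketch: poll the gateway's status and metrics endpoints (Node 18+ has global fetch).
const base = "http://tensormux:8080";

const statusRes = await fetch(`${base}/status`);
if (!statusRes.ok) throw new Error(`/status returned ${statusRes.status}`);
console.log(await statusRes.json()); // e.g. per-backend health and latency (shape assumed)

const metricsRes = await fetch(`${base}/metrics`);
const metrics = await metricsRes.text();
console.log(metrics.split("\n").slice(0, 5).join("\n")); // first few Prometheus lines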

Integration

See how Tensormux simplifies your inference stack.

Without Tensormux

App → vLLM endpoint (hardcoded URL #1)
App → SGLang endpoint (hardcoded URL #2)
App → TRT-LLM endpoint (hardcoded URL #3)

Bespoke routing logic, manual failover, fragmented metrics.

With Tensormux

yaml
gateway:
  host: 0.0.0.0
  port: 8080
  strategy: least_inflight

backends:
  - name: vllm-fast
    url: http://vllm-fast:8000
    engine: vllm
    model: llama-3.1-8b
    weight: 80
    tags: ["fast", "gpu-a10"]
    health_endpoint: /v1/models

  - name: sglang-cheap
    url: http://sglang-cheap:8000
    engine: sglang
    model: llama-3.1-8b
    weight: 20
    tags: ["cheap", "gpu-t4"]
    health_endpoint: /v1/models

health:
  interval_s: 5
  timeout_s: 2
  fail_threshold: 2
  success_threshold: 1

logging:
  level: info
  jsonl_path: tensormux.jsonl

Everything you need, open source

Tensormux OSS is production-ready. No feature gates, no usage limits.

Open Source

Available now
  • Routing policies (least inflight, EWMA, weighted round-robin)
  • Health checking and automatic failover
  • OpenAI-compatible API passthrough
  • SSE streaming support
  • Prometheus metrics endpoint
  • Status and health endpoints
  • YAML-based configuration
  • Audit logging (JSONL)
Enterprise preview

Managed offering

Multi-tenant dashboards, RBAC, SLO monitoring, audit trails, and managed infrastructure. Currently in design — talk to us about your setup.

Preview the console

Try the interactive playground

Talk to us about your inference setup.

Open Source

Ready to simplify inference routing?

Deploy Tensormux in minutes. One config file, one endpoint, full control. Self-hosted, no vendor lock-in.

Enterprise

Running inference at scale?

Let's talk about your inference setup — multi-model deployments, GPU tier routing, failover requirements, and what a managed control plane would look like for your team.

No sales pitch. Just an honest conversation about your stack.