Tensormux v1.0 is live on GitHub: an open-source inference gateway, no strings attached.
Documentation

Get started with Tensormux

Deploy in minutes. One config file, one endpoint.

Install

Get Tensormux running

Clone the repository and run with Docker Compose or install from source.

Docker Compose
git clone https://github.com/KrxGu/Tensormux.git
cd Tensormux
docker compose up --build
From source (Python)
git clone https://github.com/KrxGu/Tensormux.git
cd Tensormux
pip install -e .
Quickstart

Three steps to route inference

Create a config file, start the gateway, and point your OpenAI SDK at it.

1. Create config.yaml

config.yaml
gateway:
  host: 0.0.0.0
  port: 8080
  strategy: least_inflight

backends:
  - name: vllm-fast
    url: http://vllm-fast:8000
    engine: vllm
    model: llama-3.1-8b
    weight: 80
    health_endpoint: /v1/models
    tags: ["fast", "gpu-a10"]

  - name: sglang-cheap
    url: http://sglang-cheap:8000
    engine: sglang
    model: llama-3.1-8b
    weight: 20
    health_endpoint: /v1/models
    tags: ["cheap", "gpu-t4"]

health:
  interval_s: 5
  timeout_s: 2
  fail_threshold: 2
  success_threshold: 1

logging:
  level: info
  jsonl_path: tensormux.jsonl

2. Start the gateway

Docker Compose
services:
  tensormux:
    build: .
    ports:
      - "8080:8080"
    environment:
      - TENSORMUX_CONFIG=/app/config.yaml
    volumes:
      - ./config.yaml:/app/config.yaml:ro

3. Point your OpenAI SDK

TypeScript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY ?? "not-used-for-oss-backends",
  baseURL: "http://YOUR_TENSORMUX_HOST:8080/v1",
});

// Tensormux routes the request to a healthy backend.
const completion = await client.chat.completions.create({
  model: "llama-3.1-8b",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(completion.choices[0].message.content);
Reference

Configuration overview

All fields supported by config.yaml.

gateway.strategy
Values: least_inflight | ewma_latency | weighted_round_robin

Routing strategy for distributing requests across backends.
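The least_inflight strategy can be pictured as always sending the next request to the backend with the fewest requests currently in flight. A minimal sketch of that idea (illustrative only, not Tensormux's actual implementation):

```python
# Toy least-inflight selection: pick the backend with the fewest
# requests currently in flight. Names mirror the quickstart config.
def pick_least_inflight(inflight: dict[str, int]) -> str:
    """Return the backend name with the smallest in-flight count."""
    return min(inflight, key=inflight.get)

inflight = {"vllm-fast": 3, "sglang-cheap": 1}
print(pick_least_inflight(inflight))  # sglang-cheap
```

A real gateway would also increment the counter when a request is dispatched and decrement it on completion; this sketch only shows the selection step.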

backends[].name

Unique name for the backend. Used in logs and metrics.

backends[].url

Base URL of the inference backend (e.g., http://vllm:8000).

backends[].engine
Values: vllm | sglang | tensorrt-llm

Inference engine type. Used for tagging only.

backends[].weight

Weight for weighted round-robin routing. Higher values get more traffic.
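With the quickstart weights of 80 and 20, roughly four out of five requests land on vllm-fast. A toy sketch of weighted selection (illustrative only, not Tensormux's implementation, which uses round-robin rather than random sampling):

```python
import random

# Toy weighted backend selection; weights mirror the quickstart config.
backends = [("vllm-fast", 80), ("sglang-cheap", 20)]

def pick_weighted(backends, rng=random):
    """Pick one backend name with probability proportional to its weight."""
    names, weights = zip(*backends)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded for reproducibility
counts = {name: 0 for name, _ in backends}
for _ in range(10_000):
    counts[pick_weighted(backends, rng)] += 1
print(counts)  # vllm-fast receives roughly 80% of the 10,000 picks
```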

backends[].health_endpoint

HTTP path used for health checks. Defaults to /v1/models.

backends[].tags

List of string tags for labeling and filtering (e.g., region, GPU tier).
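Tags are plain labels, so filtering on them is just list membership. A small sketch using the tags from the quickstart config (the helper name is illustrative):

```python
# Toy tag filter over the quickstart backends (not Tensormux internals).
backends = [
    {"name": "vllm-fast", "tags": ["fast", "gpu-a10"]},
    {"name": "sglang-cheap", "tags": ["cheap", "gpu-t4"]},
]

def with_tag(backends, tag):
    """Return names of backends carrying the given tag."""
    return [b["name"] for b in backends if tag in b["tags"]]

print(with_tag(backends, "cheap"))  # ['sglang-cheap']
```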

health.interval_s

Seconds between health check probes per backend.

health.timeout_s

Seconds before an individual health check probe times out.

health.fail_threshold

Consecutive failures before marking a backend unhealthy.

health.success_threshold

Consecutive successes before restoring a backend to healthy.
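Together, fail_threshold and success_threshold form a small state machine: with the quickstart values, two consecutive failed probes mark a backend unhealthy, and a single successful probe restores it. A toy sketch of that logic (illustrative only, not Tensormux's code):

```python
class HealthTracker:
    """Toy health state machine: fail_threshold consecutive failures mark a
    backend unhealthy; success_threshold consecutive successes restore it."""

    def __init__(self, fail_threshold=2, success_threshold=1):
        self.fail_threshold = fail_threshold
        self.success_threshold = success_threshold
        self.healthy = True
        self.fails = 0
        self.successes = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the current health state."""
        if probe_ok:
            self.fails = 0
            self.successes += 1
            if not self.healthy and self.successes >= self.success_threshold:
                self.healthy = True
        else:
            self.successes = 0
            self.fails += 1
            if self.healthy and self.fails >= self.fail_threshold:
                self.healthy = False
        return self.healthy

t = HealthTracker()
t.record(False)  # still healthy: only 1 consecutive failure
t.record(False)  # unhealthy: 2 consecutive failures reached
t.record(True)   # healthy again: 1 success restores
```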

logging.level
Values: debug | info | warning | error

Log verbosity level.

logging.jsonl_path

File path for JSONL audit logs. Records every routed request with backend, latency, and status.
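Because the audit log is one JSON object per line, it is easy to post-process. The sketch below summarizes per-backend latency; the field names (backend, latency_ms, status) are assumptions for illustration, so inspect your own tensormux.jsonl for the actual schema:

```python
import json
from collections import defaultdict

# Illustrative JSONL audit-log summary. Field names (backend, latency_ms,
# status) are assumed; check a real tensormux.jsonl line for the schema.
sample_lines = [
    '{"backend": "vllm-fast", "latency_ms": 42.0, "status": 200}',
    '{"backend": "sglang-cheap", "latency_ms": 95.5, "status": 200}',
    '{"backend": "vllm-fast", "latency_ms": 38.0, "status": 200}',
]

latencies = defaultdict(list)
for line in sample_lines:
    record = json.loads(line)
    latencies[record["backend"]].append(record["latency_ms"])

for backend, vals in latencies.items():
    avg = sum(vals) / len(vals)
    print(f"{backend}: {avg:.1f} ms avg over {len(vals)} requests")
```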

Full reference documentation

Source code, contributing guide, and full API docs are in the GitHub repository.
