Documentation

Get started with Tensormux in minutes.

Install

Clone the repository and run with Docker Compose or install from source.

Docker Compose
git clone https://github.com/KrxGu/Tensormux.git
cd Tensormux
docker compose up --build
From source (Python)
git clone https://github.com/KrxGu/Tensormux.git
cd Tensormux
pip install -e .

Quickstart

Create a config file, start the gateway, and point your OpenAI SDK at it.

1. Create config.yaml

config.yaml
gateway:
  host: 0.0.0.0
  port: 8080
  strategy: least_inflight

backends:
  - name: vllm-fast
    url: http://vllm-fast:8000
    engine: vllm
    model: llama-3.1-8b
    weight: 80
    health_endpoint: /v1/models
    tags: ["fast", "gpu-a10"]

  - name: sglang-cheap
    url: http://sglang-cheap:8000
    engine: sglang
    model: llama-3.1-8b
    weight: 20
    health_endpoint: /v1/models
    tags: ["cheap", "gpu-t4"]

health:
  interval_s: 5
  timeout_s: 2
  fail_threshold: 2
  success_threshold: 1

logging:
  level: info
  jsonl_path: tensormux.jsonl

2. Start the gateway

Docker Compose
services:
  tensormux:
    build: .
    ports:
      - "8080:8080"
    environment:
      - TENSORMUX_CONFIG=/app/config.yaml
    volumes:
      - ./config.yaml:/app/config.yaml:ro

3. Point your OpenAI SDK

TypeScript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY ?? "not-used-for-oss-backends",
  baseURL: "http://YOUR_TENSORMUX_HOST:8080/v1",
});
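Any OpenAI-compatible client works, because Tensormux exposes the /v1 API. As a dependency-free sketch in Python (host and model are placeholders taken from the config above, not fixed values), a chat completion request can be built with the standard library:

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request against the gateway.
# The host and model below are placeholders; substitute your deployment's values.
payload = {
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://YOUR_TENSORMUX_HOST:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment against a running gateway
print(req.get_full_url())
```

The gateway then routes the request to one of the configured backends according to gateway.strategy.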

Configuration overview

gateway.strategy

Routing strategy for distributing requests across backends.

Options: least_inflight, ewma_latency, weighted_round_robin
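As an illustration of the idea behind least_inflight (a minimal sketch, not Tensormux's actual implementation), the gateway tracks how many requests are currently outstanding per backend and sends new traffic to the least loaded one:

```python
# Illustrative sketch of least-inflight selection; the function and the
# counts dict are hypothetical, not Tensormux's internal API.
def pick_least_inflight(inflight: dict[str, int]) -> str:
    """Return the backend with the fewest requests currently in flight."""
    return min(inflight, key=inflight.get)

counts = {"vllm-fast": 3, "sglang-cheap": 1}
print(pick_least_inflight(counts))  # sglang-cheap
```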
backends[].name

Unique name for the backend. Used in logs and metrics.

backends[].url

Base URL of the inference backend (e.g., http://vllm:8000).

backends[].engine

Inference engine type. Used for tagging only.

Options: vllm, sglang, tensorrt-llm
backends[].weight

Weight for weighted round-robin routing. Higher values receive more traffic.
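To see how weights shape traffic, here is a sketch of smooth weighted round-robin (the nginx-style variant; Tensormux's internal algorithm may differ, but the proportional effect of weight is the same). With the 80/20 weights from the config above, roughly four of every five requests go to vllm-fast:

```python
# Hypothetical sketch of smooth weighted round-robin selection.
def smooth_wrr(weights: dict[str, int], n: int) -> list[str]:
    """Pick n backends, interleaved proportionally to their weights."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        for name, w in weights.items():
            current[name] += w           # accumulate each backend's weight
        chosen = max(current, key=current.get)
        current[chosen] -= total         # penalize the chosen backend
        picks.append(chosen)
    return picks

picks = smooth_wrr({"vllm-fast": 80, "sglang-cheap": 20}, 10)
print(picks.count("vllm-fast"), picks.count("sglang-cheap"))  # 8 2
```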

backends[].health_endpoint

HTTP path used for health checks on this backend. Defaults to /v1/models.

backends[].tags

List of string tags for labeling and filtering backends (e.g., region, GPU tier).

health.interval_s

Seconds between health check probes for each backend.

health.timeout_s

Seconds before a single health check probe is considered failed (shown as timeout_s in the config above).

health.fail_threshold

Number of consecutive failures before marking a backend unhealthy.

health.success_threshold

Number of consecutive successes before marking an unhealthy backend healthy again.
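The two thresholds form a simple state machine, sketched below (an illustration of the semantics described above, not Tensormux's internals). With fail_threshold: 2 and success_threshold: 1 from the example config, two consecutive failed probes mark a backend unhealthy, and a single successful probe restores it:

```python
# Illustrative threshold-based health tracker; names are hypothetical.
class HealthTracker:
    def __init__(self, fail_threshold: int, success_threshold: int):
        self.fail_threshold = fail_threshold
        self.success_threshold = success_threshold
        self.healthy = True
        self._fails = 0
        self._successes = 0

    def record(self, ok: bool) -> None:
        """Record one health probe result and update the backend state."""
        if ok:
            self._fails = 0
            self._successes += 1
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._successes = 0
            self._fails += 1
            if self.healthy and self._fails >= self.fail_threshold:
                self.healthy = False

t = HealthTracker(fail_threshold=2, success_threshold=1)
t.record(False)     # one failure: still healthy
t.record(False)     # second consecutive failure: marked unhealthy
print(t.healthy)    # False
t.record(True)      # one success: healthy again
print(t.healthy)    # True
```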

logging.level

Log verbosity level.

Options: debug, info, warning, error
logging.jsonl_path

File path for JSONL audit logs. Logs all routed requests with backend, latency, and status.
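Because the audit log is JSONL (one JSON object per routed request), it is easy to analyze with standard tools. A sketch of computing mean latency per backend (the field names here are illustrative assumptions; check your own tensormux.jsonl for the exact keys):

```python
import json
from collections import defaultdict

# Hypothetical JSONL audit lines; actual field names may differ.
lines = [
    '{"backend": "vllm-fast", "latency_ms": 120.0, "status": 200}',
    '{"backend": "vllm-fast", "latency_ms": 180.0, "status": 200}',
    '{"backend": "sglang-cheap", "latency_ms": 300.0, "status": 200}',
]

latencies = defaultdict(list)
for line in lines:
    entry = json.loads(line)                 # one JSON object per line
    latencies[entry["backend"]].append(entry["latency_ms"])

means = {b: sum(v) / len(v) for b, v in latencies.items()}
print(means)  # {'vllm-fast': 150.0, 'sglang-cheap': 300.0}
```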

Full reference documentation is available in the GitHub repository.