Documentation

Get started with Tensormux in minutes.

Install

Clone the repository and run with Docker Compose or install from source.

Docker Compose
git clone https://github.com/KrxGu/Tensormux.git
cd Tensormux
docker compose up --build
From source (Python)
git clone https://github.com/KrxGu/Tensormux.git
cd Tensormux
pip install -e .

Quickstart

Create a config file, start the gateway, and point your OpenAI SDK at it.

1. Create config.yaml

config.yaml
gateway:
  host: 0.0.0.0
  port: 8080
  strategy: least_inflight

backends:
  - name: vllm-fast
    url: http://vllm-fast:8000
    engine: vllm
    model: llama-3.1-8b
    weight: 80
    health_endpoint: /v1/models
    tags: ["fast", "gpu-a10"]

  - name: sglang-cheap
    url: http://sglang-cheap:8000
    engine: sglang
    model: llama-3.1-8b
    weight: 20
    health_endpoint: /v1/models
    tags: ["cheap", "gpu-t4"]

health:
  interval_s: 5
  timeout_s: 2
  fail_threshold: 2
  success_threshold: 1

logging:
  level: info
  jsonl_path: tensormux.jsonl

2. Start the gateway

Docker Compose
services:
  tensormux:
    build: .
    ports:
      - "8080:8080"
    environment:
      - TENSORMUX_CONFIG=/app/config.yaml
    volumes:
      - ./config.yaml:/app/config.yaml:ro

3. Point your OpenAI SDK

TypeScript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY ?? "not-used-for-oss-backends",
  baseURL: "http://YOUR_TENSORMUX_HOST:8080/v1",
});
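Any OpenAI-compatible client works, because Tensormux exposes the /v1 API. As a dependency-free sketch in Python (host and model are placeholders taken from the config above, not fixed values), a chat completion request can be built with the standard library:

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request against the gateway.
# The host and model below are placeholders; substitute your deployment's values.
payload = {
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://YOUR_TENSORMUX_HOST:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment against a running gateway
print(req.get_full_url())
```

The gateway then routes the request to one of the configured backends according to gateway.strategy.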

Configuration overview

gateway.strategy

Routing strategy for distributing requests across backends.

Options: least_inflight, ewma_latency, weighted_round_robin
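As an illustration of the idea behind least_inflight (a minimal sketch, not Tensormux's actual implementation), the gateway tracks how many requests are currently outstanding per backend and sends new traffic to the least loaded one:

```python
# Illustrative sketch of least-inflight selection; the function and the
# counts dict are hypothetical, not Tensormux's internal API.
def pick_least_inflight(inflight: dict[str, int]) -> str:
    """Return the backend with the fewest requests currently in flight."""
    return min(inflight, key=inflight.get)

counts = {"vllm-fast": 3, "sglang-cheap": 1}
print(pick_least_inflight(counts))  # sglang-cheap
```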
backends[].name

Unique name for the backend. Used in logs and metrics.

backends[].url

Base URL of the inference backend (e.g., http://vllm:8000).

backends[].engine

Inference engine type. Used for tagging only.

Options: vllm, sglang, tensorrt-llm
backends[].weight

Weight for weighted round-robin routing. Higher values receive more traffic.
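To see how weights shape traffic, here is a sketch of smooth weighted round-robin (the nginx-style variant; Tensormux's internal algorithm may differ, but the proportional effect of weight is the same). With the 80/20 weights from the config above, roughly four of every five requests go to vllm-fast:

```python
# Hypothetical sketch of smooth weighted round-robin selection.
def smooth_wrr(weights: dict[str, int], n: int) -> list[str]:
    """Pick n backends, interleaved proportionally to their weights."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        for name, w in weights.items():
            current[name] += w           # accumulate each backend's weight
        chosen = max(current, key=current.get)
        current[chosen] -= total         # penalize the chosen backend
        picks.append(chosen)
    return picks

picks = smooth_wrr({"vllm-fast": 80, "sglang-cheap": 20}, 10)
print(picks.count("vllm-fast"), picks.count("sglang-cheap"))  # 8 2
```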

backends[].health_endpoint

HTTP path used for health checks on this backend. Defaults to /v1/models.

backends[].tags

List of string tags for labeling and filtering backends (e.g., region, GPU tier).

health.interval_s

Seconds between health check probes for each backend.

health.timeout_s

Seconds before a single health check probe is considered failed (shown as timeout_s in the config above).

health.fail_threshold

Number of consecutive failures before marking a backend unhealthy.

health.success_threshold

Number of consecutive successes before marking an unhealthy backend healthy again.
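The two thresholds form a simple state machine, sketched below (an illustration of the semantics described above, not Tensormux's internals). With fail_threshold: 2 and success_threshold: 1 from the example config, two consecutive failed probes mark a backend unhealthy, and a single successful probe restores it:

```python
# Illustrative threshold-based health tracker; names are hypothetical.
class HealthTracker:
    def __init__(self, fail_threshold: int, success_threshold: int):
        self.fail_threshold = fail_threshold
        self.success_threshold = success_threshold
        self.healthy = True
        self._fails = 0
        self._successes = 0

    def record(self, ok: bool) -> None:
        """Record one health probe result and update the backend state."""
        if ok:
            self._fails = 0
            self._successes += 1
            if not self.healthy and self._successes >= self.success_threshold:
                self.healthy = True
        else:
            self._successes = 0
            self._fails += 1
            if self.healthy and self._fails >= self.fail_threshold:
                self.healthy = False

t = HealthTracker(fail_threshold=2, success_threshold=1)
t.record(False)     # one failure: still healthy
t.record(False)     # second consecutive failure: marked unhealthy
print(t.healthy)    # False
t.record(True)      # one success: healthy again
print(t.healthy)    # True
```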

logging.level

Log verbosity level.

Options: debug, info, warning, error
logging.jsonl_path

File path for JSONL audit logs. Logs all routed requests with backend, latency, and status.
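Because the audit log is JSONL (one JSON object per routed request), it is easy to analyze with standard tools. A sketch of computing mean latency per backend (the field names here are illustrative assumptions; check your own tensormux.jsonl for the exact keys):

```python
import json
from collections import defaultdict

# Hypothetical JSONL audit lines; actual field names may differ.
lines = [
    '{"backend": "vllm-fast", "latency_ms": 120.0, "status": 200}',
    '{"backend": "vllm-fast", "latency_ms": 180.0, "status": 200}',
    '{"backend": "sglang-cheap", "latency_ms": 300.0, "status": 200}',
]

latencies = defaultdict(list)
for line in lines:
    entry = json.loads(line)                 # one JSON object per line
    latencies[entry["backend"]].append(entry["latency_ms"])

means = {b: sum(v) / len(v) for b, v in latencies.items()}
print(means)  # {'vllm-fast': 150.0, 'sglang-cheap': 300.0}
```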

Full reference documentation is available in the GitHub repository.