Mux ships with a headless adapter for Terminal-Bench. The adapter runs the Electron backend without opening a window and exercises it through the same IPC paths we use in integration tests. This page documents how to launch benchmarks from the repository tree.
Prerequisites
- Docker must be installed and running. Terminal-Bench executes each task inside a dedicated Docker container.
- uv is available in the nix devShell (provided via flake.nix), or install it manually from docs.astral.sh/uv.
- Standard provider API keys (e.g. ANTHROPIC_API_KEY, OPENAI_API_KEY) should be exported so Mux can stream responses.
| Variable | Purpose | Default |
|---|---|---|
| MUX_AGENT_REPO_ROOT | Path copied into each task container | repo root inferred from the agent file |
| MUX_TRUNK | Branch checked out when preparing the project | main |
| MUX_WORKSPACE_ID | Workspace identifier used inside Mux | mux-bench |
| MUX_MODEL | Preferred model (supports provider/model syntax) | anthropic/claude-sonnet-4-5 |
| MUX_THINKING_LEVEL | Reasoning level (OFF, LOW, MED, HIGH, MAX) | HIGH |
| MUX_MODE | Starting mode (plan or exec) | exec |
| MUX_RUNTIME | Runtime type (local, worktree, or ssh <host>) | worktree |
| MUX_TIMEOUT_MS | Optional stream timeout in milliseconds | no timeout |
| MUX_PROVIDERS_FILE | Host path to providers.jsonc copied into each sandbox | unset (use env vars only) |
| MUX_CONFIG_ROOT | Location for Mux session data inside the container | /root/.mux |
| MUX_APP_ROOT | Path where the Mux sources are staged | /opt/mux-app |
| MUX_PROJECT_PATH | Explicit project directory inside the task container | auto-detected from common paths |
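
To make the defaults in the table above concrete, here is a minimal Python sketch of resolving these overrides from the environment. The helper `resolve_settings` and the `DEFAULTS` mapping are hypothetical and are not taken from mux_agent.py; only the variable names and default values come from the table. Variables whose default is inferred at runtime or left unset are represented as `None`.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the table above. None marks values that are
# inferred at runtime or simply unset. This mapping is illustrative only
# and is not the adapter's actual data structure.
DEFAULTS: dict[str, Optional[str]] = {
    "MUX_AGENT_REPO_ROOT": None,
    "MUX_TRUNK": "main",
    "MUX_WORKSPACE_ID": "mux-bench",
    "MUX_MODEL": "anthropic/claude-sonnet-4-5",
    "MUX_THINKING_LEVEL": "HIGH",
    "MUX_MODE": "exec",
    "MUX_RUNTIME": "worktree",
    "MUX_TIMEOUT_MS": None,
    "MUX_PROVIDERS_FILE": None,
    "MUX_CONFIG_ROOT": "/root/.mux",
    "MUX_APP_ROOT": "/opt/mux-app",
    "MUX_PROJECT_PATH": None,
}


def resolve_settings(env: Mapping[str, str] = os.environ) -> dict[str, Optional[str]]:
    """Return each variable's effective value.

    An override wins when it is set and non-empty; otherwise the table
    default (possibly None) is used.
    """
    return {name: (env.get(name) or default) for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in resolve_settings().items():
        print(f"{name}={value if value is not None else '<unset>'}")
```

Running the script after exporting, say, MUX_MODE=plan prints the effective configuration, which can be a quick way to check what a benchmark run would pick up.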
Running Terminal-Bench
All commands below should be run from the repository root.
Quick smoke test (single task)
Full dataset
make:
TB_DATASET defaults to terminal-bench-core==0.1.1, but can be overridden (e.g. make benchmark-terminal TB_DATASET=terminal-bench-core==head).
Use --agent-kwarg mode=plan to exercise the plan/execute workflow—the CLI will gather a plan first, then automatically approve it and switch to execution. Leaving the flag off (or setting mode=exec) skips the planning phase.
Use TB_CONCURRENCY=<n> to set the number of concurrently running tasks and TB_LIVESTREAM=1 to stream log output live instead of waiting for the run to finish. These map to Terminal-Bench’s --n-concurrent and --livestream flags.
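
As a rough illustration of how these knobs combine, the Python sketch below assembles a make benchmark-terminal invocation. The helper `build_benchmark_command` is hypothetical, and it assumes the TB_* variables can be passed to make as command-line assignments, in the same way as the TB_DATASET example above. It only builds the argument list and does not run anything.

```python
from typing import Optional


def build_benchmark_command(
    dataset: Optional[str] = None,
    concurrency: Optional[int] = None,
    livestream: bool = False,
) -> list[str]:
    """Build argv for `make benchmark-terminal` with optional TB_* overrides."""
    argv = ["make", "benchmark-terminal"]
    if dataset is not None:
        # Overrides the default dataset, e.g. terminal-bench-core==head.
        argv.append(f"TB_DATASET={dataset}")
    if concurrency is not None:
        if concurrency < 1:
            raise ValueError("concurrency must be a positive integer")
        argv.append(f"TB_CONCURRENCY={concurrency}")
    if livestream:
        argv.append("TB_LIVESTREAM=1")
    return argv


if __name__ == "__main__":
    # Print the command rather than executing it.
    print(" ".join(build_benchmark_command("terminal-bench-core==head", 4, True)))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` from the repository root if you want to launch the run from a script.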
How the Adapter Works
The adapter lives in benchmarks/terminal_bench/mux_agent.py. For each task it:
- Copies the Mux repository (package manifests + src/) into /tmp/mux-app inside the container.
- Ensures Bun exists, then runs bun install --frozen-lockfile.
- Launches mux run (src/cli/run.ts) to prepare workspace metadata and stream the instruction, storing state under MUX_CONFIG_ROOT (default /root/.mux).
MUX_MODEL accepts either the Mux colon form (anthropic:claude-sonnet-4-5) or the Terminal-Bench slash form (anthropic/claude-sonnet-4-5); the adapter normalises whichever you provide.
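
The normalisation described above might look roughly like the following Python sketch. The function `normalize_model` is hypothetical and is not the adapter's actual code. It also assumes the target is the Mux colon form, which the text does not state explicitly.

```python
def normalize_model(model: str) -> str:
    """Return the colon form (provider:model) of a model string.

    Assumption: the first ':' or '/' separates provider from model name.
    Strings already in colon form are returned unchanged.
    """
    model = model.strip()
    if not model:
        raise ValueError("model string is empty")

    colon = model.find(":")
    slash = model.find("/")

    # A colon that appears before any slash means the colon form is in use.
    if colon != -1 and (slash == -1 or colon < slash):
        return model
    # Otherwise treat the first slash as the provider separator.
    if slash != -1:
        return model[:slash] + ":" + model[slash + 1:]
    # No provider prefix at all; leave the value untouched.
    return model


if __name__ == "__main__":
    print(normalize_model("anthropic/claude-sonnet-4-5"))
    print(normalize_model("anthropic:claude-sonnet-4-5"))
```

Both calls in the `__main__` block yield the same colon-form string, so either spelling of MUX_MODEL would resolve to one canonical value under this assumption.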
Troubleshooting
- command not found: bun – ensure the container can reach Bun’s install script, or pre-install Bun in your base image. The adapter aborts if the install step fails.
- Workspace creation errors – set MUX_PROJECT_PATH to the project directory inside the task container if auto-discovery misses it.
- Streaming timeouts – pass --n-tasks 1 while iterating on fixes, or set MUX_TIMEOUT_MS=180000 to reinstate a timeout if needed.