Features
Trajectory Playback
Browse step-by-step screenshots for each task
Evaluation Details
View eval prompts, scores, verdicts, and rubric criteria
API Log Inspection
Inspect per-step API calls and system prompts
Judge Experiment Sets
Compare evaluation methods across tasks with variance analysis
Quick Start
Start the server
Options
| Flag | Default | Description |
|---|---|---|
| --host | 127.0.0.1 | Bind address (use 0.0.0.0 to expose to the network) |
| --port | 8080 | Server port |
| --watch | off | Auto-regenerate index when experiment files change |
| --watch-interval | 3.0 | Watch poll interval in seconds |
| --generate-only | off | Regenerate experiments.json and exit without starting the server |
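
To make the --watch and --watch-interval options concrete, here is a minimal sketch of polling-based change detection. It is not the server's actual implementation; the function names `snapshot` and `watch`, the `on_change` callback, and the `max_polls` parameter are hypothetical and exist only for illustration.

```python
import os
import time
from typing import Callable, Dict, Optional


def snapshot(root: str) -> Dict[str, int]:
    """Map every file under root to its modification time in nanoseconds."""
    state: Dict[str, int] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                state[path] = os.stat(path).st_mtime_ns
            except OSError:
                # The file vanished between listing and stat; skip it.
                continue
    return state


def watch(
    root: str,
    on_change: Callable[[], None],
    interval: float = 3.0,  # mirrors the documented --watch-interval default
    max_polls: Optional[int] = None,  # hypothetical: bounds the loop for testing
) -> None:
    """Call on_change whenever files under root are added, removed, or modified."""
    previous = snapshot(root)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = snapshot(root)
        if current != previous:
            on_change()
            previous = current
        polls += 1
```

Polling trades a little latency (at most one interval) for portability, since it needs no platform-specific file-notification API.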
Security note: The server binds to 127.0.0.1 by default, so only the local machine can reach it. The /api/regenerate endpoint is unauthenticated, and /experiments/* serves raw files (logs, screenshots, configs). Only pass --host 0.0.0.0 on trusted networks; see the Remote / Intranet Sharing section below.
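
The difference between the two bind addresses in the note above can be checked with Python's standard `ipaddress` module. This is only an illustrative sketch; `bind_scope` is a hypothetical helper and not part of the tool.

```python
import ipaddress


def bind_scope(host: str) -> str:
    """Classify a bind address as 'loopback', 'all-interfaces', or 'specific'."""
    if host == "localhost":
        return "loopback"
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (for example a hostname); treat it as specific.
        return "specific"
    if addr.is_loopback:
        return "loopback"  # reachable only from the local machine
    if addr.is_unspecified:
        return "all-interfaces"  # 0.0.0.0 or ::, reachable from the network
    return "specific"


for host in ("127.0.0.1", "0.0.0.0"):
    print(host, "->", bind_scope(host))
```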
Generate index only
Passing --generate-only scans experiments/ and writes browseruse_bench/visualization/data/experiments.json without starting the server. This is useful for CI or for pre-generating the index before serving.
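
In a CI job, a small check like the following could confirm that the generated index exists and parses. It is a sketch under assumptions: `load_index` is a hypothetical helper, and it deliberately does not assume anything about the index schema, which this page does not describe.

```python
import json
from pathlib import Path

# Output path as documented above, relative to the repository root.
INDEX_PATH = Path("browseruse_bench/visualization/data/experiments.json")


def load_index(path: Path = INDEX_PATH):
    """Return the parsed index; raise if it is missing, empty, or invalid JSON."""
    if not path.is_file():
        raise FileNotFoundError(f"index not found: {path}")
    text = path.read_text(encoding="utf-8")
    if not text.strip():
        raise ValueError(f"index is empty: {path}")
    # json.JSONDecodeError is a subclass of ValueError.
    return json.loads(text)
```

Calling `load_index()` after the generate step makes the job fail fast when the index was not written.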
Experiment Directory Layout
The visualization server reads the same experiment directory structure as the leaderboard.
Remote / Intranet Sharing
Run the server in a tmux session so it stays alive after you disconnect from SSH.
Install tmux (if not already installed):
When the server is bound to 0.0.0.0, the startup log prints the detected LAN URL on its first lines; attach with tmux attach -t viz to read it. To look up the IP manually:
Then open http://<server-ip>:8090/ in your browser.
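
If the startup log is not at hand, the server's LAN address can also be guessed from Python. The sketch below uses the common UDP-socket trick; `lan_ip` is a hypothetical helper, the result is a best-effort guess on multi-homed machines, and port 8090 is taken from the URL above.

```python
import socket


def lan_ip() -> str:
    """Best-effort guess at this machine's LAN IPv4 address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends no packets; it only asks the OS
        # which local interface would be used to reach this address.
        sock.connect(("10.255.255.255", 1))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example an offline machine); use loopback.
        return "127.0.0.1"
    finally:
        sock.close()


print(f"http://{lan_ip()}:8090/")
```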
Firewall (if other machines cannot connect):
Leaderboard vs. Visualization
| | Leaderboard | Visualization |
|---|---|---|
| Purpose | Agent ranking overview | Task-level detail exploration |
| Output | Self-contained HTML file | Dynamic SPA served locally |
| Granularity | Run-level aggregates | Per-task trajectories and logs |
| Sharing | Share the HTML file directly | Run the server on a shared host |