Prerequisites

  • Python 3.11+
  • Node.js 18+ (only for Agent-TARS)
  • uv (recommended Python package manager)

Installation

1. Clone the repository

git clone https://github.com/lexmount/browseruse-bench.git
cd browseruse-bench
2. Install Python dependencies

# Install core dependencies and register the bubench CLI
uv sync
Activate the venv so bubench is on PATH:
source .venv/bin/activate
Windows PowerShell:
.venv\Scripts\Activate.ps1
bubench run creates the agent venv defined in config.yaml (built-in defaults: .venvs/browser_use, .venvs/skyvern, .venvs/agent_tars) and installs the matching dependencies on first use. The agent venv must be configured explicitly; there is no fallback to .venv. If uv is not available, venv creation and installation fall back to python -m venv and pip.
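The uv-first, pip-fallback behavior can be sketched as the following logic. This is a hypothetical helper, not bubench's actual implementation; the function name, the POSIX bin/ layout, and the requirements-file argument are all assumptions for illustration:

```python
import shutil
from pathlib import Path

def venv_commands(venv_dir: str, requirements: str, have_uv: bool) -> list[list[str]]:
    """Hypothetical sketch: build the commands for creating an agent venv
    and installing its dependencies, preferring uv and falling back to
    python -m venv + pip when uv is not on PATH."""
    python = str(Path(venv_dir) / "bin" / "python")  # assumes POSIX layout
    if have_uv:
        return [
            ["uv", "venv", venv_dir],
            ["uv", "pip", "install", "--python", python, "-r", requirements],
        ]
    return [
        ["python", "-m", "venv", venv_dir],
        [python, "-m", "pip", "install", "-r", requirements],
    ]

# Detect uv the same way the fallback rule describes:
cmds = venv_commands(".venvs/browser_use", "requirements.txt",
                     have_uv=shutil.which("uv") is not None)
```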
3. Configure environment (.env)

cp .env.example .env
Edit .env and set evaluation and optional cloud settings:
# Evaluation (required for eval.py)
OPENAI_API_KEY=your_openai_api_key
EVAL_MODEL_NAME=gpt-4.1
EVAL_MODEL_BASE_URL=https://api.openai.com/v1

# Lexmount cloud browser (optional)
LEXMOUNT_API_KEY=your_lexmount_api_key
LEXMOUNT_PROJECT_ID=your_project_id

# AgentBay cloud browser (optional, only for BROWSER_ID=agentbay)
AGENTBAY_API_KEY=your_agentbay_api_key
Tip: If you are in China, set HF_ENDPOINT=https://hf-mirror.com to speed up HuggingFace downloads.
4. Configure agent credentials

cp configs/agents/browser-use/config.yaml.example configs/agents/browser-use/config.yaml
cp configs/agents/Agent-TARS/config.yaml.example configs/agents/Agent-TARS/config.yaml
Notes:
  • configs/agents/browser-use/config.yaml needs MODEL_TYPE, MODEL_ID, and the matching API key (BROWSER_USE_API_KEY, OPENAI_API_KEY, or GEMINI_API_KEY).
  • If BROWSER_ID=agentbay, set AGENTBAY_API_KEY in .env (do not put it in config.yaml).
  • configs/agents/Agent-TARS/config.yaml needs MODEL_PROVIDER, MODEL_ID, and MODEL_APIKEY (plus MODEL_BASEURL if required).
  • Agent config files are read as plain YAML; environment variables are not auto-substituted.
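Putting the notes above together, the two config files might look like the following. This is a sketch: the key names come from the notes above, but every value is a placeholder, and the openai choices for MODEL_TYPE/MODEL_PROVIDER are assumptions, not defaults. Because the files are plain YAML with no env substitution, each value must be written out literally:

```yaml
# configs/agents/browser-use/config.yaml (sketch; values are placeholders)
MODEL_TYPE: openai            # assumption: pick the provider you use
MODEL_ID: gpt-4.1             # assumption: any model your key can access
OPENAI_API_KEY: your_api_key  # or BROWSER_USE_API_KEY / GEMINI_API_KEY, matching MODEL_TYPE

# configs/agents/Agent-TARS/config.yaml (sketch; values are placeholders)
MODEL_PROVIDER: openai
MODEL_ID: gpt-4.1
MODEL_APIKEY: your_api_key
MODEL_BASEURL: https://api.openai.com/v1  # only if your provider requires it
```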
5. Install Agent-TARS CLI (optional)

npm install -g @agent-tars/cli@0.3.0
6. Install skills (optional)

bubench skills

Quick Run

Run your first benchmark

# Run the first 3 tasks of LexBench-Browser (L1 no-login subset)
bubench run \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --split L1 \
  --mode first_n \
  --count 3
Add --dry-run to verify configuration without executing tasks:
bubench run \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --mode single \
  --dry-run

Evaluate results

# Evaluate LexBench-Browser results
bubench eval --agent browser-use --benchmark LexBench-Browser

# Use a custom score threshold
bubench eval --agent browser-use --benchmark LexBench-Browser --score-threshold 70
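The --score-threshold flag suggests pass/fail gating along these lines. A sketch only: the 0-100 scale is inferred from the example above, and the default of 50 here is an assumption, not bubench's actual default:

```python
def pass_rate(scores, threshold=50.0):
    """Hypothetical sketch: a task counts as passed when its eval score
    meets the threshold; returns the fraction of passing tasks."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)
```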
Logs: Script execution logs are saved in output/logs/.
  • run.py: output/logs/run/
  • eval.py: output/logs/eval/
  • leaderboard: output/logs/leaderboard/

Generate leaderboard

# Collect all evaluation results and generate the HTML leaderboard
bubench leaderboard

# Start a local server to view it
bubench server
# Visit http://localhost:8000

Run Modes

Mode       Description                         Example
single     Run the first task (sanity check)   --mode single
first_n    Run the first N tasks               --mode first_n --count 5
sample_n   Randomly sample N tasks             --mode sample_n --count 10
specific   Run specified task IDs              --mode specific --task-ids id1 id2
by_id      Run one task by numeric ID field    --mode by_id --id 123
all        Run all tasks                       --mode all
Note: --task-ids expects a space-separated list.
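The table above maps to straightforward selection logic; a sketch follows. This is a hypothetical helper, not bubench's implementation, and the task-dict field names (task_id, id) are assumptions:

```python
import random

def select_tasks(tasks, mode, count=None, task_ids=None, task_id=None, seed=None):
    """Hypothetical sketch of the run-mode selection described above."""
    if mode == "single":
        return tasks[:1]
    if mode == "first_n":
        return tasks[:count]
    if mode == "sample_n":
        return random.Random(seed).sample(tasks, count)
    if mode == "specific":
        wanted = set(task_ids)  # space-separated IDs arrive as a list
        return [t for t in tasks if t["task_id"] in wanted]
    if mode == "by_id":
        return [t for t in tasks if t.get("id") == task_id]
    if mode == "all":
        return list(tasks)
    raise ValueError(f"unknown mode: {mode!r}")
```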

Common Parameters

bubench run \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --split All \
  --mode first_n \
  --count 5 \
  --timeout 600 \
  --skip-completed \
  --dry-run
Additional flags:
  • --data-source: local or huggingface.
  • --force-download: Force re-download in HuggingFace mode.
  • --agent-config: Custom agent config path (defaults to configs/agents/<agent>/config.yaml).
  • --timestamp: Resume or run in a specific directory (YYYYMMDD_HHmmss).
--timeout overrides TIMEOUT in the agent config.
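Since --timestamp names the run directory in a fixed format, generating and validating it can be sketched as follows (hypothetical helpers; only the YYYYMMDD_HHmmss format itself comes from the flag description above):

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d_%H%M%S"  # the documented YYYYMMDD_HHmmss layout

def new_run_timestamp(now=None):
    # Hypothetical: produce a run-directory name for a fresh run.
    return (now or datetime.now()).strftime(TS_FORMAT)

def is_valid_timestamp(value):
    # Hypothetical: sanity-check a --timestamp argument before resuming.
    try:
        datetime.strptime(value, TS_FORMAT)
        return True
    except ValueError:
        return False
```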

Running Multiple Agents in Parallel

bubench run uses the venv specified by the agent entry in config.yaml and will auto-create/install dependencies on first use. By default each built-in agent has a dedicated venv:
  • browser-use -> .venvs/browser_use
  • skyvern -> .venvs/skyvern
  • Agent-TARS -> .venvs/agent_tars
If an agent entry does not define venv, bubench run exits with an error instead of falling back to .venv. If you need to run conflicting agents at the same time, open two terminals and run each agent with its own venv.

Node.js Agents (No Conflicts)

Agent-TARS runs via a Node.js CLI and does not share Python dependencies with other agents. You can run it in any terminal after installing the CLI.
bubench run --agent Agent-TARS ...

Next Steps

  • Supported Agents: explore available browser agents
  • Benchmarks: learn about each benchmark
  • Cloud Browser Setup: configure Lexmount cloud browser
  • View Leaderboard: compare agent performance