Getting Started

This guide will help you set up and run browseruse-bench.

Prerequisites

  • Python 3.8+
  • Node.js (for Agent-TARS)
  • PostgreSQL (optional, for database integration)
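
Before installing, it can help to confirm the required tools are on your PATH; this is just a sanity check, not part of the setup itself:

python --version   # should report 3.8 or newer
node --version     # only needed if you plan to run Agent-TARS
psql --version     # optional, only for the PostgreSQL integration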

Installation

1. Clone the Repository

git clone https://github.com/lexmount/browseruse-bench.git
cd browseruse-bench

2. Install Dependencies

# Install core package
pip install -e .

# Install with browser-use agent support
pip install -e ".[browser-use]"

# Install all optional dependencies
pip install -e ".[all]"

3. Install Agent-TARS (Optional)

npm install -g @agent-tars/cli@latest
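
You can check that the package installed globally with npm (this only verifies the package is present, not that it is configured):

npm list -g @agent-tars/cli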

4. Configure Environment

cp .env.example .env
vim .env  # Edit with your API keys

Required environment variables:
  • OPENAI_API_KEY: OpenAI API key for evaluation
  • LEXMOUNT_API_KEY: Lexmount cloud browser API key
  • LEXMOUNT_PROJECT_ID: Lexmount project ID
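
A minimal .env sketch with placeholder values (the variable names are the ones listed above; replace the placeholders with your real keys):

OPENAI_API_KEY=your-openai-api-key
LEXMOUNT_API_KEY=your-lexmount-api-key
LEXMOUNT_PROJECT_ID=your-lexmount-project-id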

5. Configure Agents

cp agents/Agent-TARS/config.yaml.example agents/Agent-TARS/config.yaml
cp agents/browser-use/config.yaml.example agents/browser-use/config.yaml
vim agents/Agent-TARS/config.yaml
vim agents/browser-use/config.yaml
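
Before moving on, it is worth confirming that both copied config files exist where the runner expects them:

ls agents/Agent-TARS/config.yaml agents/browser-use/config.yaml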

Quick Start

Run a Benchmark

# Run first 3 tasks of Online-Mind2Web with Agent-TARS
uv run scripts/run.py --agent Agent-TARS --benchmark Online-Mind2Web --mode first_n --count 3

# Run the no-login subset of LexBench-Browser
uv run scripts/run.py --agent browser-use --benchmark LexBench-Browser --split no_login --mode first_n --count 5
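
The flags shown above are the ones used in this guide. Assuming the scripts use a standard argument parser, you can list every supported option with --help:

uv run scripts/run.py --help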

Evaluate Results

# Evaluate LexBench-Browser results
uv run scripts/eval.py --agent browser-use --benchmark LexBench-Browser

# Evaluate with a specific score threshold
uv run scripts/eval.py --agent browser-use --benchmark LexBench-Browser --score-threshold 70

Logs: script execution logs are saved under output/logs/:
  • run.py: output/logs/run/
  • eval.py: output/logs/eval/
  • generate_leaderboard.py: output/logs/leaderboard/
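
To find the most recent log for a script, list its log directory sorted by modification time (file names will vary between runs):

ls -lt output/logs/run/ | head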

Generate Leaderboard

uv run scripts/generate_leaderboard.py

Next Steps