browseruse-bench provides automated leaderboard generation to compare agent performance across benchmarks.

Features

Multi-metric Comparison

Success rate, steps, time, and token usage

Interactive UI

Filtering, sorting, and detailed views

Task-level Analysis

Inspect per-task execution details and trajectories

Error Analysis

Categorize and visualize failure cases

Quickstart

Generate leaderboard

# Collect all evaluation results and generate HTML leaderboard
uv run scripts/generate_leaderboard.py

Start server

# Foreground run for local development
uv run scripts/benchmark_server.py
# Visit http://localhost:8000
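
To confirm the server is up before opening the UI, a quick request against the default address works. This is only a minimal sketch in Python, assuming the server listens on http://localhost:8000 as shown above:

# Minimal reachability check for the local leaderboard server (default port 8000)
import urllib.request

with urllib.request.urlopen("http://localhost:8000", timeout=5) as resp:
    print(resp.status)  # 200 means the leaderboard is being served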

UI Preview

Overview

The overview shows success rate, steps, and time for each Agent × Benchmark combination:
  • Compare multiple agents
  • Click a row to view task details
  • Click error category bars to filter failures

Task details

Each task includes:
  • Task ID and description
  • Action history (expandable)
  • Trajectory screenshots (paginated)
  • Time and token statistics
  • Evaluation results and error analysis

Submission format

If you want to submit your own results, use the following structure:

Directory structure

experiments/
`-- <BenchmarkName>/
    `-- <AgentName>/
        `-- <Timestamp>/           # e.g., 20251208_114207
            `-- tasks/
                `-- <task_id>/
                    |-- result.json     # Required: task run result
                    `-- trajectory/     # Optional: screenshot sequence
                        |-- 0_screenshot.png
                        `-- ...
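
As a rough illustration, the layout above can be scaffolded with a few lines of Python. The benchmark name, agent name, and task ID below are placeholders, not values required by the system:

# Sketch: create the expected submission layout for a single task.
# "MyBenchmark", "MyAgent", and the task_id are placeholder values.
from datetime import datetime
from pathlib import Path

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
task_dir = (Path("experiments") / "MyBenchmark" / "MyAgent" / timestamp
            / "tasks" / "005be9dd91c95669d6ddde9ae667125c")
(task_dir / "trajectory").mkdir(parents=True, exist_ok=True)  # trajectory/ is optional
# result.json (required) is written directly into task_dir (see format below).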

result.json format

{
  "task_id": "005be9dd91c95669d6ddde9ae667125c",
  "task": "Search for iPhone 15 on Taobao",
  "action_history": ["Open Taobao", "Type iPhone 15", "Click search"],
  "model_id": "gpt-4o",
  "browser_id": "Chrome-Local",
  "metrics": {
    "steps": 5,
    "end_to_end_ms": 9879,
    "usage": {
      "total_tokens": 1234,
      "total_cost": 0.0123
    }
  },
  "config": {
    "timeout_seconds": 300
  }
}
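
For reference, a conforming result.json can be written like this. It is a minimal sketch reusing the example values above; your agent would fill in its own run data and write the file into the task directory:

# Sketch: produce a result.json matching the format above (values are illustrative).
import json
from pathlib import Path

result = {
    "task_id": "005be9dd91c95669d6ddde9ae667125c",
    "task": "Search for iPhone 15 on Taobao",
    "action_history": ["Open Taobao", "Type iPhone 15", "Click search"],
    "model_id": "gpt-4o",
    "browser_id": "Chrome-Local",
    "metrics": {
        "steps": 5,
        "end_to_end_ms": 9879,
        "usage": {"total_tokens": 1234, "total_cost": 0.0123},
    },
    "config": {"timeout_seconds": 300},
}

Path("result.json").write_text(json.dumps(result, indent=2, ensure_ascii=False))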

Evaluation output

After submission, the system evaluates the submitted tasks and generates:
experiments/
`-- <BenchmarkName>/
    `-- <AgentName>/
        `-- <Timestamp>/
            |-- tasks/                    # Raw data
            `-- tasks_eval_result/        # Auto-generated
                `-- <EvalName>_results.json
Added fields:
  • predicted_label: 1 = success, 0 = failure
  • evaluation_details: score, grader response, failure category
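
As an illustration of how the added fields can be consumed, the sketch below lists failed tasks from an evaluation results file. It assumes the file is a JSON array of per-task records carrying predicted_label and evaluation_details; the actual layout of <EvalName>_results.json, as well as the benchmark, agent, and evaluator names in the path, are placeholders:

# Sketch: list failed tasks from an evaluation results file.
# Assumes a JSON array of per-task records; the real file layout may differ.
import json
from pathlib import Path

eval_file = Path("experiments/MyBenchmark/MyAgent/20251208_114207/"
                 "tasks_eval_result/MyEval_results.json")
records = json.loads(eval_file.read_text())

# Keep only failed tasks (predicted_label == 0) and print their evaluation details.
failures = [r for r in records if r.get("predicted_label") == 0]
for r in failures:
    print(r.get("task_id"), r.get("evaluation_details"))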

Service commands

# Check service status
sudo bash scripts/manage_benchmark_service.sh status

# View logs
sudo bash scripts/manage_benchmark_service.sh logs

# Restart service
sudo bash scripts/manage_benchmark_service.sh restart

# Stop service
sudo bash scripts/manage_benchmark_service.sh stop