browseruse-bench provides automated leaderboard generation to compare agent performance across benchmarks.

Features

Multi-metric Comparison

Success rate, steps, time, and token usage

Interactive UI

Filtering, sorting, and detailed views

Task-level Analysis

Inspect per-task execution details and trajectories

Error Analysis

Categorize and visualize failure cases

Quickstart

Generate leaderboard

# Collect all evaluation results and generate HTML leaderboard
uv run scripts/generate_leaderboard.py

Start server

# Foreground run for local development
uv run scripts/benchmark_server.py
# Visit http://localhost:8000
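
To confirm the server is up before opening the UI, a quick request against the default address works. This is only a minimal sketch in Python, assuming the server listens on http://localhost:8000 as shown above:

# Minimal reachability check for the local leaderboard server (default port 8000)
import urllib.request

with urllib.request.urlopen("http://localhost:8000", timeout=5) as resp:
    print(resp.status)  # 200 means the leaderboard is being served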

UI Preview

Overview

The overview shows success rate, steps, and time for each Agent × Benchmark combination:
  • Compare multiple agents
  • Click a row to view task details
  • Click error category bars to filter failures

Task details

Each task includes:
  • Task ID and description
  • Action history (expandable)
  • Trajectory screenshots (paginated)
  • Time and token statistics
  • Evaluation results and error analysis

Submission format

If you want to submit your own results, use the following structure:

Directory structure

experiments/
`-- <BenchmarkName>/
    `-- <AgentName>/
        `-- <Timestamp>/           # e.g., 20251208_114207
            `-- tasks/
                `-- <task_id>/
                    |-- result.json     # Required: task run result
                    `-- trajectory/     # Optional: screenshot sequence
                        |-- 0_screenshot.png
                        `-- ...
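
As a rough illustration, the layout above can be scaffolded with a few lines of Python. The benchmark name, agent name, and task ID below are placeholders, not values required by the system:

# Sketch: create the expected submission layout for a single task.
# "MyBenchmark", "MyAgent", and the task_id are placeholder values.
from datetime import datetime
from pathlib import Path

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
task_dir = (Path("experiments") / "MyBenchmark" / "MyAgent" / timestamp
            / "tasks" / "005be9dd91c95669d6ddde9ae667125c")
(task_dir / "trajectory").mkdir(parents=True, exist_ok=True)  # trajectory/ is optional
# result.json (required) is written directly into task_dir (see format below).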

result.json format

{
  "task_id": "005be9dd91c95669d6ddde9ae667125c",
  "task": "Search for iPhone 15 on Taobao",
  "action_history": ["Open Taobao", "Type iPhone 15", "Click search"],
  "model_id": "gpt-4o",
  "browser_id": "Chrome-Local",
  "metrics": {
    "steps": 5,
    "end_to_end_ms": 9879,
    "usage": {
      "total_tokens": 1234,
      "total_cost": 0.0123
    }
  },
  "config": {
    "timeout_seconds": 300
  }
}
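
For reference, a conforming result.json can be written like this. It is a minimal sketch reusing the example values above; your agent would fill in its own run data and write the file into the task directory:

# Sketch: produce a result.json matching the format above (values are illustrative).
import json
from pathlib import Path

result = {
    "task_id": "005be9dd91c95669d6ddde9ae667125c",
    "task": "Search for iPhone 15 on Taobao",
    "action_history": ["Open Taobao", "Type iPhone 15", "Click search"],
    "model_id": "gpt-4o",
    "browser_id": "Chrome-Local",
    "metrics": {
        "steps": 5,
        "end_to_end_ms": 9879,
        "usage": {"total_tokens": 1234, "total_cost": 0.0123},
    },
    "config": {"timeout_seconds": 300},
}

Path("result.json").write_text(json.dumps(result, indent=2, ensure_ascii=False))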

Evaluation output

After submission, the system evaluates the submitted tasks and generates:
experiments/
`-- <BenchmarkName>/
    `-- <AgentName>/
        `-- <Timestamp>/
            |-- tasks/                    # Raw data
            `-- tasks_eval_result/        # Auto-generated
                `-- <EvalName>_results.json
Added fields:
  • predicted_label: 1 = success, 0 = failure
  • evaluation_details: score, grader response, failure category
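
As an illustration of how the added fields can be consumed, the sketch below lists failed tasks from an evaluation results file. It assumes the file is a JSON array of per-task records carrying predicted_label and evaluation_details; the actual layout of <EvalName>_results.json, as well as the benchmark, agent, and evaluator names in the path, are placeholders:

# Sketch: list failed tasks from an evaluation results file.
# Assumes a JSON array of per-task records; the real file layout may differ.
import json
from pathlib import Path

eval_file = Path("experiments/MyBenchmark/MyAgent/20251208_114207/"
                 "tasks_eval_result/MyEval_results.json")
records = json.loads(eval_file.read_text())

# Keep only failed tasks (predicted_label == 0) and print their evaluation details.
failures = [r for r in records if r.get("predicted_label") == 0]
for r in failures:
    print(r.get("task_id"), r.get("evaluation_details"))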

Service commands

# Check service status
sudo bash scripts/manage_benchmark_service.sh status

# View logs
sudo bash scripts/manage_benchmark_service.sh logs

# Restart service
sudo bash scripts/manage_benchmark_service.sh restart

# Stop service
sudo bash scripts/manage_benchmark_service.sh stop