BrowseComp is a benchmark of competition-style browser operation tasks. It evaluates how well an agent handles a broad range of browser operations.

Overview

| Attribute | Value |
| --- | --- |
| Task Type | Browser operations |
| Evaluation | Grader-based scoring |
| Difficulty | Medium-High |

Features

Competition-grade Tasks

High-difficulty tasks drawn from browser operation competitions

Comprehensive Skills

Tests a wide range of browser operation capabilities

Quick Start

Run Tasks

# Run first 3 tasks
bubench run \
  --agent browser-use \
  --data BrowseComp \
  --mode first_n \
  --count 3

# Run with Agent-TARS
bubench run \
  --agent Agent-TARS \
  --data BrowseComp \
  --mode first_n \
  --count 3
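
To run the same subset against several agents in sequence, the commands above can be generated from a script. The following is a minimal sketch, not part of bubench itself: the helper names `build_run_command` and `run_agents` are hypothetical, and only the flags shown above (`--agent`, `--data`, `--mode`, `--count`) are used. It assumes `bubench` is on your `PATH`.

```python
import shlex
import subprocess


def build_run_command(agent: str, data: str = "BrowseComp", count: int = 3) -> list[str]:
    """Build the argv for `bubench run`, using only the flags shown in this page."""
    return [
        "bubench", "run",
        "--agent", agent,
        "--data", data,
        "--mode", "first_n",
        "--count", str(count),
    ]


def run_agents(agents: list[str], dry_run: bool = True) -> list[list[str]]:
    """Print (and, unless dry_run is set, execute) one run command per agent."""
    commands = [build_run_command(agent) for agent in agents]
    for cmd in commands:
        print(shlex.join(cmd))  # show the exact command line
        if not dry_run:
            # Assumption: `bubench` is installed; check=True stops on the first failure.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default, so nothing is executed until dry_run=False is passed.
    run_agents(["browser-use", "Agent-TARS"])
```

Keeping `dry_run=True` as the default lets you inspect the generated commands before committing to a full benchmark run.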

Evaluate Results

bubench eval --agent browser-use --data BrowseComp --model-id bu-2-0

Data Loading

BrowseComp can load tasks from either local JSONL files or a HuggingFace download. To use HuggingFace:

bubench run --agent browser-use --data BrowseComp \
  --data-source huggingface

The HuggingFace parquet file is converted to JSONL in the HF cache before use.

Evaluation Metrics

| Metric | Description |
| --- | --- |
| Task Completion | Percentage of tasks completed |
| Accuracy | Result accuracy |
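
As an illustration of how the two metrics relate, here is a small sketch that computes them from a list of per-task results. It is not the bubench grader: the `TaskResult` type and its `completed` and `correct` fields are hypothetical, and the sketch assumes accuracy is measured over all attempted tasks (not only the completed ones), which this page does not specify.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    completed: bool  # hypothetical field: the agent finished the task
    correct: bool    # hypothetical field: the grader accepted the result


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Return both metrics as percentages of all tasks in `results`."""
    total = len(results)
    if total == 0:
        # Avoid division by zero when no tasks were run.
        return {"task_completion": 0.0, "accuracy": 0.0}
    completed = sum(r.completed for r in results)
    correct = sum(r.correct for r in results)
    return {
        "task_completion": 100.0 * completed / total,
        # Assumption: accuracy is taken over all tasks, not only completed ones.
        "accuracy": 100.0 * correct / total,
    }
```

Under this assumption a task that completes but produces the wrong result raises Task Completion without raising Accuracy, so the two numbers can differ.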

Data Format

Task data is stored in benchmarks/BrowseComp/data/. Each task is a JSON object (one per line in a JSONL file), shown here expanded for readability:
{
  "task_id": "browsecomp_001",
  "task": "Navigate to the website and complete the registration form",
  "expected_result": "Registration successful"
}
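
Before a run, it can be useful to check that a local task file matches this shape. The sketch below reads a JSONL file and verifies that each record carries the three fields shown above. The function names `parse_tasks` and `load_tasks` are hypothetical, and treating all three fields as required is an assumption based on the example record.

```python
import json
from pathlib import Path
from typing import Iterable

# Assumption: every record needs the three fields from the example above.
REQUIRED_FIELDS = ("task_id", "task", "expected_result")


def parse_tasks(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL lines into task dicts, raising ValueError on missing fields."""
    tasks = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        tasks.append(record)
    return tasks


def load_tasks(path: str) -> list[dict]:
    """Load and validate a JSONL task file from disk."""
    with Path(path).open(encoding="utf-8") as fh:
        return parse_tasks(fh)
```

Separating `parse_tasks` from `load_tasks` keeps the validation logic testable without touching the filesystem.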