This guide walks you through the complete workflow, from agent configuration to final evaluation results.

Overview

  1. Select an agent
  2. Configure the agent
  3. Choose a benchmark
  4. Run tasks
  5. Evaluate results
  6. Inspect outputs

1. Select an Agent

browseruse-bench supports multiple agents. Choose one based on your needs:
  • browser-use: Programmable browser agent with vision capabilities (Details)
  • Agent-TARS: Reasoning-focused agent via Node.js CLI (Details)

2. Configure the Agent

Copy the example config and edit it:
cp configs/agents/browser-use/config.yaml.example configs/agents/browser-use/config.yaml
Example configs/agents/browser-use/config.yaml:
# Model Configuration
MODEL_TYPE: OPENAI
MODEL_ID: gpt-4o
OPENAI_API_KEY: your_openai_api_key

# Browser Configuration
BROWSER_ID: Chrome-Local
LEXMOUNT_BROWSER_MODE: normal

# Agent Parameters
USE_VISION: true
MAX_STEPS: 50
TIMEOUT: 600
Reference config files:
  • configs/agents/browser-use/config.yaml.example
  • configs/agents/Agent-TARS/config.yaml.example
Agent config files are plain YAML; environment variables are not auto-substituted.
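Since the YAML is read literally, one workaround is to substitute values from your environment yourself before a run. A minimal sketch with sed (the placeholder string mirrors the example config; the file path here is an illustrative stand-in):

```shell
# Sketch: substitute a secret from the environment into the agent config,
# since bubench reads the YAML literally and does not expand variables.
OPENAI_API_KEY="${OPENAI_API_KEY:-sk-test-123}"                # normally exported beforehand
printf 'OPENAI_API_KEY: your_openai_api_key\n' > config.yaml   # stand-in for the real file
sed -i "s/your_openai_api_key/${OPENAI_API_KEY}/" config.yaml
cat config.yaml
```

Run this against the real configs/agents/<agent>/config.yaml after copying the .example file.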

3. Select a Benchmark

Choose a benchmark based on your evaluation needs:

LexBench-Browser

  • Evaluation Method: Visual assessment (screenshot sequence analysis)
  • Scoring: 0-100 scale, default threshold: 60
  • Use Case: Visual understanding and multi-step reasoning

Online-Mind2Web

  • Evaluation Method: WebJudge multi-round evaluation
  • Scoring: 3-point scale, default threshold: 3
  • Use Case: Web navigation and task completion

BrowseComp

  • Evaluation Method: Text answer accuracy
  • Scoring: Binary (correct/incorrect)
  • Use Case: Factual accuracy and information extraction
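The thresholds above decide per-task success. Exactly how bubench compares score to threshold is not documented here, so the following is only an illustrative sketch, assuming a score at or above the threshold counts as a pass:

```shell
# Illustrative pass/fail check against each benchmark's default threshold.
# The >= comparison is an assumption about how bubench applies its thresholds.
passes() {   # usage: passes <score> <threshold>
  if [ "$1" -ge "$2" ]; then echo pass; else echo fail; fi
}
passes 72 60   # LexBench-Browser: 0-100 scale, default threshold 60
passes 2 3     # Online-Mind2Web: 3-point scale, default threshold 3
```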

4. Run Tasks

Basic Command

bubench run \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --split All

All Parameters

  • --agent: Agent name. Defaults to default.agent in config.yaml (falls back to Agent-TARS).
  • --benchmark: Benchmark name. Defaults to default.benchmark in config.yaml (falls back to Online-Mind2Web).
  • --split: Dataset split. Defaults to All.
  • --data-source: Dataset source, local (default) or huggingface.
  • --force-download: Re-download the dataset. Only applies to huggingface.
  • --mode: Task selection mode: single, first_n, sample_n, specific, by_id, or all.
  • --count: Task count for first_n/sample_n. Defaults to 1.
  • --task-ids: Task IDs for specific mode, as a space-separated list.
  • --id: Single task ID for by_id mode (numeric ID field).
  • --timeout: Per-task timeout in seconds. Overrides TIMEOUT in the agent config.
  • --skip-completed: Skip tasks that already have results. Useful for resuming runs.
  • --agent-config: Custom agent config path. Defaults to configs/agents/<agent>/config.yaml.
  • --timestamp: Run or resume in a specific directory (YYYYMMDD_HHmmss).
  • --dry-run: Show the command without executing it, as a configuration check.
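--timestamp uses the same YYYYMMDD_HHmmss format as the run directories, and standard date can generate one. A sketch (the bubench resume command is shown but left commented out, not executed):

```shell
# Generate a run timestamp in the YYYYMMDD_HHmmss format that --timestamp expects.
ts="$(date +%Y%m%d_%H%M%S)"
echo "$ts"
# Resuming that run later might look like (sketch, not executed here):
# bubench run --agent browser-use --benchmark LexBench-Browser --split All \
#   --timestamp "$ts" --skip-completed
```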

Output Structure

Results are saved to:
experiments/{benchmark}/{split}/{agent}/{timestamp}/
├── tasks/
│   ├── <task_id>/
│   │   ├── result.json
│   │   └── trajectory/
│   │       ├── screenshot-1.png
│   │       └── ...
└── tasks_eval_result/
    └── *_summary.json
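Because each run lives in its own timestamp directory, and YYYYMMDD_HHmmss names sort chronologically, the newest run can be found with a plain sort. A sketch using mock directories:

```shell
# Locate the newest run for a benchmark/split/agent: timestamp directory
# names (YYYYMMDD_HHmmss) sort chronologically, so a plain sort suffices.
base=experiments/LexBench-Browser/All/browser-use
mkdir -p "$base/20240101_090000" "$base/20240315_180000"   # mock runs for illustration
latest="$base/$(ls "$base" | sort | tail -n 1)"
echo "$latest"
```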

Monitoring Progress

Log files are created under output/logs/run/. To follow the newest one:
tail -f "output/logs/run/$(ls -t output/logs/run | head -n 1)"

5. Evaluate Results

Run Evaluation

bubench eval \
  --benchmark LexBench-Browser \
  --agent browser-use \
  --split All
The command automatically finds the latest results for the specified agent and benchmark.

Evaluation Parameters

  • --model: Evaluation LLM model. Defaults to EVAL_MODEL_NAME, or gpt-4o.
  • --score-threshold: Success threshold. Defaults to 60 (LexBench) or 3 (others).
  • --force-reeval: Force re-evaluation. Defaults to false.
  • --timestamp: Evaluate a specific run. Defaults to the latest run (auto-detected).
  • --data-source: Dataset source (LexBench only). Defaults to local.
  • --force-download: Re-download the dataset (LexBench only). Defaults to false.

Output Files

Evaluation results are written to tasks_eval_result/:
  • Detailed Results: *_eval_results.json
  • Summary Statistics: *_summary.json

Review Results

cat experiments/{benchmark}/{split}/{agent}/{timestamp}/tasks_eval_result/*_summary.json
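To pull out a single statistic rather than the whole file, plain grep works. The success_rate field name and the mock file below are assumptions; check your actual *_summary.json for its schema:

```shell
# Pull a single statistic out of a summary file instead of reading the whole
# JSON. The "success_rate" field name is an assumption about the schema.
printf '{"total": 10, "passed": 7, "success_rate": 0.7}\n' > summary.json   # mock file
grep -o '"success_rate": [0-9.]*' summary.json | cut -d' ' -f2
```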

Complete Example

# 1. Configure agent (edit config file first)
vim configs/agents/browser-use/config.yaml

# 2. Run inference (first 10 tasks)
bubench run \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --split All \
  --mode first_n \
  --count 10

# 3. Evaluate results
bubench eval \
  --benchmark LexBench-Browser \
  --agent browser-use \
  --split All

# 4. View summary
ls -lh experiments/LexBench-Browser/All/browser-use/*/tasks_eval_result/

Common Issues

Timeout Errors

Problem: Tasks exceed the configured timeout.
Solution: Increase TIMEOUT in the agent config or pass --timeout.

Missing Screenshots (LexBench-Browser)

Problem: Evaluation fails due to missing screenshots.
Solution: Confirm tasks/<task_id>/trajectory/ contains screenshots and check the run logs for task failures.

Model API Errors

Problem: LLM API calls fail.
Solution: Verify the API keys in the agent config; for evaluation, check the .env values.
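Before a long run, a preflight check can fail fast on a missing key. A sketch using shell parameter expansion (which keys are required depends on your agent config and .env):

```shell
# Preflight check: abort early when a required key is unset or empty.
# ${VAR:?message} stops the script with the message if VAR is missing.
OPENAI_API_KEY="${OPENAI_API_KEY:-sk-test}"    # stand-in so this sketch runs
: "${OPENAI_API_KEY:?OPENAI_API_KEY is not set in the agent config or .env}"
echo "key present"
```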

Next Steps

  • Custom Benchmarks: Learn how to create your own benchmark (Guide)
  • Leaderboard: Submit results to the public leaderboard (Details)
  • Advanced Configuration: Explore advanced agent settings (Documentation)