Overview
- Select an agent
- Configure the agent
- Choose a benchmark
- Run tasks
- Evaluate results
- Inspect outputs
1. Select an Agent
browseruse-bench supports multiple agents. Choose one based on your needs.
2. Configure Agent
Copy the example config for your agent to configs/agents/<agent>/config.yaml and edit it:
- configs/agents/browser-use/config.yaml.example
- configs/agents/Agent-TARS/config.yaml.example
Agent config files are plain YAML; environment variables are not auto-substituted.
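As a rough sketch of what such a file might contain (the key names below are illustrative assumptions; the shipped .example file is authoritative):

```yaml
# Hypothetical layout -- consult the .example file for the actual schema.
model:
  name: gpt-4o            # model driving the agent
  api_key: "sk-your-key"  # write the literal value; env vars are not substituted
TIMEOUT: 600              # per-task timeout in seconds (see --timeout below)
```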
3. Select Benchmark
Choose a benchmark based on your evaluation needs.
LexBench-Browser
- Evaluation Method: Visual assessment (screenshot sequence analysis)
- Scoring: 0-100 scale, default threshold: 60
- Use Case: Visual understanding and multi-step reasoning
Online-Mind2Web
- Evaluation Method: WebJudge multi-round evaluation
- Scoring: 3-point scale, default threshold: 3
- Use Case: Web navigation and task completion
BrowseComp
- Evaluation Method: Text answer accuracy
- Scoring: Binary (correct/incorrect)
- Use Case: Factual accuracy and information extraction
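The three scoring schemes above can be summarized in a small sketch. This is illustrative only, not the harness's actual code:

```python
# Default success thresholds per benchmark, as documented above.
THRESHOLDS = {
    "LexBench-Browser": 60,   # 0-100 visual-assessment score
    "Online-Mind2Web": 3,     # 3-point WebJudge scale
}

def is_success(benchmark: str, score) -> bool:
    """Decide pass/fail according to each benchmark's scoring scheme."""
    if benchmark == "BrowseComp":          # binary: correct/incorrect
        return bool(score)
    return score >= THRESHOLDS[benchmark]  # threshold-based benchmarks

print(is_success("LexBench-Browser", 72))  # True
```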
4. Run Tasks
Basic Command
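The guide does not show the CLI entry point itself; assuming a run.py-style script (check the repository for the actual name), a basic invocation might look like:

```shell
# Hypothetical entry point -- substitute the repo's actual run script.
python run.py --agent browser-use --benchmark Online-Mind2Web --mode first_n --count 5
```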
All Parameters
| Parameter | Description | Notes |
|---|---|---|
| --agent | Agent name | Defaults to config.yaml default.agent (fallback: Agent-TARS) |
| --benchmark | Benchmark name | Defaults to config.yaml default.benchmark (fallback: Online-Mind2Web) |
| --split | Dataset split | Defaults to All |
| --data-source | Dataset source | local (default) or huggingface |
| --force-download | Re-download dataset | Only for huggingface |
| --mode | Task selection mode | single, first_n, sample_n, specific, by_id, all |
| --count | Task count for first_n/sample_n | Defaults to 1 |
| --task-ids | Task IDs for specific mode | Space-separated list |
| --id | Single task ID for by_id mode | Numeric ID field |
| --timeout | Timeout per task (seconds) | Overrides TIMEOUT in config |
| --skip-completed | Skip tasks with existing results | Useful for resuming runs |
| --agent-config | Custom agent config path | Defaults to configs/agents/<agent>/config.yaml |
| --timestamp | Run or resume in a specific directory | Format: YYYYMMDD_HHmmss |
| --dry-run | Show the command without executing | Useful as a configuration check |
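Several of these flags combine naturally when resuming an interrupted run. A sketch, again assuming a hypothetical run.py entry point and with a placeholder timestamp:

```shell
# Resume specific tasks in an existing run directory, skipping finished ones.
python run.py --benchmark Online-Mind2Web \
  --mode specific --task-ids 12 34 56 \
  --skip-completed --timestamp 20250101_120000 \
  --dry-run   # drop --dry-run once the printed command looks right
```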
Output Structure
Results are saved to a timestamped run directory under output/.
Monitoring Progress
Log files are created under output/logs/run/.
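To watch a run live, you can follow the log with a standard tool (the exact file-name pattern inside output/logs/run/ is an assumption):

```shell
# Follow the most recent run log as tasks execute.
tail -f output/logs/run/*.log
```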
5. Evaluate Results
Run Evaluation
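Assuming a hypothetical evaluate.py entry point (check the repository for the real script name), an evaluation run using the parameters below might look like:

```shell
# Hypothetical evaluation entry point with the LexBench defaults made explicit.
python evaluate.py --benchmark LexBench-Browser --model gpt-4o --score-threshold 60
```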
Evaluation Parameters
| Parameter | Description | Default |
|---|---|---|
| --model | Evaluation LLM model | EVAL_MODEL_NAME or gpt-4o |
| --score-threshold | Success threshold | 60 (LexBench), 3 (others) |
| --force-reeval | Force re-evaluation | false |
| --timestamp | Evaluate a specific run | Latest (auto-detected) |
| --data-source | Dataset source (LexBench only) | local |
| --force-download | Re-download dataset (LexBench only) | false |
Output Files
Evaluation results are written to tasks_eval_result/:
- Detailed Results: *_eval_results.json
- Summary Statistics: *_summary.json
Review Results
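The summary file's exact schema is not documented here; as a sketch, assuming it holds a list of per-task records with a numeric score field, you could compute a pass rate like this:

```python
import json

# Hypothetical schema: a list of {"task_id": ..., "score": ...} records,
# stood in for here by an inline string instead of a real *_summary.json.
summary = json.loads("""
[{"task_id": 1, "score": 80}, {"task_id": 2, "score": 40}, {"task_id": 3, "score": 65}]
""")

threshold = 60  # LexBench-Browser default
passed = sum(1 for t in summary if t["score"] >= threshold)
print(f"pass rate: {passed}/{len(summary)}")  # pass rate: 2/3
```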
Complete Example
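Putting the two phases together, an end-to-end session might look like the following (entry-point names are hypothetical; the flags are the documented ones):

```shell
# 1) Run a random sample of 10 tasks, then 2) evaluate the latest run.
python run.py --agent browser-use --benchmark Online-Mind2Web --mode sample_n --count 10
python evaluate.py --benchmark Online-Mind2Web --model gpt-4o
```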
Common Issues
Timeout Errors
Problem: Tasks exceed the configured timeout.
Solution: Increase TIMEOUT in the agent config or pass --timeout.
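For example, in the agent's config.yaml (assuming TIMEOUT is a top-level key, per the --timeout note above):

```yaml
TIMEOUT: 900  # seconds per task; --timeout on the command line overrides this
```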
Missing Screenshots (LexBench-Browser)
Problem: Evaluation fails due to missing screenshots.
Solution: Confirm tasks/<task_id>/trajectory/ contains screenshots and check the run logs for task failures.
Model API Errors
Problem: LLM API calls fail.
Solution: Verify API keys in the agent config; for evaluation, check .env values.
Next Steps
- Custom Benchmarks: Learn how to create your own benchmark (Guide)
- Leaderboard: Submit results to the public leaderboard (Details)
- Advanced Configuration: Explore advanced agent settings (Documentation)