Supported Benchmarks
LexBench-Browser
Recommended: an evaluation benchmark for Chinese websites with 387 tasks (v2.0). Its L1 subset requires no login and is suited to quick runs.
Online-Mind2Web
Online evaluation based on the Mind2Web dataset, testing agents’ navigation and interaction capabilities on real websites.
BrowseComp
Browsing competition tasks that evaluate agents’ end-to-end browser operation capabilities.
Feature Comparison
| Benchmark | Tasks | Language | Evaluation | Login Required |
|---|---|---|---|---|
| LexBench-Browser | 387 | zh/en | LLM (visual) | Partial |
| Online-Mind2Web | 300 | English | WebJudge | No |
| BrowseComp | 1266 | English | Grader | No |
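As a rough illustration, the comparison table can be encoded as data to pick a benchmark by constraint, for example one that needs no login. This is a sketch only: `BenchmarkInfo`, `BENCHMARKS`, and `no_login_benchmarks` are hypothetical names for this example and are not part of the project's API. The values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkInfo:
    """One row of the feature comparison table (hypothetical helper type)."""
    name: str
    tasks: int
    language: str
    evaluation: str
    login_required: str  # "No" or "Partial", as written in the table


# Values transcribed from the table; not loaded from the project itself.
BENCHMARKS = [
    BenchmarkInfo("LexBench-Browser", 387, "zh/en", "LLM (visual)", "Partial"),
    BenchmarkInfo("Online-Mind2Web", 300, "English", "WebJudge", "No"),
    BenchmarkInfo("BrowseComp", 1266, "English", "Grader", "No"),
]


def no_login_benchmarks(benchmarks=BENCHMARKS):
    """Return the names of benchmarks whose tasks never require a login."""
    return [b.name for b in benchmarks if b.login_required == "No"]


if __name__ == "__main__":
    print(no_login_benchmarks())
```

LexBench-Browser is excluded by this filter because only some of its tasks run without login (the L1 subset).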
Quick Comparison Run
Data Location
All benchmark data is stored in the benchmarks/ directory:
| Benchmark | Data Directory |
|---|---|
| LexBench-Browser | benchmarks/LexBench-Browser/data/ |
| Online-Mind2Web | benchmarks/Online-Mind2Web/data/ |
| BrowseComp | benchmarks/BrowseComp/data/ |
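Before a run it can help to confirm that these directories are present. The sketch below resolves each path from the table against a repository root and reports any that are missing. The names `DATA_DIRS`, `data_dir`, and `missing_data_dirs` are assumptions made for this example, as is the idea that the paths are relative to the repository root. The project may ship its own loader.

```python
from pathlib import Path

# Relative data directories, copied from the table above.
DATA_DIRS = {
    "LexBench-Browser": "benchmarks/LexBench-Browser/data",
    "Online-Mind2Web": "benchmarks/Online-Mind2Web/data",
    "BrowseComp": "benchmarks/BrowseComp/data",
}


def data_dir(benchmark, repo_root="."):
    """Resolve a benchmark's data directory under repo_root (assumed layout)."""
    if benchmark not in DATA_DIRS:
        raise ValueError(f"Unknown benchmark: {benchmark!r}")
    return Path(repo_root) / DATA_DIRS[benchmark]


def missing_data_dirs(repo_root="."):
    """List benchmarks whose data directory does not exist under repo_root."""
    return [name for name in DATA_DIRS if not data_dir(name, repo_root).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing data for:", ", ".join(missing))
    else:
        print("All benchmark data directories found.")
```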
Planned Support
- More benchmarks
If you’d like to add a new benchmark, please refer to the Custom Benchmark guide.