## Directory structure
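A sketch of a plausible layout, inferred from the files described in the steps below (`tasks.json`, `data_info.json`, `evaluator.py`, the `data/` directory); any names beyond those are assumptions:

```
benchmarks/
└── MyBenchmark/
    ├── data/
    │   ├── tasks.json        # task definitions (Step 1)
    │   └── tasks_easy.json   # optional pre-generated split file
    ├── data_info.json        # data file index (Step 2)
    └── evaluator.py          # optional custom evaluator (Step 3)
```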
Each benchmark should follow this layout:

## Step 1: Create task data
### tasks.json format

#### Required fields

| Field | Type | Description |
|---|---|---|
| `task_id` | string | Unique task ID |
| `task` | string | Task description |
#### Optional fields

| Field | Type | Description |
|---|---|---|
| `website` | string | Target website |
| `category` | string | Task category |
| `expected_result` | string | Expected outcome |
| `requires_login` | boolean | Whether login is required |
| `difficulty` | string | Difficulty level |
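A minimal `tasks.json` entry using the fields above (all values here are illustrative, not taken from a real benchmark):

```json
[
  {
    "task_id": "example-001",
    "task": "Find the current weather in Berlin",
    "website": "https://www.example.com",
    "category": "information_lookup",
    "expected_result": "A temperature reading for Berlin",
    "requires_login": false,
    "difficulty": "easy"
  }
]
```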
## Step 2: Create the data info file

### data_info.json
### Split file strategy

- Pre-generate subset files (e.g., `tasks_easy.json`) and list them under `split`.
- Each split entry points to a file relative to the benchmark's `data/` directory.
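Under the split strategy above, `data_info.json` might look like the following sketch; only the `split` key and `tasks_easy.json` come from this guide, the other key names are assumptions:

```json
{
  "default_file": "tasks.json",
  "split": {
    "easy": "tasks_easy.json",
    "hard": "tasks_hard.json"
  }
}
```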
## Step 3: Create an evaluator (optional)

If you need custom evaluation logic, add `evaluator.py`:
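The evaluator interface itself is not shown in this guide, so the class and method names below are assumptions; this sketch only illustrates the idea of scoring an agent's output against a task's `expected_result` field:

```python
# Hypothetical evaluator sketch -- the real interface expected by
# browseruse_bench is not documented here; names are assumptions.


class Evaluator:
    """Scores an agent's result against a task's expected outcome."""

    def evaluate(self, task: dict, result: str) -> float:
        # Example heuristic: score 1.0 if the task's expected_result
        # appears (case-insensitively) in the agent's output, else 0.0.
        expected = task.get("expected_result", "")
        if expected and expected.lower() in result.lower():
            return 1.0
        return 0.0
```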
## Step 4: Register the benchmark

Register in `browseruse_bench/benchmarks/__init__.py`:
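The actual registration API in `browseruse_bench/benchmarks/__init__.py` is not shown in this guide; as an illustration, a registry might be as simple as a name-to-path mapping (the names `BENCHMARK_REGISTRY` and `register_benchmark` are assumptions):

```python
# Hypothetical registration sketch for
# browseruse_bench/benchmarks/__init__.py -- the real registry API
# may differ; these names are assumptions.

BENCHMARK_REGISTRY: dict = {}


def register_benchmark(name: str, path: str) -> None:
    """Map a benchmark name to its directory under benchmarks/."""
    BENCHMARK_REGISTRY[name] = path


register_benchmark("MyBenchmark", "benchmarks/MyBenchmark")
```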
## Step 5: Test
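Before running the full benchmark, a quick sanity check on your task data can catch schema mistakes early. This is a sketch, not part of the framework; the field names follow the tables in Step 1, and the file path in the usage comment is an assumption:

```python
# Sanity-check task data against the required fields from Step 1.
import json


def validate_tasks(tasks: list) -> int:
    """Check required fields and task_id uniqueness; return task count."""
    seen = set()
    for t in tasks:
        assert "task_id" in t and "task" in t, f"missing required field: {t}"
        assert t["task_id"] not in seen, f"duplicate task_id: {t['task_id']}"
        seen.add(t["task_id"])
    return len(tasks)


# Example usage (path is an assumption):
#     with open("benchmarks/MyBenchmark/data/tasks.json") as f:
#         print(validate_tasks(json.load(f)), "tasks OK")
```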
## Full examples

Reference existing benchmarks:

- `benchmarks/LexBench-Browser/` - Full benchmark implementation
- `benchmarks/Online-Mind2Web/` - Mind2Web integration example
- `benchmarks/BrowseComp/` - Simple benchmark example