Local Default
Uses local data by default. HuggingFace mode downloads into the HF cache (~/.cache/huggingface).
JSONL Format
Uses efficient JSONL format for data storage, supporting streaming processing.
Data Source Configuration
1. Standard Benchmarks
For LexBench-Browser and Online-Mind2Web, configure HuggingFace details in benchmarks/{benchmark}/data/data_info.json.
Local data is stored under benchmarks/{benchmark}/data/ and can include subdirectories
(e.g., LexBench-Browser/, LexBench-Online_Mind2Web/, or date folders).
2. BrowseComp (Local or HuggingFace)
BrowseComp supports local JSONL files or HuggingFace downloads. When using HuggingFace, the parquet file is downloaded into the HF cache and converted to JSONL for use. The HuggingFace settings are:
- hf_repo_id: Dataset repo ID.
- hf_path_prefix: Subdirectory inside the repo (e.g., data).
- hf_filename: Parquet file name.
- hf_revision (optional): Repo revision.
- hf_private (optional): Set to true if the repo requires a token.
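The fields above map naturally onto the keyword arguments of huggingface_hub's hf_hub_download. The following is a minimal sketch of that mapping, not bubench's actual implementation: the function names, the example repo ID, and the choice to read the token from HF_TOKEN and to convert with pandas are all assumptions made for illustration.

```python
import os
from typing import Any, Mapping


def build_download_kwargs(info: Mapping[str, Any],
                          env: Mapping[str, str] = os.environ) -> dict:
    """Translate hf_* config fields into hf_hub_download keyword arguments.

    Hypothetical helper: the field names come from the documentation,
    the mapping itself is an assumption.
    """
    kwargs = {
        "repo_id": info["hf_repo_id"],
        "filename": info["hf_filename"],
        "repo_type": "dataset",
    }
    # Optional fields are only passed when they are set.
    if info.get("hf_path_prefix"):
        kwargs["subfolder"] = info["hf_path_prefix"]
    if info.get("hf_revision"):
        kwargs["revision"] = info["hf_revision"]
    if info.get("hf_private"):
        token = env.get("HF_TOKEN")
        if not token:
            raise RuntimeError("hf_private is true but HF_TOKEN is not set")
        kwargs["token"] = token
    return kwargs


def download_and_convert(info: Mapping[str, Any], jsonl_path: str,
                         force_download: bool = False) -> str:
    """Download the parquet file into the HF cache and write it out as JSONL.

    Requires huggingface_hub and pandas (with a parquet engine such as
    pyarrow); both are imported lazily so the helper above works without them.
    """
    from huggingface_hub import hf_hub_download
    import pandas as pd

    parquet_path = hf_hub_download(force_download=force_download,
                                   **build_download_kwargs(info))
    # One JSON object per line, matching the JSONL layout used elsewhere.
    pd.read_parquet(parquet_path).to_json(jsonl_path, orient="records", lines=True)
    return jsonl_path


# Example config with a placeholder repo ID (not a real dataset).
example_info = {
    "hf_repo_id": "example-org/example-dataset",
    "hf_path_prefix": "data",
    "hf_filename": "test.parquet",
}
print(build_download_kwargs(example_info, env={}))
```

Keeping the argument construction separate from the network call makes the configuration handling easy to check without downloading anything.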
CLI Usage
bubench run and bubench eval support the --data-source argument to control data loading behavior:
| Mode | Description |
|---|---|
| local (Default) | Uses local files. Errors if files are missing. Suitable for offline usage. |
| huggingface | Downloads from HuggingFace and uses the HF cache (default ~/.cache/huggingface). |
| --force-download | With huggingface, forces a re-download into the HF cache. |
- Local and HuggingFace storage are separate. HF downloads stay in the cache and are not copied into benchmarks/....
- --force-download only applies to huggingface mode.
- BrowseComp HuggingFace data is parquet and is converted to JSONL in the HF cache.
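The rules in the table and notes above can be sketched as two small checks. This is an illustrative reading of the documented behavior, not bubench's code: the function names are hypothetical, and treating --force-download as silently ignored in local mode (rather than rejected) is an assumption.

```python
from pathlib import Path
from typing import Tuple

VALID_SOURCES = ("local", "huggingface")


def effective_options(data_source: str = "local",
                      force_download: bool = False) -> Tuple[str, bool]:
    """Return the data source and the force-download flag that actually applies."""
    if data_source not in VALID_SOURCES:
        raise ValueError(f"unknown data source: {data_source!r}")
    # --force-download only has meaning in huggingface mode.
    return data_source, force_download and data_source == "huggingface"


def require_local_file(path) -> Path:
    """Local mode: fail early when the expected data file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"local data file not found: {path}")
    return path
```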
Run Examples
Evaluation Examples (LexBench-Browser)
bubench eval passes --data-source only for LexBench-Browser. Other benchmarks use results files or local paths.
Environment Variables
When using private datasets, you must configure the HF_TOKEN environment variable.
The HF cache is stored in ~/.cache/huggingface by default. You can override this with HF_HOME or HF_HUB_CACHE.
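To see where downloads will land, the override order can be sketched as follows. This is a simplified approximation of how huggingface_hub resolves its hub cache (HF_HUB_CACHE first, then the hub subdirectory of HF_HOME, then the default); it ignores XDG_CACHE_HOME and legacy variables, and the function name is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hub_cache(env: Optional[Mapping[str, str]] = None) -> Path:
    """Approximate the directory where hub downloads are cached."""
    env = os.environ if env is None else env
    # An explicit hub cache path takes priority.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # Otherwise the hub cache is the "hub" folder under HF_HOME.
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


print(resolve_hub_cache())
```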
Data Format
JSONL Format
To improve efficiency with large files, we use JSONL (JSON Lines) format, where each line is an independent JSON object (for example, tasks.jsonl).
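Because each line is a complete JSON object, a file can be processed one record at a time without loading it fully into memory. A minimal sketch of such a streaming reader is shown below; the function name is hypothetical, and the task schema is not assumed, so each record is returned as a plain dict.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a bad record is easy to find.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
```

A caller can then loop with `for task in iter_jsonl("tasks.jsonl"): ...` and stop early without reading the rest of the file.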
Directory Structure
Troubleshooting
Authentication error for private datasets
Error: Private HuggingFace dataset requires authentication
Solution: Ensure the HF_TOKEN environment variable is set.
Slow download speed
Option 1: If you are in mainland China, use an HF mirror.
Option 2: Manually download files and place them in the corresponding benchmarks/{name}/data/{split_path} directory.
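For Option 1, huggingface_hub reads its endpoint from the HF_ENDPOINT environment variable, so pointing it at a mirror is one way to redirect downloads. The sketch below assumes bubench downloads through huggingface_hub; the helper name is hypothetical and the mirror URL is a placeholder, not a recommendation of a specific host.

```python
import os
from typing import MutableMapping, Optional


def use_hf_mirror(endpoint: str,
                  env: Optional[MutableMapping[str, str]] = None) -> str:
    """Set HF_ENDPOINT so later hub downloads go through the given mirror.

    Call this before huggingface_hub is imported, since the endpoint is
    read when the library loads.
    """
    env = os.environ if env is None else env
    endpoint = endpoint.rstrip("/")
    if not endpoint.startswith(("https://", "http://")):
        raise ValueError(f"endpoint must be an http(s) URL: {endpoint!r}")
    env["HF_ENDPOINT"] = endpoint
    return endpoint


# Placeholder URL: replace with the mirror you actually use.
use_hf_mirror("https://mirror.example.com/", env={})
```

Setting the same variable in the shell before launching the tool has the same effect.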