Overview
Use Cases:
- You have results from custom agent implementations
- You want to use standardized evaluation metrics
- You're preparing to submit to the Leaderboard

The Evaluation Service provides:
- Standardized evaluation pipelines for each benchmark
- Consistent scoring metrics
- Detailed performance reports
Universal Data Requirements
Directory Structure
All benchmarks require this basic structure:

Universal result.json Format
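A plausible per-task layout, assuming the experiments directory convention mentioned later in this guide (all names are illustrative):

```
experiments/
└── <benchmark_name>/
    └── <task_id>/
        ├── result.json        # required by every benchmark
        └── trajectory/        # screenshots, where the benchmark requires them
            ├── 0.png
            └── 1.png
```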
All benchmarks require these base fields:
- `task_id`: Unique task identifier matching the benchmark
- `task`: Human-readable task description
- `answer`: Agent's final response or answer
- `model_id`: LLM model used (e.g., "gpt-4o", "claude-3.5")
- `browser_id`: Browser configuration (e.g., "Chrome-Local")
- `metrics`: Optional performance metrics
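Assembled, a minimal result.json with these base fields might look like the following; all values, and the keys inside `metrics`, are illustrative rather than prescribed:

```json
{
  "task_id": "example_task_001",
  "task": "Find the current weather in Berlin",
  "answer": "It is 18°C and partly cloudy in Berlin.",
  "model_id": "gpt-4o",
  "browser_id": "Chrome-Local",
  "metrics": {
    "steps": 12,
    "duration_seconds": 45.2
  }
}
```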
Benchmark-Specific Requirements
LexBench-Browser
Evaluation Method: Visual assessment using screenshot sequences
Scoring: 0-100 scale, default threshold: 60
Additional Requirements
- ✅ Required: Screenshot files in `trajectory/` directory
- ✅ Format: PNG or JPG images
- ✅ Naming: Sequential numbering (e.g., `0.png`, `1.png`, …)
Example result.json
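For LexBench-Browser the universal base fields are sufficient, since the evidence lives in the screenshots. A sketch with illustrative values:

```json
{
  "task_id": "lexbench_task_012",
  "task": "Add a laptop to the shopping cart and proceed to checkout",
  "answer": "Added the laptop to the cart and reached the checkout page.",
  "model_id": "gpt-4o",
  "browser_id": "Chrome-Local"
}
```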
Required Directory Structure
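Assuming the sequential screenshot naming described above, a task directory might look like this (names illustrative):

```
<task_id>/
├── result.json
└── trajectory/
    ├── 0.png
    ├── 1.png
    └── 2.png
```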
Evaluation Command
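The exact entry point depends on your installation; a hypothetical invocation using the flags documented under Step 2 (the script name and `--benchmark` flag are assumptions):

```bash
# Hypothetical: substitute your actual evaluation entry point.
python evaluate.py --benchmark LexBench-Browser --model gpt-4o --score-threshold 60
```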
Online-Mind2Web
Evaluation Method: WebJudge multi-round evaluation
Scoring: 3-point scale, default threshold: 3
Additional Requirements
- ✅ Required: `action_history` field in result.json
- ✅ Required: Screenshot files in `trajectory/` directory
- ✅ Format: Action history as string array
Example result.json
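A sketch including the required `action_history` string array; all values are illustrative:

```json
{
  "task_id": "mind2web_task_042",
  "task": "Book a one-way flight from SFO to JFK",
  "answer": "Booked a one-way flight departing at 9:05 AM.",
  "model_id": "gpt-4o",
  "browser_id": "Chrome-Local",
  "action_history": [
    "Navigated to the airline homepage",
    "Selected one-way trip and entered SFO to JFK",
    "Chose the 9:05 AM departure and confirmed the booking"
  ]
}
```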
Required Directory Structure
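As with LexBench-Browser, screenshots sit next to result.json (names illustrative):

```
<task_id>/
├── result.json
└── trajectory/
    ├── 0.png
    └── 1.png
```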
Evaluation Command
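A hypothetical invocation with the default 3-point threshold (script name and `--benchmark` flag are assumptions):

```bash
# Hypothetical: substitute your actual evaluation entry point.
python evaluate.py --benchmark Online-Mind2Web --model gpt-4o --score-threshold 3
```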
BrowseComp
Evaluation Method: Text answer accuracy comparison
Scoring: Binary (correct/incorrect)
Additional Requirements
- ❌ Not Required: Screenshots (text-only evaluation)
- ✅ Required: Complete `answer` field with full response
Example result.json
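Since BrowseComp is text-only, a result.json needs just the base fields with a complete `answer`; values are illustrative:

```json
{
  "task_id": "browsecomp_task_007",
  "task": "In which year was the first transatlantic telegraph cable completed?",
  "answer": "The first transatlantic telegraph cable was completed in 1858.",
  "model_id": "gpt-4o",
  "browser_id": "Chrome-Local"
}
```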
Minimal Directory Structure
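With no screenshots required, the layout reduces to a single file:

```
<task_id>/
└── result.json
```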
Evaluation Command
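A hypothetical invocation (script name and `--benchmark` flag are assumptions; no score threshold applies to binary scoring):

```bash
# Hypothetical: substitute your actual evaluation entry point.
python evaluate.py --benchmark BrowseComp --model gpt-4o
```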
Evaluation Process
Step 1: Prepare Your Data
- Organize Results: Structure your data according to the requirements above
- Place in Standard Location: Copy to the appropriate experiments directory
- Verify Format: Ensure all required fields are present
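As a pre-flight check for Step 1, a small validator along these lines can catch missing fields and screenshots before you run the evaluation. This is a sketch built on the universal format described above, not part of the service itself; the function name and flags are my own.

```python
import json
from pathlib import Path

# Base fields required by every benchmark (see "Universal result.json Format").
REQUIRED_FIELDS = ("task_id", "task", "answer", "model_id", "browser_id")

def validate_result(task_dir, needs_screenshots=False, needs_action_history=False):
    """Return a list of problems found; an empty list means the layout looks OK."""
    task_path = Path(task_dir)
    result_file = task_path / "result.json"
    if not result_file.is_file():
        return [f"missing {result_file}"]
    data = json.loads(result_file.read_text())
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in data]
    # Online-Mind2Web additionally requires an action_history string array.
    if needs_action_history and "action_history" not in data:
        problems.append("missing field: action_history")
    # LexBench-Browser and Online-Mind2Web need screenshots in trajectory/.
    if needs_screenshots:
        traj = task_path / "trajectory"
        if not (list(traj.glob("*.png")) + list(traj.glob("*.jpg"))):
            problems.append("no screenshots in trajectory/")
    return problems
```

For Online-Mind2Web you would pass both flags; the function deliberately reads only local files and makes no API calls, so it is safe to run repeatedly before an evaluation.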
Step 2: Run Evaluation
Execute the evaluation command with the following options:
- `--model`: Evaluation LLM (default: `EVAL_MODEL_NAME`, fallback: `gpt-4o`)
- `--score-threshold`: Custom success threshold
- `--force-reeval`: Force re-evaluation of existing results
Step 3: Review Results
Evaluation generates two output files in the `tasks_eval_result/` directory:
1. Detailed Results (`*_eval_results.json`)
2. Summary (`*_summary.json`)
Comparison Table
| Feature | LexBench-Browser | Online-Mind2Web | BrowseComp |
|---|---|---|---|
| Screenshots Required | ✅ Yes | ✅ Yes | ❌ No |
| Action History Required | ❌ No | ✅ Yes | ❌ No |
| Evaluation Type | Visual (LLM) | Multi-round (LLM) | Text comparison (LLM) |
| Scoring Scale | 0-100 | 1-3 | Binary |
| Default Threshold | 60 | 3 | N/A |
Difference from Leaderboard
| Feature | Evaluation Service | Leaderboard |
|---|---|---|
| Purpose | Evaluation tool | Results display |
| Function | Process data → Generate metrics | Browse & compare results |
| Interaction | Submit data for evaluation | Read-only viewing |
| Output | Detailed evaluation reports | Rankings & trends |
💡 Tip: After evaluation, you can submit your results to the Leaderboard for public comparison with other models.
Best Practices
✅ Do’s
- Validate Format: Double-check all required fields before evaluation
- Use Consistent IDs: Ensure `task_id` matches benchmark tasks exactly
- Include Metrics: Add performance metrics for richer analysis
- Document Model: Specify `model_id` and `browser_id` for reproducibility
❌ Don’ts
- Don’t mix formats: Each benchmark has specific requirements
- Don’t skip screenshots: LexBench-Browser and Online-Mind2Web need them
- Don’t modify benchmark data: Use original task definitions
- Don’t ignore errors: Address validation errors before proceeding
Troubleshooting
"Task ID not found in benchmark"
Cause: Your `task_id` doesn't match any benchmark task
Solution: Use exact task IDs from the benchmark's `tasks.json`
"Missing action_history field"
Cause: Online-Mind2Web requires action history
Solution: Add an `action_history` array to your result.json
"No screenshots found"
Cause: `trajectory/` directory is missing or empty
Solution: Ensure screenshots are in `task_dir/trajectory/*.png`
"Evaluation failed with API error"
Cause: Evaluation LLM API issues
Solution: Check `OPENAI_API_KEY` and network connectivity
Next Steps
- View Examples: See Complete Workflow for end-to-end guide
- Submit to Leaderboard: Share results publicly (Guide)
- Custom Benchmarks: Create your own evaluation tasks (Documentation)