Features
Multi-metric Comparison
Success rate, steps, time, token usage
Interactive UI
Filtering, sorting, and detailed views
Task-level Analysis
Inspect per-task execution details and trajectories
Error Analysis
Categorize and visualize failure cases
Quickstart
Generate leaderboard
Start server
UI Preview
Overview
Shows success rate, steps, and time for each Agent x Benchmark combination:
- Compare multiple agents
- Click a row to view task details
- Click error category bars to filter failures
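The per-combination numbers in the overview can be thought of as a group-by over per-task records. The following is a minimal sketch of that aggregation, not the project's actual implementation: the function name `summarize` and the record keys `agent`, `benchmark`, `steps`, and `time_s` are assumptions made for illustration; only `predicted_label` (1 = success, 0 = failure) comes from the evaluation output described in this document.

```python
from collections import defaultdict
from statistics import mean


def summarize(records):
    """Group per-task records by (agent, benchmark) and average the metrics.

    Each record is a dict with hypothetical keys: "agent", "benchmark",
    "predicted_label" (1 = success, 0 = failure), "steps", and "time_s".
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["agent"], rec["benchmark"])].append(rec)

    summary = {}
    for key, recs in groups.items():
        summary[key] = {
            # The mean of 0/1 labels is the fraction of successful tasks.
            "success_rate": mean(r["predicted_label"] for r in recs),
            "avg_steps": mean(r["steps"] for r in recs),
            "avg_time_s": mean(r["time_s"] for r in recs),
            "num_tasks": len(recs),
        }
    return summary


# Illustrative data only; these are not real results.
example_records = [
    {"agent": "agent_a", "benchmark": "bench_x", "predicted_label": 1, "steps": 4, "time_s": 10.0},
    {"agent": "agent_a", "benchmark": "bench_x", "predicted_label": 0, "steps": 6, "time_s": 20.0},
    {"agent": "agent_b", "benchmark": "bench_x", "predicted_label": 1, "steps": 3, "time_s": 8.0},
]
example_summary = summarize(example_records)
```

Keying the summary by the `(agent, benchmark)` tuple keeps one row per combination, which matches how the overview is laid out.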
Task details
Each task includes:
- Task ID and description
- Action history (expandable)
- Trajectory screenshots (paginated)
- Time and token statistics
- Evaluation results and error analysis
Submission format
If you want to submit your own results, use the following structure:
Directory structure
result.json format
Evaluation output
After submission, the system evaluates and generates:
- predicted_label: 1 = success, 0 = failure
- evaluation_details: score, grader response, failure category
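As a rough illustration of how these two fields could be consumed, the sketch below tallies failure categories across evaluated tasks. It assumes each evaluated task is available as a dict carrying `predicted_label` and a nested `evaluation_details` dict. The inner key name `failure_category`, the function name `failure_breakdown`, and the category strings are hypothetical; check them against the actual output files.

```python
from collections import Counter


def failure_breakdown(results):
    """Count failure categories among tasks with predicted_label == 0.

    Assumes a hypothetical "failure_category" key inside
    "evaluation_details"; tasks without it are counted as "unknown".
    """
    failures = [r for r in results if r["predicted_label"] == 0]
    return Counter(
        r.get("evaluation_details", {}).get("failure_category", "unknown")
        for r in failures
    )


# Illustrative data only; category names are made up for this sketch.
example_results = [
    {"predicted_label": 1, "evaluation_details": {"score": 1.0}},
    {"predicted_label": 0, "evaluation_details": {"score": 0.0, "failure_category": "navigation"}},
    {"predicted_label": 0, "evaluation_details": {"score": 0.0, "failure_category": "navigation"}},
    {"predicted_label": 0, "evaluation_details": {"score": 0.0}},
]
example_counts = failure_breakdown(example_results)
```

A tally of this kind is the sort of data an error-category bar chart would be drawn from.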