Online-Mind2Web

Online-Mind2Web is a benchmark for evaluating agents on web interaction tasks performed on real websites.

Overview

| Attribute | Value |
|-----------|-------|
| Source | Mind2Web dataset |
| Task Type | Web navigation and interaction |
| Websites | Real-world websites |

Quick Start

# Run tasks
uv run scripts/run.py --agent Agent-TARS --benchmark Online-Mind2Web --mode first_n --count 3

# Evaluate results
uv run scripts/eval.py --agent Agent-TARS --benchmark Online-Mind2Web
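
The two commands above can be chained so that evaluation only starts if the run step succeeds. The following is a minimal sketch, assuming a bash shell started from the repository root. The names `quick_start`, `run_step`, and `DRY_RUN` are hypothetical helpers introduced for illustration and are not part of the project. Only the flags already shown above are used.

```shell
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the two Quick Start commands.
# $1 = agent name (defaults to Agent-TARS), $2 = number of tasks (defaults to 3).
quick_start() {
  agent="${1:-Agent-TARS}"
  count="${2:-3}"
  benchmark="Online-Mind2Web"

  # Run the first N tasks; stop here if the run fails.
  run_step uv run scripts/run.py --agent "$agent" --benchmark "$benchmark" \
    --mode first_n --count "$count" || return 1

  # Evaluate the results produced by the run step.
  run_step uv run scripts/eval.py --agent "$agent" --benchmark "$benchmark"
}
```

Calling `quick_start Agent-TARS 3` executes both steps in order. Setting `DRY_RUN=1` beforehand prints the commands without executing them, which is useful for checking the arguments first.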

Evaluation

Results are evaluated with WebJudge, which applies semantic matching to the agent's actions.