This page describes how to evaluate agent results on the Online-Mind2Web benchmark.

Evaluation Command

uv run scripts/eval.py \
  --agent <agent_name> \
  --benchmark Online-Mind2Web \
  [options]
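
To see the full list of supported options, the script can likely be invoked with a help flag; this is an assumption based on standard CLI conventions, not something this page documents:

uv run scripts/eval.py --help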

Basic Evaluation

uv run scripts/eval.py \
  --agent browser-use \
  --benchmark Online-Mind2Web

Evaluate a Data Subset

# Evaluate the Hard30 subset
uv run scripts/eval.py \
  --agent browser-use \
  --benchmark Online-Mind2Web \
  --version 20251214 \
  --split Hard30

Force Re-evaluation

uv run scripts/eval.py \
  --agent browser-use \
  --benchmark Online-Mind2Web \
  --force-reeval
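
The options above can be combined in a single invocation. A sketch, using only the flags shown on this page and assuming they compose as usual, that forces re-evaluation of the Hard30 subset:

uv run scripts/eval.py \
  --agent browser-use \
  --benchmark Online-Mind2Web \
  --version 20251214 \
  --split Hard30 \
  --force-reeval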

Results

Results are saved in experiments/Online-Mind2Web/<Agent>/<Timestamp>/tasks_eval_result/, where <Agent> is the evaluated agent's name and <Timestamp> identifies the run.
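
For example, to list the per-task result files across all recorded runs (assuming <Agent> expands to the --agent value, browser-use here; the wildcard matches any run timestamp):

ls experiments/Online-Mind2Web/browser-use/*/tasks_eval_result/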