This page shows how to evaluate LexBench-Browser results.

Evaluation Command

uv run scripts/eval.py \
  --agent <agent_name> \
  --benchmark LexBench-Browser \
  [options]

Evaluation Strategies

stepwise

Evaluates the full trajectory, scoring every screenshot step by step:
uv run scripts/eval.py \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --eval_strategy stepwise

final

Evaluates only the final screenshot for efficiency:
uv run scripts/eval.py \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --eval_strategy final
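
The two strategies differ only in which screenshots are handed to the grader. Below is a minimal sketch of the idea in Python; the helper is hypothetical (not the actual eval.py code) and assumes screenshots are ordered from first to last step:

def select_screenshots(screenshots, eval_strategy):
    # stepwise: grade every screenshot in the trajectory
    # final: grade only the last screenshot, trading per-step detail for speed
    if eval_strategy == "stepwise":
        return screenshots
    return screenshots[-1:]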

Score Threshold

LexBench-Browser uses a 0-100 scoring system:
# Use the default threshold of 60
uv run scripts/eval.py --agent browser-use --benchmark LexBench-Browser

# Use a stricter threshold of 70
uv run scripts/eval.py --agent browser-use --benchmark LexBench-Browser --score-threshold 70
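
The threshold turns the 0-100 score into a pass/fail outcome. A minimal sketch, assuming scores at or above the threshold count as success (this mirrors the predicted_label field described under Results, but is not the actual implementation):

def to_predicted_label(score: int, score_threshold: int = 60) -> int:
    # 1 = success when the score meets the threshold, otherwise 0 = failure
    return 1 if score >= score_threshold else 0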

Evaluate Data Subset

# Evaluate the no_login subset
uv run scripts/eval.py \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --split no_login

Results

Results are saved in experiments/LexBench-Browser/<Agent>/<Timestamp>/tasks_eval_result/. Each task result contains:
  • predicted_label: 1 = success, 0 = failure
  • score: 0-100
  • grader_response: Detailed evaluation response
  • failure_category: Failure category (if applicable)
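
To aggregate these per-task files, a short script can average predicted_label across all results. This is a sketch that assumes each task result is stored as a standalone JSON file with the fields listed above; adjust the glob pattern if the actual layout differs:

import json
from pathlib import Path

def success_rate(result_dir: str) -> float:
    # Average predicted_label (1 = success, 0 = failure) over all task results
    labels = []
    for path in Path(result_dir).glob("*.json"):
        with open(path) as f:
            labels.append(json.load(f)["predicted_label"])
    return sum(labels) / len(labels) if labels else 0.0

# Substitute the actual agent name and run timestamp for the placeholders
print(success_rate("experiments/LexBench-Browser/<Agent>/<Timestamp>/tasks_eval_result"))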