Quick Start

Prerequisites

Python 3.11+
Node.js 18+ (required for Agent-TARS)
uv (the recommended Python package manager)
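
The prerequisites above can be checked before installing. The following is a minimal sketch, not part of the repository: the helper name `check_prerequisites` is hypothetical, and it only verifies that the Python version is high enough and that `node` and `uv` are on `PATH`. It does not verify that Node.js is version 18 or later.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 11), tools=("node", "uv")):
    """Return a dict mapping each prerequisite name to True or False."""
    # Compare only (major, minor) against the required minimum.
    report = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

A missing `node` only matters if you plan to use Agent-TARS.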

Installation

Clone the repository

git clone https://github.com/user/browseruse-bench.git
cd browseruse-bench

Install dependencies (Python ≥ 3.11)

With uv (recommended):

uv sync

Or with pip:

pip install -e .

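Configure API keys

cp .env.example .env
vim .env

Environment variables (OPENAI_API_KEY is required; the Lexmount variables are optional):

# Evaluation model API (used for automatic evaluation)
OPENAI_API_KEY=your_openai_api_key

# Lexmount cloud browser (optional, for running in the cloud)
LEXMOUNT_API_KEY=your_lexmount_api_key
LEXMOUNT_PROJECT_ID=your_project_id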
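
A common mistake is leaving the placeholder values from `.env.example` in place. The sketch below shows one way to detect that. It is an illustration, not the project's own loader: `parse_env` and `unset_keys` are hypothetical helpers, and the parser handles only plain `KEY=VALUE` lines (no quoting, no `export` prefix).

```python
REQUIRED = ("OPENAI_API_KEY",)
OPTIONAL = ("LEXMOUNT_API_KEY", "LEXMOUNT_PROJECT_ID")
# Placeholder values copied from the example above.
PLACEHOLDERS = {"your_openai_api_key", "your_lexmount_api_key", "your_project_id"}


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def unset_keys(values, names):
    """Return the names that are missing, empty, or still a placeholder."""
    not_set = PLACEHOLDERS | {""}
    return [name for name in names if values.get(name, "") in not_set]


if __name__ == "__main__":
    from pathlib import Path

    env_file = Path(".env")
    env = parse_env(env_file.read_text()) if env_file.exists() else {}
    print("required but unset:", unset_keys(env, REQUIRED))
    print("optional and unset:", unset_keys(env, OPTIONAL))
```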

Install Agent-TARS (optional)

npm install -g @agent-tars/[email protected]

Configure the agents

cp agents/Agent-TARS/config.yaml.example agents/Agent-TARS/config.yaml
cp agents/browser-use/config.yaml.example agents/browser-use/config.yaml

Edit the configuration files as needed and fill in your API keys.

Quick Run

Run your first benchmark

# Run the first 3 LexBench-Browser tasks (recommended: no login required)
uv run scripts/run.py \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --mode first_n \
  --count 3

Evaluate results

# Evaluate the LexBench-Browser results
uv run scripts/eval.py --agent browser-use --benchmark LexBench-Browser

# Use a custom score threshold
uv run scripts/eval.py --agent browser-use --benchmark LexBench-Browser --score-threshold 70

Logs: script execution logs are saved under output/logs/, in one subdirectory per script.

run.py: output/logs/run/
eval.py: output/logs/eval/
generate_leaderboard.py: output/logs/leaderboard/
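
When a run fails, the most recent log is usually the one you want. The sketch below finds the newest file in each log directory listed above. It assumes only that logs are regular files stored directly in those directories; the log file naming scheme is not documented here, and `latest_log` is a hypothetical helper.

```python
from pathlib import Path


def latest_log(log_dir):
    """Return the most recently modified file in log_dir, or None if there is none."""
    directory = Path(log_dir)
    if not directory.is_dir():
        return None
    files = [p for p in directory.iterdir() if p.is_file()]
    # default=None covers an existing but empty directory.
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    for script in ("run", "eval", "leaderboard"):
        print(script, "->", latest_log(Path("output/logs") / script))
```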

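Generate the leaderboard

# Collect all evaluation results automatically and generate an HTML leaderboard
uv run scripts/generate_leaderboard.py

# Start a local server to view it
uv run scripts/benchmark_server.py
# Then open http://localhost:8000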

Run modes

Mode	Description	Example
`first_n`	Run the first N tasks	`--mode first_n --count 5`
`sample_n`	Run N randomly sampled tasks	`--mode sample_n --count 10`
`specific`	Run the tasks with the given IDs	`--mode specific --task-ids id1,id2`
`all`	Run all tasks	`--mode all`
`single`	Run a single task	`--mode single --task-ids id1`
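
To make the semantics of the modes in the table concrete, here is an illustrative sketch of how each mode could map a task list to the tasks that get run. This is not the code in `scripts/run.py`: the function `select_tasks`, its parameters, and the `seed` argument are assumptions made for the example, and the real script may order or validate tasks differently.

```python
import random


def select_tasks(task_ids, mode, count=None, wanted=None, seed=None):
    """Illustrative task selection for each run mode (hypothetical helper)."""
    task_ids = list(task_ids)
    if mode == "all":
        return task_ids
    if mode in ("first_n", "sample_n"):
        if count is None:
            raise ValueError(f"{mode} requires a count")
        if mode == "first_n":
            return task_ids[:count]
        # Sample without replacement; cap at the number of available tasks.
        rng = random.Random(seed)
        return rng.sample(task_ids, min(count, len(task_ids)))
    if mode in ("specific", "single"):
        wanted_set = set(wanted or [])
        # Keep the benchmark's own ordering rather than the order given.
        chosen = [t for t in task_ids if t in wanted_set]
        if mode == "single" and len(chosen) != 1:
            raise ValueError("single mode expects exactly one matching task ID")
        return chosen
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    tasks = ["id1", "id2", "id3", "id4"]
    print(select_tasks(tasks, "first_n", count=3))
    print(select_tasks(tasks, "specific", wanted=["id1", "id2"]))
```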

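Common options

# --agent           agent name
# --benchmark       benchmark name
# --mode            run mode
# --count           number of tasks
# --timeout         timeout in seconds
# --skip-completed  skip tasks that are already completed
# --debug           debug mode
uv run scripts/run.py \
  --agent browser-use \
  --benchmark LexBench-Browser \
  --mode first_n \
  --count 5 \
  --timeout 300 \
  --skip-completed \
  --debug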
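
If you launch runs from a Python script, for example to loop over several agents, it can help to assemble the command as an argument list instead of a shell string. The sketch below uses only the flags shown in this guide; `build_run_command` is a hypothetical helper, not something the repository provides.

```python
import shlex


def build_run_command(agent, benchmark, mode, count=None, task_ids=None,
                      timeout=None, skip_completed=False, debug=False):
    """Build the argv list for scripts/run.py from the documented flags."""
    cmd = ["uv", "run", "scripts/run.py",
           "--agent", agent, "--benchmark", benchmark, "--mode", mode]
    if count is not None:
        cmd += ["--count", str(count)]
    if task_ids:
        # Task IDs are passed as one comma-separated value.
        cmd += ["--task-ids", ",".join(task_ids)]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if skip_completed:
        cmd.append("--skip-completed")
    if debug:
        cmd.append("--debug")
    return cmd


if __name__ == "__main__":
    cmd = build_run_command("browser-use", "LexBench-Browser", "first_n",
                            count=5, timeout=300, skip_completed=True)
    # Print the shell-quoted command; to execute it, pass cmd to
    # subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(cmd))
```

Passing a list avoids shell quoting problems and sidesteps line-continuation mistakes entirely.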

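Next steps

Supported Agents

Learn about the available browser agents

Benchmarks in Detail

Take a closer look at each benchmark

Cloud Browser Configuration

Use the Lexmount cloud browser

View the Leaderboard

Compare the performance of different agents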
