browseruse-bench integrates agents through a BaseAgent interface. The framework handles task loading, CLI parsing, workspaces, and result persistence — you only need to implement run_task() and register your agent.

How existing agents are integrated

1. browser-use — Python SDK (in-process)

Interface: directly imports the browser_use Python package; runs async in-process.

Flow: task → BrowserUseAgent.run_task() → create LLM instance (OpenAI/Gemini/Anthropic, etc.) → open_browser_session() to open a browser session → browser_use.Agent(task, llm, browser).run() → parse history, return AgentResult

Advantage: deepest integration — provides token usage, per-step screenshots, and full action history.
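The in-process flow boils down to awaiting the agent's coroutine under a deadline and mapping a timeout to agent_done="timeout". A minimal sketch of that pattern (the fake_agent coroutine and the result dict are illustrative stand-ins, not the browser_use API):

```python
import asyncio


async def run_with_deadline(agent_coro, timeout_s: float) -> dict:
    """Await an agent coroutine, mapping a timeout to agent_done='timeout'."""
    try:
        answer = await asyncio.wait_for(agent_coro, timeout=timeout_s)
        return {"agent_done": "done", "answer": answer}
    except asyncio.TimeoutError:
        return {"agent_done": "timeout", "answer": ""}


async def fake_agent() -> str:
    # Stand-in for browser_use.Agent(task, llm, browser).run()
    await asyncio.sleep(0)
    return "42"


result = asyncio.run(run_with_deadline(fake_agent(), timeout_s=5.0))
```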

2. Skyvern — Python SDK + embedded service

Interface: imports skyvern; requires local PostgreSQL for auth; supports both local and cloud modes.

Flow: task → SkyvernAgent.prepare() (init DB auth) → SkyvernAgent.run_task() → Skyvern.local() or Skyvern(api_key=...) → skyvern.run_task(prompt, engine, ...) → poll run_id until complete → collect screenshots from the artifacts dir, return AgentResult

Note: heaviest dependency (requires PostgreSQL), but allows substituting your own LLM for Skyvern’s cloud model.
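The poll-until-complete step is a generic pattern; a sketch under the assumption that run status can be queried as a string (the get_status callable stands in for Skyvern's run-status query, and the state names are illustrative):

```python
import time
from typing import Callable


def poll_until_complete(get_status: Callable[[], str],
                        interval_s: float = 0.0,
                        max_polls: int = 100) -> str:
    """Poll a status callable until it leaves the 'running' state."""
    for _ in range(max_polls):
        status = get_status()
        if status != "running":
            return status
        time.sleep(interval_s)
    return "timeout"


# Fake run that completes on the third poll.
states = iter(["running", "running", "completed"])
final = poll_until_complete(lambda: next(states))
```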

3. Agent-TARS — subprocess CLI

Interface: invokes the agent-tars run --input "..." CLI tool; fully black-box.

Flow: task → AgentTARSAgent.run_task() → assemble CLI args → subprocess.Popen(["agent-tars", "run", ...]) → wait for process exit (with timeout) → parse event-stream.jsonl to extract actions/metrics → return AgentResult

Advantage: lightest weight — no need to understand agent internals; only requires an executable CLI.
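Extracting actions and the final answer from a JSONL event stream can be sketched like this (the event schema shown here is illustrative, not Agent-TARS's actual format):

```python
import json


def parse_event_stream(lines):
    """Extract action names and a final answer from JSONL event lines."""
    actions, answer = [], ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") == "action":
            actions.append(event.get("name", ""))
        elif event.get("type") == "result":
            answer = event.get("answer", "")
    return actions, answer


# Sample event stream, as it would be read from event-stream.jsonl.
sample = [
    '{"type": "action", "name": "click"}',
    '{"type": "action", "name": "type_text"}',
    '{"type": "result", "answer": "done"}',
]
actions, answer = parse_event_stream(sample)
```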

What a new agent must expose / how to integrate

Required interface (choose one)

  • Python SDK: pip install-able package with a run(task)-style API that returns a result
  • HTTP API: REST interface to submit tasks and poll for results
  • CLI: executable command that accepts a task argument and writes a result file on exit
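For the CLI route, the wrapper pattern is a subprocess call with a timeout whose expiry maps to agent_done="timeout". A sketch (the command here is a stand-in, not a real agent binary):

```python
import subprocess
import sys


def run_cli_agent(cmd, timeout_s: float = 300.0):
    """Run an agent CLI, capturing stdout; an expired deadline becomes 'timeout'."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
        status = "done" if proc.returncode == 0 else "error"
        return status, proc.stdout
    except subprocess.TimeoutExpired:
        return "timeout", ""


# Stand-in command that just prints an answer.
status, out = run_cli_agent([sys.executable, "-c", "print('final answer')"])
```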

Required output (AgentResult)

AgentResult(
    task_id=...,
    env_status="success" | "failed",   # whether the environment is healthy
    agent_done="done" | "timeout" | "error" | "max_steps",
    answer="...",          # agent's final answer (used for evaluation)
    action_history=[...],  # list of action steps
    screenshots=[...],     # list of trajectory screenshot filenames
    metrics=AgentMetrics(
        end_to_end_ms=...,
        steps=...,
    ),
)

Files to create

browseruse_bench/agents/<agent_module>.py   # BaseAgent subclass
browseruse_bench/agents/__init__.py         # add import
config.yaml                                # root config registration (recommended)
configs/agent_registry.yaml                # register venv, entrypoint, supported_benchmarks
pyproject.toml                             # optional: add new optional dependency

Two integration methods

1. Code Agent integration

You don’t need to implement it manually — just tell the Code Agent what you want to integrate, referencing an existing agent’s integration pattern.

Example: integrating Claude Code as a browser agent

Here is an example prompt given to a Code Agent:
I want to integrate claude-code as a browser agent in my evaluation framework, referencing Agent-TARS’s subprocess CLI approach, with the following constraints:
  • Follow Agent-TARS’s subprocess CLI integration pattern
  • Give me prerequisites first; proceed with the implementation only after I confirm they’re installed
  • Add config entries in the root config.yaml and register the agent environment in configs/agent_registry.yaml
  • The output directory must include result.json (agent result), trajectory/ (screenshots), and api_logs/ (intermediate steps)
  • Additional constraints:
    • Output JSON events line by line; type=result contains the final answer, type=user contains tool results (including screenshot base64)
    • Restrict the agent to Playwright browser tools only, to prevent it from completing tasks via search APIs
    • Skip permission prompts in non-interactive mode
  • Provide a run example after integration is complete
The following is the Code Agent’s output (note: this is not the full output, just selected excerpts):

Prerequisites:
# 1. Install Claude Code CLI
npm install -g @anthropic-ai/claude-code

# 2. Authenticate
claude auth login

# 3. Add Playwright MCP at user scope (so subprocesses can find it)
claude mcp add playwright --scope user -- npx @playwright/mcp@latest
Implementation script

Code changes:
  • Added browseruse_bench/agents/claude_code.py
  • Modified browseruse_bench/agents/__init__.py
  • Registered in config.yaml:
agents:
  my-code-agent:
    active_model: default
    models:
      default:
        model_id: your-model-id   # note: model_id, not model
        api_key: $YOUR_API_KEY
    defaults:
      max_turns: 50
      timeout: 300
      allowed_tools: "mcp__playwright*"
Run script
bubench run \
  --agent claude-code \
  --benchmark LexBench-Browser \
  --mode first_n \
  --count 1

2. Manual integration

Use manual integration for self-developed agents, Python SDK agents, or HTTP API agents that must implement browser control themselves.

Integration steps

Step 1: Implement the agent

# browseruse_bench/agents/my_agent.py
from __future__ import annotations

import logging
from pathlib import Path
from typing import Any, Dict

from browseruse_bench.agents.base import BaseAgent
from browseruse_bench.agents.registry import register_agent
from browseruse_bench.schemas import AgentMetrics, AgentResult

logger = logging.getLogger(__name__)


@register_agent
class MyAgent(BaseAgent):
    name = "my-agent"

    def run_task(
        self,
        task_info: Dict[str, Any],
        agent_config: Dict[str, Any],
        task_workspace: Path,
    ) -> AgentResult:
        task_id = task_info.get("task_id", "unknown")
        task_text = task_info.get("task_text", "")

        # ... agent logic ...

        return AgentResult(
            task_id=task_id,
            env_status="success",
            agent_done="done",
            answer="Final answer here",
            action_history=[],
            metrics=AgentMetrics(end_to_end_ms=0, steps=0),
        )

Step 2: Ensure the module is imported

Registration happens at import time. Add to browseruse_bench/agents/__init__.py:
from browseruse_bench.agents import my_agent  # noqa: F401
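Conceptually, register_agent is a decorator that records the class in a module-level dict keyed by its name attribute, which is why the import is all that is needed. A sketch of that pattern (the real registry in browseruse_bench may differ):

```python
AGENT_REGISTRY: dict = {}


def register_agent(cls):
    """Record an agent class under its `name` attribute at import time."""
    AGENT_REGISTRY[cls.name] = cls
    return cls


@register_agent
class DemoAgent:
    name = "demo-agent"
```

Because the decorator runs when the module is imported, forgetting the import in __init__.py means the agent silently never appears in the registry.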

Step 3: Register in config.yaml

Add model config under agents in the root config.yaml:
agents:
  my-agent:
    active_model: default
    models:
      default:
        model_id: gpt-4.1
        api_key: $OPENAI_API_KEY
    defaults:
      timeout: 300
      max_steps: 50
Also register metadata in configs/agent_registry.yaml:
my-agent:
  path: browseruse_bench/agents
  entrypoint: browseruse_bench/runner/agent_runner.py
  venv: .venv
  supported_benchmarks:
    - Online-Mind2Web
    - LexBench-Browser

Step 4: Run a quick test

bubench run \
  --agent my-agent \
  --benchmark Online-Mind2Web \
  --mode first_n \
  --count 1

Reference: field descriptions

What is task_info?

A dictionary loaded from the benchmark dataset with normalized fields:
  • task_id (string)
  • task_text (string)
  • url (string)
  • prompt (string, optional)

What is agent_config?

Loaded from agents.<agent>.models[active_model] in the root config.yaml (or via --agent-config). The framework injects timeout_seconds.
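Values like $OPENAI_API_KEY in config.yaml are environment-variable references. A sketch of how such references can be resolved (this resolver is purely illustrative; the framework's actual substitution logic may differ):

```python
import os


def resolve_env_refs(config: dict) -> dict:
    """Replace '$VAR' string values with the named environment variable."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, str) and value.startswith("$"):
            resolved[key] = os.environ.get(value[1:], "")
        else:
            resolved[key] = value
    return resolved


os.environ["DEMO_API_KEY"] = "sk-test"
cfg = resolve_env_refs({"model_id": "gpt-4.1", "api_key": "$DEMO_API_KEY"})
```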

What is task_workspace?

The per-task output directory: <output_dir>/tasks/<task_id>/. Store screenshots, logs, and intermediate artifacts here; the framework writes result.json to the same directory.
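Writing artifacts into the workspace is plain pathlib work; a sketch (the agent.log filename is illustrative, while trajectory/ matches the layout used in the Code Agent example above):

```python
import tempfile
from pathlib import Path


def save_artifacts(task_workspace: Path, answer: str) -> Path:
    """Create trajectory/ and write a small log beside where result.json goes."""
    traj = task_workspace / "trajectory"
    traj.mkdir(parents=True, exist_ok=True)
    log = task_workspace / "agent.log"
    log.write_text(f"final answer: {answer}\n")
    return log


# Simulate <output_dir>/tasks/<task_id>/ with a temp directory.
workspace = Path(tempfile.mkdtemp()) / "tasks" / "task-001"
log_path = save_artifacts(workspace, "42")
```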

Browser backend constraints

If your agent needs browser access (manual integration mode), follow the unified backend contract:
from browseruse_bench.browsers import open_browser_session

browser_id = agent_config.get("browser_id") or "Chrome-Local"
with open_browser_session(browser_id=browser_id, agent_name=self.name, agent_config=agent_config) as ctx:
    # use ctx.cdp_url or ctx.transport to build the browser instance
    ...
Provider lifecycle code (create/destroy session) lives in browseruse_bench/browsers/providers/. Cleanup failures should be logged and tolerated — they must not mask errors from the task execution phase.
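The "log and tolerate" rule for cleanup can be sketched as a context manager that never lets teardown errors mask the task's own outcome (illustrative only; the real lifecycle code lives in browseruse_bench/browsers/providers/):

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def tolerant_session(create, destroy):
    """Yield a session; log (but swallow) any error raised during teardown."""
    session = create()
    try:
        yield session
    finally:
        try:
            destroy(session)
        except Exception:
            logger.warning("session cleanup failed", exc_info=True)


events = []


def failing_destroy(_session):
    events.append("destroy")
    raise RuntimeError("cleanup blew up")


# The teardown error is logged but does not propagate to the caller.
with tolerant_session(lambda: "session", failing_destroy) as s:
    events.append(f"used {s}")
```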