Benchmarks

All benchmark data is from the reproducible benchmark suite. Clone the repo and run ./run.sh to reproduce every number on this page.

Methodology

Test Setup

- Both servers are spawned via StdioClientTransport, identical to how any MCP client connects.
- Real websites (saucedemo.com, books.toscrape.com, etc.), not synthetic test pages.
- Each scenario: 1 warmup run (excluded) + N measured runs, median selected.
- Token estimation: ceil(responsePayloadBytes / 4) (approximate heuristic).
- All tool calls timed with performance.now().
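The measurement rules above can be sketched in a few lines. This is a minimal illustration, not the benchmark suite's own code: the function names `estimateTokens`, `median`, and `measure` are hypothetical, and averaging the two middle values for an even number of runs is an assumption, since the page only says the median is selected.

```typescript
// Hypothetical helpers; names are illustrative and not taken from the benchmark suite.

/** Approximate token count for a response payload: ceil(UTF-8 bytes / 4). */
function estimateTokens(payload: string): number {
  const bytes = new TextEncoder().encode(payload).length;
  return Math.ceil(bytes / 4);
}

/** Median of the measured runs. Even counts average the two middle values (assumption). */
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

/** One discarded warmup run, then `runs` timed runs; returns the median duration in ms. */
async function measure(scenario: () => Promise<void>, runs: number): Promise<number> {
  await scenario(); // warmup run, excluded from the results
  const durations: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await scenario();
    durations.push(performance.now() - start);
  }
  return median(durations);
}
```

Counting UTF-8 bytes rather than string length matters for non-ASCII responses, where one character can occupy several bytes.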

No LLM Involved

These benchmarks measure raw MCP tool efficiency, not prompt quality or LLM decision-making. All tool arguments are pre-written. In real agent flows, both tools would need additional calls for the LLM to decide what to do.

Playwright Uses Best Available Tools

Each scenario uses Playwright’s most efficient tools: browser_fill_form for batched fills, browser_evaluate for direct JS extraction. Call counts are not artificially inflated.

Three-Variant Model

The benchmarks use three variants to separate BAP’s core advantage (composite actions) from its optimization layer (fused operations):

Variant	Rules	What It Measures
BAP Standard	Must observe before acting, use refs from observe output. Re-observe after page navigation.	Apples-to-apples with Playwright
BAP Fused	Can use semantic selectors without prior observe. Can use fused `navigate(observe:true)` and `act(postObserve:true)`.	BAP’s full optimization layer
Playwright	Standard snapshot-then-act workflow. Uses most efficient tools available.	Baseline

The fair comparison is BAP Standard vs Playwright. Both follow the same observe-then-act pattern. BAP Fused is explicitly an optimization layer and is not an apples-to-apples comparison.
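To make the variant rules concrete, here is a rough sketch of the call sequences for a navigate-then-snapshot task under each variant. The BAP tool names and the `observe: true` flag come from the table above. The `ToolCall` type, the `url` argument name, and the Playwright tool names `browser_navigate` and `browser_snapshot` are assumptions for illustration and are not taken from this page.

```typescript
// Illustrative only: argument shapes are assumed, not copied from either server's tool schema.
type ToolCall = { tool: string; args: Record<string, unknown> };

const url = "https://news.ycombinator.com";

const plans: Record<string, ToolCall[]> = {
  // Must observe before acting, so navigation and observation are separate calls.
  "BAP Standard": [
    { tool: "navigate", args: { url } },
    { tool: "observe", args: {} },
  ],
  // The fused variant folds the observation into the navigate call.
  "BAP Fused": [{ tool: "navigate", args: { url, observe: true } }],
  // Snapshot-then-act baseline.
  Playwright: [
    { tool: "browser_navigate", args: { url } },
    { tool: "browser_snapshot", args: {} },
  ],
};

function callCount(variant: string): number {
  const plan = plans[variant];
  if (!plan) throw new Error(`unknown variant: ${variant}`);
  return plan.length;
}

for (const variant of Object.keys(plans)) {
  console.log(`${variant}: ${callCount(variant)} call(s)`);
}
```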

Results

Scenario	Site	BAP Standard	BAP Fused	Playwright	Std vs PW	Fused vs PW
baseline	quotes.toscrape.com	2	2	2	Tie	Tie
observe	news.ycombinator.com	2	1	2	Tie	-50%
extract	books.toscrape.com	2	2	2	Tie	Tie
form	the-internet.herokuapp.com	4	3	5	-20%	-40%
ecommerce	saucedemo.com	8	5	11	-27%	-55%
workflow	books.toscrape.com	5	4	5	Tie	-20%
Total		23	17	27	~-15%	~-37%
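The totals and percentages in the table can be rechecked from the per-scenario call counts. The snippet below is a small verification sketch; the variable and function names are illustrative, and rounding to the nearest integer is an assumption about how the table's percentages were produced.

```typescript
// Call counts copied from the results table: [BAP Standard, BAP Fused, Playwright].
const calls: Record<string, [number, number, number]> = {
  baseline: [2, 2, 2],
  observe: [2, 1, 2],
  extract: [2, 2, 2],
  form: [4, 3, 5],
  ecommerce: [8, 5, 11],
  workflow: [5, 4, 5],
};

/** Percentage change relative to the Playwright baseline, rounded to the nearest integer. */
function pctChange(variant: number, baseline: number): number {
  return Math.round(((variant - baseline) / baseline) * 100);
}

let std = 0;
let fused = 0;
let pw = 0;
for (const [s, f, p] of Object.values(calls)) {
  std += s;
  fused += f;
  pw += p;
}

console.log(std, fused, pw); // 23 17 27
console.log(pctChange(std, pw)); // -15
console.log(pctChange(fused, pw)); // -37
```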

Scenario Breakdown

Baseline (2 vs 2)

Navigate to a page and take a screenshot. Both tools require exactly 2 calls. No difference.

Observe (2 vs 2, fused: 1)

Navigate and get a page snapshot. Standard BAP and Playwright both need 2 calls (navigate + snapshot/observe). BAP Fused does it in 1 call via navigate(observe:true).

Extract (2 vs 2)

Navigate and extract structured data. BAP uses extract with a schema. Playwright uses browser_evaluate with custom JS. Same call count.

Form (4 vs 5)

Fill a login form and submit. BAP Standard needs 4 calls, with fill + fill + click batched into a single act after observe. Playwright needs 5 calls, including snapshot, fill_form, click, and a second snapshot. BAP’s composite act saves 1 call.
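The idea behind a composite act can be sketched as follows. The shapes here are hypothetical: the `steps` field, the `action` names, and the selector strings are assumptions for illustration, and BAP's actual `act` schema may differ. The sketch only shows why a batched call counts once; it does not reproduce the full call counts for this scenario.

```typescript
// Hypothetical shapes: `steps`, `action`, and the selectors are assumptions for illustration.
type Step = { action: "fill" | "click"; selector: string; value?: string };
type ToolCall = { tool: string; steps: Step[] };

// One composite act call carries three steps but counts as a single tool call.
const composite: ToolCall = {
  tool: "act",
  steps: [
    { action: "fill", selector: "#username", value: "<user>" },
    { action: "fill", selector: "#password", value: "<password>" },
    { action: "click", selector: "button[type=submit]" },
  ],
};

const batched: ToolCall[] = [composite];

// The same interaction issued one step per call, for comparison.
const unbatched: ToolCall[] = composite.steps.map((step) => ({ tool: "act", steps: [step] }));

console.log(batched.length, unbatched.length); // 1 3
```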

Ecommerce (8 vs 11)

Multi-page shopping flow: login, browse, add to cart, checkout. BAP Standard saves 3 calls (8 vs 11) through composite actions. BAP Fused saves 6 calls (5 vs 11), with the additional 3 coming from fusion.

Workflow (5 vs 5, fused: 4)

Navigate, interact, extract across multiple pages. Standard BAP ties with Playwright. BAP Fused saves 1 call via navigate+observe fusion.

Where BAP Wins

Composite act: Batching multiple steps (fill + fill + click) into one call is the primary advantage. Most impactful in multi-step flows.
Fused operations: navigate(observe:true) and act(postObserve:true) eliminate redundant server roundtrips.
Structured extract: JSON Schema-based extraction vs writing custom JS for browser_evaluate.

Where Playwright Wins

Per-call latency: Playwright MCP is a single process. BAP’s two-process WebSocket architecture adds ~50-200ms per call. Playwright wins wall-clock time on most scenarios.
Element disambiguation: Playwright’s positional snapshot refs uniquely identify elements. BAP’s observe can return ambiguous selectors for identical elements (e.g., 6 “Add to cart” buttons on saucedemo.com).
Setup simplicity: npx @playwright/mcp — single process, no daemon management.
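Fewer calls do not automatically mean less wall-clock time. As a rough illustration, assume every call costs the same base time t in both tools and BAP adds a fixed overhead o per call. BAP is then faster only when bapCalls × (t + o) < pwCalls × t, that is, when t > o × bapCalls / (pwCalls − bapCalls). This uniform-cost model and the function name `breakEvenMs` are assumptions for the sake of the arithmetic, not measurements from the benchmark.

```typescript
/**
 * Base per-call time (ms) above which BAP's lower call count outweighs its
 * per-call overhead, under the simplifying assumption of uniform call cost.
 * Returns Infinity when BAP saves no calls.
 */
function breakEvenMs(bapCalls: number, pwCalls: number, overheadMs: number): number {
  if (bapCalls >= pwCalls) return Infinity;
  return (overheadMs * bapCalls) / (pwCalls - bapCalls);
}

// Ecommerce scenario, BAP Standard (8 calls) vs Playwright (11 calls),
// at both ends of the ~50-200ms overhead range quoted above.
console.log(Math.round(breakEvenMs(8, 11, 50))); // 133
console.log(Math.round(breakEvenMs(8, 11, 200))); // 533
```

Under this model, the ecommerce flow would only favor BAP Standard on wall-clock time if a typical call took longer than roughly 133 to 533 ms before overhead.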

Known BAP Limitations

Discovered during benchmarking:

Identical elements: 6 “Add to cart” buttons on saucedemo.com produce ambiguous text:Add to cart selectors, causing Playwright strict mode violations. Must use CSS ID selectors.
Missing accessible names: Cart icon on saucedemo.com has no accessible name — BAP observe cannot discover it. Must navigate directly to /cart.html.
Fused navigate on SPAs: navigate(observe:true) can return empty on first call in a new MCP session for SPA sites. Workaround: preflight navigate("about:blank").
Default maxElements: maxElements=50 truncates observe output on large pages (books.toscrape.com has 100+ sidebar category links before the pagination link).
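The identical-elements workaround amounts to falling back to a CSS ID selector whenever a text selector would match more than one element. A minimal sketch of that fallback is below. The `ObservedElement` record, the `chooseSelector` function, and the element IDs are all hypothetical; only the `text:` selector prefix and the ID workaround come from the list above.

```typescript
// Hypothetical element records; field names are assumptions, not BAP's observe output format.
type ObservedElement = { text: string; id?: string };

/** Prefer a text selector when the text is unique; otherwise fall back to a CSS ID selector. */
function chooseSelector(el: ObservedElement, all: ObservedElement[]): string {
  const sameText = all.filter((other) => other.text === el.text).length;
  if (sameText === 1) return `text:${el.text}`;
  if (el.id) return `#${el.id}`;
  throw new Error(`ambiguous element with no id: "${el.text}"`);
}

const elements: ObservedElement[] = [
  { text: "Add to cart", id: "add-to-cart-item-1" }, // made-up ids
  { text: "Add to cart", id: "add-to-cart-item-2" },
  { text: "Checkout", id: "checkout" },
];

console.log(elements.map((el) => chooseSelector(el, elements)));
// [ '#add-to-cart-item-1', '#add-to-cart-item-2', 'text:Checkout' ]
```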

Fairness Statement

These benchmarks are designed to be honest, not promotional:

BAP Standard is the fair comparison. It follows the same observe-then-act pattern as Playwright.
Latency favors Playwright. BAP’s two-process architecture adds WebSocket overhead per call.
Token estimation is approximate. ceil(bytes / 4) is a rough heuristic. Screenshots inflate counts due to base64 encoding.
No LLM involved. All tool arguments are pre-written.
BAP extract uses heuristics. Playwright’s browser_evaluate runs precise DOM queries and may return more accurate results.
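The screenshot inflation mentioned above follows from base64 itself: padded base64 emits 4 characters for every 3 input bytes, so an encoded payload is about a third larger than the raw image. The sketch below applies the page's ceil(bytes / 4) heuristic to both sizes. The function names and the 300,000-byte screenshot size are made up for illustration.

```typescript
/** Length in characters (one byte each) of the padded base64 encoding of `rawBytes` bytes. */
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

/** The page's heuristic: ceil(bytes / 4). */
function estimateTokens(payloadBytes: number): number {
  return Math.ceil(payloadBytes / 4);
}

const screenshotBytes = 300_000; // made-up size for illustration

console.log(estimateTokens(screenshotBytes)); // 75000
console.log(estimateTokens(base64Length(screenshotBytes))); // 100000
```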

The bottom line: BAP Standard uses ~15% fewer tool calls than Playwright in an apples-to-apples comparison, primarily from batching multi-step actions. BAP Fused extends this to ~37% through fusion. Playwright wins on per-call latency and element disambiguation.
