All benchmark data is from the reproducible benchmark suite. Clone the repo and run ./run.sh to reproduce every number on this page.

Methodology

  • Both servers are spawned via StdioClientTransport, identical to how any MCP client connects.
  • Real websites (saucedemo.com, books.toscrape.com, etc.), not synthetic test pages.
  • Each scenario: 1 warmup run (excluded) + N measured runs, median selected.
  • Token estimation: ceil(responsePayloadBytes / 4) (approximate heuristic).
  • All tool calls are timed with performance.now().
These benchmarks measure raw MCP tool efficiency, not prompt quality or LLM decision-making. All tool arguments are pre-written. In real agent flows, both tools would need additional calls for the LLM to decide what to do.
Each scenario uses Playwright’s most efficient tools: browser_fill_form for batched fills, browser_evaluate for direct JS extraction. Call counts are not artificially inflated.
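The measurement rules above (exclude the warmup run, take the median of the measured runs, estimate tokens as ceil(responsePayloadBytes / 4)) can be sketched as follows. This is a minimal illustration, not the benchmark suite's actual code: the function names and sample timings are hypothetical, and averaging the two middle values for an even run count is an assumption.

```typescript
// Token estimate from the methodology: ceil(responsePayloadBytes / 4).
function estimateTokens(responsePayloadBytes: number): number {
  return Math.ceil(responsePayloadBytes / 4);
}

// Median of a list of numbers. For an even count this averages the two
// middle values, which is an assumption; the suite may pick differently.
function median(values: number[]): number {
  if (values.length === 0) {
    throw new Error("median of empty list");
  }
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Drop the warmup runs at the front, then take the median of the rest.
function summarizeRuns(durationsMs: number[], warmupRuns: number = 1): number {
  return median(durationsMs.slice(warmupRuns));
}

// Hypothetical timings: one slow warmup run followed by three measured runs.
const medianMs = summarizeRuns([950, 310, 290, 330]); // 310
const tokenEstimate = estimateTokens(1001); // 251
```

Taking the median rather than the mean keeps a single slow run (for example, a cold network fetch) from skewing the reported time.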

Three-Variant Model

The benchmarks use three variants to separate BAP’s core advantage (composite actions) from its optimization layer (fused operations):
| Variant | Rules | What It Measures |
|---|---|---|
| BAP Standard | Must observe before acting, use refs from observe output. Re-observe after page navigation. | Apples-to-apples with Playwright |
| BAP Fused | Can use semantic selectors without prior observe. Can use fused navigate(observe:true) and act(postObserve:true). | BAP’s full optimization layer |
| Playwright | Standard snapshot-then-act workflow. Uses most efficient tools available. | Baseline |
The fair comparison is BAP Standard vs Playwright. Both follow the same observe-then-act pattern. BAP Fused is explicitly an optimization layer and is not an apples-to-apples comparison.

Results

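| Scenario | Site | BAP Standard | BAP Fused | Playwright | Std vs PW | Fused vs PW |
|---|---|---|---|---|---|---|
| baseline | quotes.toscrape.com | 2 | 2 | 2 | Tie | Tie |
| observe | news.ycombinator.com | 2 | 1 | 2 | Tie | -50% |
| extract | books.toscrape.com | 2 | 2 | 2 | Tie | Tie |
| form | the-internet.herokuapp.com | 4 | 3 | 5 | -20% | -40% |
| ecommerce | saucedemo.com | 8 | 5 | 11 | -27% | -55% |
| workflow | books.toscrape.com | 5 | 4 | 5 | Tie | -20% |
| Total | | 23 | 17 | 27 | ~15% | ~37% |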
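The totals and percentages follow directly from the per-scenario call counts. The sketch below recomputes them; the counts are copied from the results, while the type and function names are illustrative only.

```typescript
interface ScenarioCounts {
  scenario: string;
  bapStandard: number;
  bapFused: number;
  playwright: number;
}

// Call counts copied from the results above.
const scenarioCounts: ScenarioCounts[] = [
  { scenario: "baseline", bapStandard: 2, bapFused: 2, playwright: 2 },
  { scenario: "observe", bapStandard: 2, bapFused: 1, playwright: 2 },
  { scenario: "extract", bapStandard: 2, bapFused: 2, playwright: 2 },
  { scenario: "form", bapStandard: 4, bapFused: 3, playwright: 5 },
  { scenario: "ecommerce", bapStandard: 8, bapFused: 5, playwright: 11 },
  { scenario: "workflow", bapStandard: 5, bapFused: 4, playwright: 5 },
];

// Percentage of calls saved relative to the baseline, rounded to an integer.
function percentFewer(calls: number, baselineCalls: number): number {
  return Math.round(((baselineCalls - calls) / baselineCalls) * 100);
}

// Sum each column across all scenarios.
const callTotals = scenarioCounts.reduce(
  (acc, r) => ({
    bapStandard: acc.bapStandard + r.bapStandard,
    bapFused: acc.bapFused + r.bapFused,
    playwright: acc.playwright + r.playwright,
  }),
  { bapStandard: 0, bapFused: 0, playwright: 0 }
);

const stdVsPw = percentFewer(callTotals.bapStandard, callTotals.playwright); // 15
const fusedVsPw = percentFewer(callTotals.bapFused, callTotals.playwright); // 37
```

For example, the ecommerce row gives (11 - 8) / 11 ≈ 27% for BAP Standard and (11 - 5) / 11 ≈ 55% for BAP Fused.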

Scenario Breakdown

Baseline (2 vs 2)

Navigate to a page and take a screenshot. Both tools require exactly 2 calls. No difference.

Observe (2 vs 2, fused: 1)

Navigate and get a page snapshot. Standard BAP and Playwright both need 2 calls (navigate + snapshot/observe). BAP Fused does it in 1 call via navigate(observe:true).
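To make the difference concrete, the three call plans for this scenario can be written out as data. This is a sketch under assumptions: the BAP tool names and argument shapes are inferred from the navigate(observe:true) notation used on this page, browser_navigate and browser_snapshot are the Playwright MCP tool names, and the PlannedCall type is hypothetical.

```typescript
// Hypothetical description of one MCP tool call; not part of either server's API.
interface PlannedCall {
  tool: string;
  args: { [key: string]: unknown };
}

const targetUrl = "https://news.ycombinator.com";

// BAP Standard: navigate, then observe as a separate call.
const bapStandardPlan: PlannedCall[] = [
  { tool: "navigate", args: { url: targetUrl } },
  { tool: "observe", args: {} },
];

// BAP Fused: one call that navigates and returns the observation.
const bapFusedPlan: PlannedCall[] = [
  { tool: "navigate", args: { url: targetUrl, observe: true } },
];

// Playwright: navigate, then snapshot.
const playwrightPlan: PlannedCall[] = [
  { tool: "browser_navigate", args: { url: targetUrl } },
  { tool: "browser_snapshot", args: {} },
];

// Call counts: 2, 1, 2.
const observeCallCounts = [
  bapStandardPlan.length,
  bapFusedPlan.length,
  playwrightPlan.length,
];
```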

Extract (2 vs 2)

Navigate and extract structured data. BAP uses extract with a schema. Playwright uses browser_evaluate with custom JS. Same call count.

Form (4 vs 5)

Fill a login form and submit. BAP Standard takes 4 calls in total; its sequence includes an observe and a single act that batches fill + fill + click. Playwright takes 5 calls in total; its sequence includes snapshot, fill_form, click, and a second snapshot. BAP’s composite act saves 1 call.

Ecommerce (8 vs 11)

Multi-page shopping flow: login, browse, add to cart, checkout. BAP Standard saves 3 calls (8 vs 11) through composite actions. BAP Fused saves 6 calls in total (5 vs 11) by adding fused operations on top.

Workflow (5 vs 5, fused: 4)

Navigate, interact, extract across multiple pages. Standard BAP ties with Playwright. BAP Fused saves 1 call via navigate+observe fusion.

Where BAP Wins

  • Composite act: Batching multiple steps (fill + fill + click) into one call is the primary advantage. Most impactful in multi-step flows.
  • Fused operations: navigate(observe:true) and act(postObserve:true) eliminate redundant server roundtrips.
  • Structured extract: JSON Schema-based extraction vs writing custom JS for browser_evaluate.

Where Playwright Wins

  • Per-call latency: Playwright MCP is a single process. BAP’s two-process WebSocket architecture adds ~50-200ms per call. Playwright wins wall-clock time on most scenarios.
  • Element disambiguation: Playwright’s positional snapshot refs uniquely identify elements. BAP’s observe can return ambiguous selectors for identical elements (e.g., 6 “Add to cart” buttons on saucedemo.com).
  • Setup simplicity: npx @playwright/mcp — single process, no daemon management.
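The latency trade-off can be estimated with simple arithmetic. The sketch below uses the ~50-200ms per-call overhead quoted above and assumes, purely for illustration, that both servers otherwise spend the same average time per call; the function names are hypothetical.

```typescript
// Per-call overhead range quoted for BAP's two-process WebSocket architecture.
const OVERHEAD_MIN_MS = 50;
const OVERHEAD_MAX_MS = 200;

// Total extra wall-clock time BAP pays across a scenario.
function addedOverheadMs(bapCalls: number): { minMs: number; maxMs: number } {
  return {
    minMs: bapCalls * OVERHEAD_MIN_MS,
    maxMs: bapCalls * OVERHEAD_MAX_MS,
  };
}

// Average base time per call (t) above which fewer BAP calls win on
// wall-clock time. Assumes both servers share the same base time t, so BAP
// wins when bapCalls * (t + overhead) < playwrightCalls * t.
function breakEvenCallMs(
  bapCalls: number,
  playwrightCalls: number,
  overheadMs: number
): number {
  const savedCalls = playwrightCalls - bapCalls;
  if (savedCalls <= 0) {
    return Infinity; // No calls saved, so the overhead is never recovered.
  }
  return (bapCalls * overheadMs) / savedCalls;
}

// Ecommerce scenario: 8 BAP Standard calls vs 11 Playwright calls.
const ecommerceOverhead = addedOverheadMs(8); // { minMs: 400, maxMs: 1600 }
const ecommerceBreakEvenMs = breakEvenCallMs(8, 11, OVERHEAD_MIN_MS); // ~133.3
```

Under these assumptions, saving 3 of 11 calls only pays off on wall-clock time when an average call takes longer than roughly 133ms at the low end of the overhead range, which is consistent with Playwright winning most scenarios on latency.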

Known BAP Limitations

Discovered during benchmarking:
  • Identical elements: 6 “Add to cart” buttons on saucedemo.com produce ambiguous text:Add to cart selectors, causing Playwright strict mode violations. Workaround: use CSS ID selectors.
  • Missing accessible names: Cart icon on saucedemo.com has no accessible name, so BAP observe cannot discover it. Workaround: navigate directly to /cart.html.
  • Fused navigate on SPAs: navigate(observe:true) can return empty on first call in a new MCP session for SPA sites. Workaround: preflight navigate("about:blank").
  • Default maxElements: maxElements=50 truncates observe output on large pages (books.toscrape.com has 100+ sidebar category links before the pagination link).
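The identical-elements workaround amounts to falling back to a CSS ID selector whenever a text selector would match more than one element. A minimal sketch of that rule follows; the ObservedElement shape, the pickSelectors helper, and the sample element IDs are hypothetical and do not reflect BAP's actual observe output.

```typescript
// Hypothetical, simplified view of an observed element.
interface ObservedElement {
  text: string;
  id?: string;
}

// Prefer a text selector when the text is unique on the page; otherwise
// fall back to a CSS ID selector so the match stays unambiguous.
function pickSelectors(elements: ObservedElement[]): string[] {
  const textCounts: { [text: string]: number } = Object.create(null);
  elements.forEach((el) => {
    textCounts[el.text] = (textCounts[el.text] || 0) + 1;
  });
  return elements.map((el) => {
    if (textCounts[el.text] === 1) {
      return "text:" + el.text;
    }
    if (el.id) {
      return "#" + el.id;
    }
    throw new Error("ambiguous element without an id: " + el.text);
  });
}

// Illustrative input: two buttons share the same text, one is unique.
const pickedSelectors = pickSelectors([
  { text: "Add to cart", id: "add-to-cart-sauce-labs-backpack" },
  { text: "Add to cart", id: "add-to-cart-sauce-labs-bike-light" },
  { text: "Open Menu", id: "react-burger-menu-btn" },
]);
// ["#add-to-cart-sauce-labs-backpack", "#add-to-cart-sauce-labs-bike-light", "text:Open Menu"]
```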

Fairness Statement

These benchmarks are designed to be honest, not promotional:
  • BAP Standard is the fair comparison. It follows the same observe-then-act pattern as Playwright.
  • Latency favors Playwright. BAP’s two-process architecture adds WebSocket overhead per call.
  • Token estimation is approximate. ceil(bytes / 4) is a rough heuristic. Screenshots inflate counts due to base64 encoding.
  • No LLM involved. All tool arguments are pre-written.
  • BAP extract uses heuristics. Playwright’s browser_evaluate runs precise DOM queries and may return more accurate results.
The bottom line: BAP Standard uses ~15% fewer tool calls than Playwright in an apples-to-apples comparison, primarily from batching multi-step actions. BAP Fused extends this to ~37% through fusion. Playwright wins on per-call latency and element disambiguation.