Run ./run.sh to reproduce every number on this page.
Methodology
Test Setup
- Both servers spawned via StdioClientTransport, identical to how any MCP client connects
- Real websites (saucedemo.com, books.toscrape.com, etc.), not synthetic test pages
- Each scenario: 1 warmup run (excluded) + N measured runs, median selected
- Token estimation: ceil(responsePayloadBytes / 4) (approximate heuristic)
- All tool calls timed with performance.now()
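
The token heuristic and the median selection described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function names are hypothetical, and the tie-breaking rule for an even number of measured runs is an assumption, since the setup only says the median is selected.

```typescript
// Approximate token count from payload size: ceil(bytes / 4).
function estimateTokens(responsePayloadBytes: number): number {
  return Math.ceil(responsePayloadBytes / 4);
}

// Median of the measured runs. The first entry is treated as the warmup run
// and excluded. Assumption: with an even number of measured runs, the lower
// of the two middle values is selected, so the result is always a real run.
function medianOfMeasuredRuns(durationsMs: number[]): number {
  const measured = durationsMs.slice(1); // drop the warmup run
  if (measured.length === 0) {
    throw new Error("need at least one measured run after the warmup");
  }
  const sorted = [...measured].sort((a, b) => a - b);
  return sorted[Math.floor((sorted.length - 1) / 2)];
}

console.log(estimateTokens(1001)); // 251
console.log(medianOfMeasuredRuns([900, 120, 100, 140])); // 120
```

Dropping the warmup run before sorting keeps a slow first call (cold browser, empty caches) from shifting the median.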
No LLM Involved
These benchmarks measure raw MCP tool efficiency, not prompt quality or LLM decision-making. All
tool arguments are pre-written. In real agent flows, both tools would need additional calls for
the LLM to decide what to do.
Playwright Uses Best Available Tools
Each scenario uses Playwright’s most efficient tools: browser_fill_form for batched fills, browser_evaluate for direct JS extraction. Call counts are not artificially inflated.
Three-Variant Model
The benchmarks use three variants to separate BAP’s core advantage (composite actions) from its optimization layer (fused operations):

| Variant | Rules | What It Measures |
|---|---|---|
| BAP Standard | Must observe before acting, use refs from observe output. Re-observe after page navigation. | Apples-to-apples with Playwright |
| BAP Fused | Can use semantic selectors without prior observe. Can use fused navigate(observe:true) and act(postObserve:true). | BAP’s full optimization layer |
| Playwright | Standard snapshot-then-act workflow. Uses most efficient tools available. | Baseline |
Results
| Scenario | Site | BAP Standard | BAP Fused | Playwright | Std vs PW | Fused vs PW |
|---|---|---|---|---|---|---|
| baseline | quotes.toscrape.com | 2 | 2 | 2 | Tie | Tie |
| observe | news.ycombinator.com | 2 | 1 | 2 | Tie | -50% |
| extract | books.toscrape.com | 2 | 2 | 2 | Tie | Tie |
| form | the-internet.herokuapp.com | 4 | 3 | 5 | -20% | -40% |
| ecommerce | saucedemo.com | 8 | 5 | 11 | -27% | -55% |
| workflow | books.toscrape.com | 5 | 4 | 5 | Tie | -20% |
| Total | | 23 | 17 | 27 | ~15% | ~37% |
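
The totals and percentage reductions in the last row follow directly from the per-scenario call counts. A short sketch of that arithmetic, with the table values copied in by hand (the type and function names are illustrative only):

```typescript
interface ScenarioCalls {
  scenario: string;
  bapStandard: number;
  bapFused: number;
  playwright: number;
}

// Call counts copied from the results table.
const results: ScenarioCalls[] = [
  { scenario: "baseline", bapStandard: 2, bapFused: 2, playwright: 2 },
  { scenario: "observe", bapStandard: 2, bapFused: 1, playwright: 2 },
  { scenario: "extract", bapStandard: 2, bapFused: 2, playwright: 2 },
  { scenario: "form", bapStandard: 4, bapFused: 3, playwright: 5 },
  { scenario: "ecommerce", bapStandard: 8, bapFused: 5, playwright: 11 },
  { scenario: "workflow", bapStandard: 5, bapFused: 4, playwright: 5 },
];

type Variant = "bapStandard" | "bapFused" | "playwright";

// Sum one variant's calls across all scenarios.
function totalCalls(rows: ScenarioCalls[], variant: Variant): number {
  return rows.reduce((sum, row) => sum + row[variant], 0);
}

// Percentage of calls saved relative to the baseline.
function reductionPercent(candidate: number, baseline: number): number {
  return ((baseline - candidate) / baseline) * 100;
}

const std = totalCalls(results, "bapStandard"); // 23
const fused = totalCalls(results, "bapFused"); // 17
const pw = totalCalls(results, "playwright"); // 27

console.log(Math.round(reductionPercent(std, pw))); // 15
console.log(Math.round(reductionPercent(fused, pw))); // 37
```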
Scenario Breakdown
Baseline (2 vs 2)
Navigate to a page and take a screenshot. Both tools require exactly 2 calls. No difference.
Observe (2 vs 2, fused: 1)
Navigate and get a page snapshot. Standard BAP and Playwright both need 2 calls (navigate + snapshot/observe). BAP Fused does it in 1 call via navigate(observe:true).
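
To make the call-count effect of fusion concrete, the sketch below models a scenario as a list of call names and collapses the pairs that the Fused variant can issue as one call. The fuseCalls helper is hypothetical and not part of BAP. It only illustrates the counting, and the real scenario scripts may fuse fewer pairs than this helper would.

```typescript
type Call = string;

// Hypothetical helper (not part of BAP): folds an observe that directly
// follows a navigate or an act into the preceding call, mirroring
// navigate(observe:true) and act(postObserve:true).
function fuseCalls(standard: Call[]): Call[] {
  const fused: Call[] = [];
  for (let i = 0; i < standard.length; i++) {
    const current = standard[i];
    const next = standard[i + 1];
    if (current === "navigate" && next === "observe") {
      fused.push("navigate(observe:true)");
      i++; // skip the observe that was folded in
    } else if (current === "act" && next === "observe") {
      fused.push("act(postObserve:true)");
      i++; // skip the observe that was folded in
    } else {
      fused.push(current);
    }
  }
  return fused;
}

// Observe scenario: 2 calls in the Standard variant, 1 call when fused.
console.log(fuseCalls(["navigate", "observe"]).length); // 1
// Baseline scenario: nothing to fuse, so the count stays at 2.
console.log(fuseCalls(["navigate", "screenshot"]).length); // 2
```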
Extract (2 vs 2)
Navigate and extract structured data. BAP uses extract with a schema. Playwright uses browser_evaluate with custom JS. Same call count.
Form (4 vs 5)
Fill a login form and submit. BAP Standard: 4 calls, including observe + act (fill + fill + click batched into one act). Playwright: 5 calls, including snapshot + fill_form + click + snapshot. BAP’s composite act saves 1 call.
Ecommerce (8 vs 11)
Multi-page shopping flow: login, browse, add to cart, checkout. BAP Standard saves 3 calls through composite actions. BAP Fused saves 6 calls via fusion on top.
Workflow (5 vs 5, fused: 4)
Navigate, interact, extract across multiple pages. Standard BAP ties with Playwright. BAP Fused saves 1 call via navigate+observe fusion.
Where BAP Wins
- Composite act: Batching multiple steps (fill + fill + click) into one call is the primary advantage. Most impactful in multi-step flows.
- Fused operations: navigate(observe:true) and act(postObserve:true) eliminate redundant server roundtrips.
- Structured extract: JSON Schema-based extraction vs writing custom JS for browser_evaluate.
Where Playwright Wins
- Per-call latency: Playwright MCP is a single process. BAP’s two-process WebSocket architecture adds ~50-200ms per call. Playwright wins wall-clock time on most scenarios.
- Element disambiguation: Playwright’s positional snapshot refs uniquely identify elements. BAP’s observe can return ambiguous selectors for identical elements (e.g., 6 “Add to cart” buttons on saucedemo.com).
- Setup simplicity: npx @playwright/mcp runs as a single process with no daemon management.
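
The latency point can be made concrete with a rough model. The sketch below multiplies the ~50-200ms per-call overhead quoted above by a scenario's BAP call count, then estimates how slow the underlying calls would need to be before the saved calls pay for that overhead. The model is an assumption for illustration only: it treats every call as costing the same base time in both tools, which real measurements will not match exactly.

```typescript
// Per-call overhead range quoted above for BAP's WebSocket hop (ms).
const OVERHEAD_MS = { min: 50, max: 200 };

// Total overhead added across a scenario's BAP calls.
function addedOverheadMs(bapCalls: number): { min: number; max: number } {
  return {
    min: bapCalls * OVERHEAD_MS.min,
    max: bapCalls * OVERHEAD_MS.max,
  };
}

// Simplified model (assumption): each call costs the same base time t in both
// tools. BAP is faster overall only when
//   bapCalls * (t + overhead) < playwrightCalls * t,
// which rearranges to t > bapCalls * overhead / (playwrightCalls - bapCalls).
function breakEvenBaseCallMs(
  bapCalls: number,
  playwrightCalls: number,
  overheadPerCallMs: number,
): number {
  if (playwrightCalls <= bapCalls) {
    return Infinity; // no calls saved, so the overhead is never recovered
  }
  return (bapCalls * overheadPerCallMs) / (playwrightCalls - bapCalls);
}

// Ecommerce scenario, BAP Standard: 8 calls vs 11 for Playwright.
console.log(addedOverheadMs(8)); // { min: 400, max: 1600 }
console.log(Math.round(breakEvenBaseCallMs(8, 11, OVERHEAD_MS.min))); // 133
```

Under this model, scenarios where the call counts tie can never recover the overhead, which is consistent with Playwright winning wall-clock time on most scenarios.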
Known BAP Limitations
Discovered during benchmarking:
- Identical elements: 6 “Add to cart” buttons on saucedemo.com produce ambiguous text:Add to cart selectors, causing Playwright strict mode violations. Must use CSS ID selectors.
- Missing accessible names: Cart icon on saucedemo.com has no accessible name, so BAP observe cannot discover it. Must navigate directly to /cart.html.
- Fused navigate on SPAs: navigate(observe:true) can return empty on first call in a new MCP session for SPA sites. Workaround: preflight navigate("about:blank").
- Default maxElements: maxElements=50 truncates observe output on large pages (books.toscrape.com has 100+ sidebar category links before the pagination link).
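
The maxElements limitation is easy to reproduce in isolation. The sketch below uses a synthetic element list and a hypothetical truncateElements helper, and it assumes the cap keeps the first N elements in document order. BAP's actual ordering and filtering may differ.

```typescript
interface PageElement {
  role: string;
  name: string;
}

// Hypothetical stand-in for an element cap such as maxElements:
// keep only the first N elements in document order (assumption).
function truncateElements(
  elements: PageElement[],
  maxElements: number,
): PageElement[] {
  return elements.slice(0, maxElements);
}

// Synthetic page: 100 sidebar links followed by one pagination link.
const pageElements: PageElement[] = [
  ...Array.from({ length: 100 }, (_, i) => ({
    role: "link",
    name: `category ${i + 1}`,
  })),
  { role: "link", name: "next" },
];

const visible = truncateElements(pageElements, 50);
const hasNext = visible.some((el) => el.name === "next");
console.log(visible.length, hasNext); // 50 false
```

With a cap of 50, the pagination link at position 101 never reaches the caller, so raising the cap or narrowing the observed region is needed before the link can be acted on.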
Fairness Statement
These benchmarks are designed to be honest, not promotional:
- BAP Standard is the fair comparison. It follows the same observe-then-act pattern as Playwright.
- Latency favors Playwright. BAP’s two-process architecture adds WebSocket overhead per call.
- Token estimation is approximate. ceil(bytes / 4) is a rough heuristic. Screenshots inflate counts due to base64 encoding.
- No LLM involved. All tool arguments are pre-written.
- BAP extract uses heuristics. Playwright’s browser_evaluate runs precise DOM queries and may return more accurate results.
The bottom line: BAP Standard uses ~15% fewer tool calls than Playwright in an apples-to-apples
comparison, primarily from batching multi-step actions. BAP Fused extends this to ~37% through
fusion. Playwright wins on per-call latency and element disambiguation.