# Benchmarks

> Full methodology, 6 scenarios, 3 variants, and fairness disclaimers for the BAP vs Playwright tool-call benchmarks.

All benchmark data is from the [reproducible benchmark suite](https://github.com/browseragentprotocol/benchmarks). Clone the repo and run `./run.sh` to reproduce every number on this page.

## Methodology

<AccordionGroup>
  <Accordion title="Test Setup">
    * Both servers spawned via `StdioClientTransport`, identical to how any MCP client connects
    * **Real websites** (saucedemo.com, books.toscrape.com, etc.), not synthetic test pages
    * Each scenario: 1 warmup run (excluded) + N measured runs, median selected
    * Token estimation: `ceil(responsePayloadBytes / 4)` (approximate heuristic)
    * All tool calls timed with `performance.now()`
  </Accordion>

  <Accordion title="No LLM Involved">
    These benchmarks measure raw MCP tool efficiency, not prompt quality or LLM decision-making. All
    tool arguments are pre-written. In real agent flows, both tools would need additional calls for
    the LLM to decide what to do.
  </Accordion>

  <Accordion title="Playwright Uses Best Available Tools">
    Each scenario uses Playwright's most efficient tools: `browser_fill_form` for batched fills,
    `browser_evaluate` for direct JS extraction. Call counts are not artificially inflated.
  </Accordion>
</AccordionGroup>
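As a rough illustration of the measurement rules in Test Setup, the sketch below applies the `ceil(responsePayloadBytes / 4)` token heuristic and picks the median of the measured runs after dropping the warmup. The function names are hypothetical and are not taken from the benchmark suite; averaging the two middle values for an even number of runs is also an assumption, since the page only says "median selected".

```typescript
// Token heuristic from the methodology: roughly 4 bytes per token.
function estimateTokens(responsePayloadBytes: number): number {
  return Math.ceil(responsePayloadBytes / 4);
}

// Median of the measured runs. The first entry is treated as the warmup
// run and excluded, matching "1 warmup run (excluded) + N measured runs".
function medianOfMeasuredRuns(allRunsMs: number[]): number {
  const measured = allRunsMs.slice(1); // drop the warmup run
  if (measured.length === 0) {
    throw new Error("need at least one measured run");
  }
  const sorted = [...measured].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Assumption: for an even count, average the two middle values.
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

For example, `medianOfMeasuredRuns([900, 120, 100, 110])` ignores the slow 900 ms warmup and returns 110.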

## Three-Variant Model

The benchmarks use three variants to separate BAP's core advantage (composite actions) from its optimization layer (fused operations):

| Variant          | Rules                                                                                                                 | What It Measures                 |
| ---------------- | --------------------------------------------------------------------------------------------------------------------- | -------------------------------- |
| **BAP Standard** | Must observe before acting, use refs from observe output. Re-observe after page navigation.                           | Apples-to-apples with Playwright |
| **BAP Fused**    | Can use semantic selectors without prior observe. Can use fused `navigate(observe:true)` and `act(postObserve:true)`. | BAP's full optimization layer    |
| **Playwright**   | Standard snapshot-then-act workflow. Uses most efficient tools available.                                             | Baseline                         |

<Warning>
  **The fair comparison is BAP Standard vs Playwright.** Both follow the same observe-then-act
  pattern. BAP Fused is explicitly an optimization layer and is not an apples-to-apples comparison.
</Warning>

## Results

| Scenario      | Site                       | BAP Standard | BAP Fused | Playwright | Std vs PW | Fused vs PW |
| ------------- | -------------------------- | :----------: | :-------: | :--------: | :-------: | :---------: |
| baseline      | quotes.toscrape.com        |       2      |     2     |      2     |    Tie    |     Tie     |
| observe       | news.ycombinator.com       |       2      |     1     |      2     |    Tie    |     -50%    |
| extract       | books.toscrape.com         |       2      |     2     |      2     |    Tie    |     Tie     |
| form          | the-internet.herokuapp.com |       4      |     3     |      5     |    -20%   |     -40%    |
| **ecommerce** | **saucedemo.com**          |     **8**    |   **5**   |   **11**   |  **-27%** |   **-55%**  |
| workflow      | books.toscrape.com         |       5      |     4     |      5     |    Tie    |     -20%    |
| **Total**     |                            |    **23**    |   **17**  |   **27**   | **\~-15%** |  **\~-37%** |
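The percentage columns can be re-derived from the call counts. The following is a minimal sketch that copies the counts from the table and computes the change relative to Playwright; the type and function names are illustrative, not part of the benchmark suite.

```typescript
type Row = {
  scenario: string;
  bapStandard: number;
  bapFused: number;
  playwright: number;
};

// Call counts copied from the results table.
const rows: Row[] = [
  { scenario: "baseline", bapStandard: 2, bapFused: 2, playwright: 2 },
  { scenario: "observe", bapStandard: 2, bapFused: 1, playwright: 2 },
  { scenario: "extract", bapStandard: 2, bapFused: 2, playwright: 2 },
  { scenario: "form", bapStandard: 4, bapFused: 3, playwright: 5 },
  { scenario: "ecommerce", bapStandard: 8, bapFused: 5, playwright: 11 },
  { scenario: "workflow", bapStandard: 5, bapFused: 4, playwright: 5 },
];

// Percent change in call count relative to Playwright, rounded to the
// nearest integer. Negative means fewer calls; 0 corresponds to "Tie".
function pctVsPlaywright(calls: number, playwright: number): number {
  return Math.round(((calls - playwright) / playwright) * 100);
}

const total = rows.reduce(
  (acc, r) => ({
    bapStandard: acc.bapStandard + r.bapStandard,
    bapFused: acc.bapFused + r.bapFused,
    playwright: acc.playwright + r.playwright,
  }),
  { bapStandard: 0, bapFused: 0, playwright: 0 },
);

const stdVsPw = pctVsPlaywright(total.bapStandard, total.playwright); // 23 vs 27 -> -15
const fusedVsPw = pctVsPlaywright(total.bapFused, total.playwright); // 17 vs 27 -> -37
```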

## Scenario Breakdown

### Baseline (2 vs 2)

Navigate to a page and take a screenshot. Both tools require exactly 2 calls. No difference.

### Observe (2 vs 2, fused: 1)

Navigate and get a page snapshot. Standard BAP and Playwright both need 2 calls (navigate + snapshot/observe). BAP Fused does it in 1 call via `navigate(observe:true)`.
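To make the call-count difference concrete, here is a small sketch that counts tool calls for the two BAP variants of this scenario. `ToolCaller` and `CountingCaller` are hypothetical stand-ins, not BAP's client API, and the argument shapes (`{ url }`, `{ url, observe: true }`) are assumptions based only on the `navigate(observe:true)` notation used on this page.

```typescript
// Hypothetical minimal interface for issuing tool calls (not BAP's real API).
interface ToolCaller {
  callTool(name: string, args: Record<string, unknown>): void;
}

// Fake caller that only records which tools were invoked.
class CountingCaller implements ToolCaller {
  calls: string[] = [];
  callTool(name: string, _args: Record<string, unknown>): void {
    this.calls.push(name);
  }
}

// BAP Standard: navigate, then a separate observe call.
function observeStandard(caller: ToolCaller, url: string): void {
  caller.callTool("navigate", { url });
  caller.callTool("observe", {});
}

// BAP Fused: a single navigate that also returns the observation.
function observeFused(caller: ToolCaller, url: string): void {
  caller.callTool("navigate", { url, observe: true });
}

const standard = new CountingCaller();
observeStandard(standard, "https://news.ycombinator.com");

const fused = new CountingCaller();
observeFused(fused, "https://news.ycombinator.com");
```

After running both, `standard.calls.length` is 2 and `fused.calls.length` is 1, matching the table row for this scenario.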

### Extract (2 vs 2)

Navigate and extract structured data. BAP uses `extract` with a schema. Playwright uses `browser_evaluate` with custom JS. Same call count.

### Form (4 vs 5)

Fill a login form and submit. BAP Standard takes 4 calls in total, with the fill + fill + click steps batched into a single `act` after an observe. Playwright takes 5 calls in total, including snapshot, `browser_fill_form`, click, and a final snapshot. BAP's composite `act` saves 1 call.

### Ecommerce (8 vs 11)

Multi-page shopping flow: login, browse, add to cart, checkout. BAP Standard saves 3 calls (8 vs 11) through composite actions. BAP Fused saves 6 calls (5 vs 11) by adding fused operations on top.

### Workflow (5 vs 5, fused: 4)

Navigate, interact, and extract across multiple pages. Standard BAP ties with Playwright. BAP Fused saves 1 call via navigate+observe fusion.

## Where BAP Wins

* **Composite `act`**: Batching multiple steps (fill + fill + click) into one call is the primary advantage. Most impactful in multi-step flows.
* **Fused operations**: `navigate(observe:true)` and `act(postObserve:true)` eliminate redundant server roundtrips.
* **Structured `extract`**: JSON Schema-based extraction vs writing custom JS for `browser_evaluate`.

## Where Playwright Wins

* **Per-call latency**: Playwright MCP is a single process. BAP's two-process WebSocket architecture adds \~50-200ms per call. Playwright wins wall-clock time on most scenarios.
* **Element disambiguation**: Playwright's positional snapshot refs uniquely identify elements. BAP's observe can return ambiguous selectors for identical elements (e.g., 6 "Add to cart" buttons on saucedemo.com).
* **Setup simplicity**: `npx @playwright/mcp` -- single process, no daemon management.
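The per-call latency point explains why fewer calls does not automatically mean less wall-clock time. As a back-of-the-envelope sketch, the added overhead can be bounded by multiplying the call count by the \~50-200ms range quoted above. This is arithmetic on the stated range only, not a measured result, and the function name is illustrative.

```typescript
// Lower and upper bound of added overhead, in milliseconds, for a given
// number of calls and a per-call overhead range [min, max].
function addedOverheadMs(
  calls: number,
  perCallMs: [number, number],
): [number, number] {
  return [calls * perCallMs[0], calls * perCallMs[1]];
}

// Per-call overhead range quoted for BAP's two-process architecture.
const bapPerCallMs: [number, number] = [50, 200];

// Ecommerce scenario call counts from the results table.
const standardOverhead = addedOverheadMs(8, bapPerCallMs); // [400, 1600]
const fusedOverhead = addedOverheadMs(5, bapPerCallMs); // [250, 1000]
```

Under this assumption, the ecommerce flow carries roughly 0.4 to 1.6 seconds of added overhead in BAP Standard, which can outweigh the time saved by making 3 fewer calls.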

## Known BAP Limitations

Discovered during benchmarking:

* **Identical elements**: 6 "Add to cart" buttons on saucedemo.com produce ambiguous `text:Add to cart` selectors, causing Playwright strict mode violations. Workaround: use CSS ID selectors.
* **Missing accessible names**: Cart icon on saucedemo.com has no accessible name, so BAP observe cannot discover it. Workaround: navigate directly to `/cart.html`.
* **Fused navigate on SPAs**: `navigate(observe:true)` can return empty on first call in a new MCP session for SPA sites. Workaround: preflight `navigate("about:blank")`.
* **Default maxElements**: `maxElements=50` truncates observe output on large pages (books.toscrape.com has 100+ sidebar category links before the pagination link).
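The `maxElements` truncation can be illustrated with a synthetic page. The sketch below assumes that observe keeps the first `maxElements` entries in document order, which this page does not state explicitly; the element shape and the 100-link page are made up for illustration and are not BAP's actual output format.

```typescript
// Hypothetical shape of an observed element (not BAP's real schema).
type ObservedElement = { role: string; name: string };

// Assumption: truncation keeps the first `maxElements` entries in order.
function truncateObserve(
  elements: ObservedElement[],
  maxElements: number = 50,
): ObservedElement[] {
  return elements.slice(0, maxElements);
}

// Synthetic page: 100 sidebar links followed by one pagination link.
const sidebar: ObservedElement[] = Array.from({ length: 100 }, (_, i) => ({
  role: "link",
  name: `Category ${i + 1}`,
}));
const page: ObservedElement[] = [...sidebar, { role: "link", name: "next" }];

const hasNext = (els: ObservedElement[]): boolean =>
  els.some((e) => e.name === "next");

const withDefault = truncateObserve(page); // pagination link is cut off
const withHigherLimit = truncateObserve(page, 200); // pagination link survives
```

With the default limit, `hasNext(withDefault)` is false because the pagination link sits after the first 50 elements; with a limit above the page's element count, it is true.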

## Fairness Statement

These benchmarks are designed to be honest, not promotional:

* **BAP Standard is the fair comparison.** It follows the same observe-then-act pattern as Playwright.
* **Latency favors Playwright.** BAP's two-process architecture adds WebSocket overhead per call.
* **Token estimation is approximate.** `ceil(bytes / 4)` is a rough heuristic. Screenshots inflate counts due to base64 encoding.
* **No LLM involved.** All tool arguments are pre-written.
* **BAP `extract` uses heuristics.** Playwright's `browser_evaluate` runs precise DOM queries and may return more accurate results.
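The note on screenshots follows from how base64 works: every 3 raw bytes become 4 encoded characters, so the payload grows by about a third before the `ceil(bytes / 4)` heuristic is applied. A minimal sketch, using a hypothetical 300,000-byte screenshot as the input:

```typescript
// Length of the padded base64 encoding of `rawBytes` bytes.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Token heuristic used by the benchmarks.
function estimateTokens(bytes: number): number {
  return Math.ceil(bytes / 4);
}

// Hypothetical screenshot size, chosen only for illustration.
const rawScreenshotBytes = 300_000;

const encodedBytes = base64Length(rawScreenshotBytes); // 400000
const tokensIfRaw = estimateTokens(rawScreenshotBytes); // 75000
const tokensAsBase64 = estimateTokens(encodedBytes); // 100000
```

The estimate for the encoded payload is about 33% higher than it would be for the raw bytes, so scenarios that return screenshots look more expensive in tokens than their information content suggests.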

<Note>
  The bottom line: BAP Standard uses \~15% fewer tool calls than Playwright in an apples-to-apples
  comparison, primarily from batching multi-step actions. BAP Fused extends this to \~37% through
  fusion. Playwright wins on per-call latency and element disambiguation.
</Note>
