> ## Documentation Index
> Fetch the complete documentation index at: https://piyushvyas.mintlify.site/llms.txt
> Use this file to discover all available pages before exploring further.

# Data Extraction & Scraping

> Extract structured data from web pages using bap extract with JSON Schema, field hints, and pagination.

## Basic Extraction

### CLI

```bash theme={null}
# Extract specific fields
bap extract --fields="title,price,rating"

# Extract a list of items
bap extract --list="product" --fields="name,price,url"

# Extract with a JSON schema
bap extract --schema='{"type":"array","items":{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}}'
```

### TypeScript SDK

```typescript theme={null}
const data = await client.extract({
  instruction: "Extract all product names and prices",
  schema: {
    type: "array",
    items: {
      type: "object",
      properties: {
        name: { type: "string", description: "Product name" },
        price: { type: "number", description: "Price in dollars" },
      },
    },
  },
  mode: "list",
});

if (data.success) {
  for (const product of data.data) {
    console.log(`${product.name}: $${product.price}`);
  }
}
```

### Python SDK

```python theme={null}
data = await client.extract(
    instruction="Extract all product names and prices",
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
            },
        },
    },
)
```

## Extraction Modes

| Mode     | Description                          | Use Case                          |
| -------- | ------------------------------------ | --------------------------------- |
| `single` | Extract one item matching the schema | Product detail page, user profile |
| `list`   | Extract all matching items           | Search results, product listings  |
| `table`  | Extract tabular data                 | Pricing tables, comparison charts |

```typescript theme={null}
// Single item extraction
const product = await client.extract({
  instruction: "Extract the main product details",
  schema: {
    type: "object",
    properties: {
      name: { type: "string" },
      price: { type: "number" },
      description: { type: "string" },
      inStock: { type: "boolean" },
    },
  },
  mode: "single",
});

// Table extraction
const pricing = await client.extract({
  instruction: "Extract the pricing comparison table",
  schema: {
    type: "array",
    items: {
      type: "object",
      properties: {
        plan: { type: "string" },
        price: { type: "string" },
        features: { type: "string" },
      },
    },
  },
  mode: "table",
});
```

## Scoped Extraction

Limit extraction to a specific container to avoid pulling data from sidebars or navigation:

```typescript theme={null}
const data = await client.extract({
  instruction: "Extract articles",
  schema: {
    /* ... */
  },
  selector: { type: "css", value: "main.content" },
});
```

<Tip>
  Scoped extraction is important on pages with complex layouts. Without a scope selector, the
  extractor may pick up sidebar items, footer links, or navigation elements that match your schema.
</Tip>

## Source References

Track which DOM elements contributed to each extracted value:

```typescript theme={null}
const data = await client.extract({
  instruction: "Extract product listings",
  schema: {
    /* ... */
  },
  includeSourceRefs: true,
});

if (data.sources) {
  for (const source of data.sources) {
    console.log(`Ref: ${source.ref}, Text: ${source.text}`);
  }
}
```

## Pagination

For multi-page scraping, combine extraction with navigation:

```typescript theme={null}
const allProducts = [];

while (true) {
  const page = await client.extract({
    instruction: "Extract all products on this page",
    schema: {
      type: "array",
      items: {
        type: "object",
        properties: {
          name: { type: "string" },
          price: { type: "number" },
        },
      },
    },
    mode: "list",
  });

  allProducts.push(...page.data);

  // Check for "Next" button
  const obs = await client.observe({ filterRoles: ["link"] });
  const nextLink = obs.interactiveElements?.find((el) => el.name === "Next");

  if (!nextLink) break; // No more pages
  await client.click(nextLink.selector);
}
```

### CLI Pagination

```bash theme={null}
# Extract from current page
bap extract --list="book" --fields="title,price"

# Navigate to next page and extract again
bap click text:"Next"
bap extract --list="book" --fields="title,price"
```

## Confidence Scores

Extraction results include an optional confidence score (0-1):

```typescript theme={null}
const data = await client.extract({
  /* ... */
});
console.log(`Confidence: ${data.confidence}`);
// 0.95 = high confidence
// 0.60 = some fields may be inaccurate
```

<Warning>
  The `extract` method uses heuristic-based DOM analysis, not LLM reasoning. For complex pages where
  the schema fields do not map cleanly to visible text, extraction accuracy may be lower. Consider
  using `observe/content` with your own parsing logic for edge cases.
</Warning>
