
Basic Extraction

CLI

# Extract specific fields
bap extract --fields="title,price,rating"

# Extract a list of items
bap extract --list="product" --fields="name,price,url"

# Extract with a JSON schema
bap extract --schema='{"type":"array","items":{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}}'
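Inline JSON schemas are easy to mistype inside shell quotes. One way to sidestep quoting errors (a sketch, not part of the bap CLI itself) is to build the schema as an object in a script and serialize it when constructing the command:

```typescript
// Build the schema as a plain object, then serialize it for the --schema flag.
// Wrapping the JSON in single quotes is safe here because none of these keys
// or values contain a single quote; for arbitrary strings you would need to
// escape them for your shell.
const schema = {
  type: "array",
  items: {
    type: "object",
    properties: {
      title: { type: "string" },
      price: { type: "number" },
    },
  },
};

const arg = `--schema='${JSON.stringify(schema)}'`;
console.log(`bap extract ${arg}`);
```

Serializing with JSON.stringify guarantees the flag receives valid JSON, which hand-typed one-liners do not.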

TypeScript SDK

const data = await client.extract({
  instruction: "Extract all product names and prices",
  schema: {
    type: "array",
    items: {
      type: "object",
      properties: {
        name: { type: "string", description: "Product name" },
        price: { type: "number", description: "Price in dollars" },
      },
    },
  },
  mode: "list",
});

if (data.success) {
  for (const product of data.data) {
    console.log(`${product.name}: $${product.price}`);
  }
}
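Because the extracted rows arrive untyped, it can help to narrow them with a runtime type guard before use. A minimal sketch, assuming a Product shape matching the schema above (the guard itself is not part of the SDK):

```typescript
interface Product {
  name: string;
  price: number;
}

// Runtime guard that narrows an unknown extraction row to Product.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.name === "string" && typeof v.price === "number";
}

// Example: drop any malformed rows before iterating.
const rows: unknown[] = [
  { name: "Widget", price: 9.99 },
  { name: "Broken" }, // missing price, rejected by the guard
];
const products: Product[] = rows.filter(isProduct);
console.log(products.length); // 1
```

Filtering with a type guard gives you a typed array without trusting the extractor's output blindly.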

Python SDK

data = await client.extract(
    instruction="Extract all product names and prices",
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
            },
        },
    },
)

Extraction Modes

Mode     Description                             Use Case
single   Extract one item matching the schema    Product detail page, user profile
list     Extract all matching items              Search results, product listings
table    Extract tabular data                    Pricing tables, comparison charts

// Single item extraction
const product = await client.extract({
  instruction: "Extract the main product details",
  schema: {
    type: "object",
    properties: {
      name: { type: "string" },
      price: { type: "number" },
      description: { type: "string" },
      inStock: { type: "boolean" },
    },
  },
  mode: "single",
});

// Table extraction
const pricing = await client.extract({
  instruction: "Extract the pricing comparison table",
  schema: {
    type: "array",
    items: {
      type: "object",
      properties: {
        plan: { type: "string" },
        price: { type: "string" },
        features: { type: "string" },
      },
    },
  },
  mode: "table",
});
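The expected result shape follows from the mode: single yields one object, while list and table yield arrays of row objects. A small sanity check along these lines (an assumption about result shapes, not a bap API) can catch mode/schema mismatches early:

```typescript
type ExtractionMode = "single" | "list" | "table";

// Returns true when the extracted payload has the shape the mode implies:
// a lone object for "single", an array for "list" and "table".
function matchesMode(mode: ExtractionMode, data: unknown): boolean {
  if (mode === "single") {
    return typeof data === "object" && data !== null && !Array.isArray(data);
  }
  return Array.isArray(data);
}

console.log(matchesMode("single", { name: "Widget" })); // true
console.log(matchesMode("list", [{ name: "Widget" }])); // true
console.log(matchesMode("table", { plan: "Pro" })); // false
```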

Scoped Extraction

Limit extraction to a specific container to avoid pulling data from sidebars or navigation:
const data = await client.extract({
  instruction: "Extract articles",
  schema: {
    /* ... */
  },
  selector: { type: "css", value: "main.content" },
});
Scoped extraction is important on pages with complex layouts. Without a scope selector, the extractor may pick up sidebar items, footer links, or navigation elements that match your schema.
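Conceptually, a CSS scope behaves like a descendant-combinator prefix: only elements inside the container can match. The idea can be sketched as a plain selector helper (hypothetical, not an SDK function):

```typescript
// Combine a scope container selector with an inner selector using CSS
// descendant-combinator semantics: matches must live inside the scope.
function scopeSelector(scope: string, inner: string): string {
  return `${scope.trim()} ${inner.trim()}`;
}

console.log(scopeSelector("main.content", "article h2"));
// "main.content article h2"
```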

Source References

Track which DOM elements contributed to each extracted value:
const data = await client.extract({
  instruction: "Extract product listings",
  schema: {
    /* ... */
  },
  includeSourceRefs: true,
});

if (data.sources) {
  for (const source of data.sources) {
    console.log(`Ref: ${source.ref}, Text: ${source.text}`);
  }
}
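When many extracted values share elements, it can be convenient to index the sources by ref for constant-time lookup. A sketch, assuming the `{ ref, text }` shape shown in the loop above:

```typescript
interface SourceRef {
  ref: string;
  text: string;
}

// Build a Map from element ref to its source entry for quick lookups.
function indexSources(sources: SourceRef[]): Map<string, SourceRef> {
  return new Map(sources.map((s) => [s.ref, s]));
}

const sources: SourceRef[] = [
  { ref: "e12", text: "Widget Pro" },
  { ref: "e19", text: "$49.00" },
];
const byRef = indexSources(sources);
console.log(byRef.get("e19")?.text); // "$49.00"
```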

Pagination

For multi-page scraping, combine extraction with navigation:
const allProducts = [];

while (true) {
  const page = await client.extract({
    instruction: "Extract all products on this page",
    schema: {
      type: "array",
      items: {
        type: "object",
        properties: {
          name: { type: "string" },
          price: { type: "number" },
        },
      },
    },
    mode: "list",
  });

  if (!page.success) break; // Stop if extraction fails on this page
  allProducts.push(...page.data);

  // Check for "Next" button
  const obs = await client.observe({ filterRoles: ["link"] });
  const nextLink = obs.interactiveElements?.find((el) => el.name === "Next");

  if (!nextLink) break; // No more pages
  await client.click(nextLink.selector);
}
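The loop above runs until no "Next" link is found, so a site whose last page still shows a Next link can loop forever. One defensive pattern (a generic sketch with injected callbacks, not a bap API) caps the page count:

```typescript
// Collects items page by page, stopping when hasNext() reports false
// or when maxPages is reached, whichever comes first.
async function paginate<T>(
  fetchPage: () => Promise<T[]>,
  hasNext: () => Promise<boolean>,
  goNext: () => Promise<void>,
  maxPages = 50,
): Promise<T[]> {
  const all: T[] = [];
  for (let i = 0; i < maxPages; i++) {
    all.push(...(await fetchPage()));
    if (!(await hasNext())) break;
    await goNext();
  }
  return all;
}
```

With the SDK, `fetchPage` would wrap `client.extract`, while `hasNext` and `goNext` would wrap the observe/click pair from the loop above.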

CLI Pagination

# Extract from current page
bap extract --list="book" --fields="title,price"

# Navigate to next page and extract again
bap click text:"Next"
bap extract --list="book" --fields="title,price"

Confidence Scores

Extraction results include an optional confidence score (0-1):
const data = await client.extract({
  /* ... */
});
console.log(`Confidence: ${data.confidence}`);
// 0.95 = high confidence
// 0.60 = some fields may be inaccurate
The extract method uses heuristic-based DOM analysis, not LLM reasoning. For complex pages where the schema fields do not map cleanly to visible text, extraction accuracy may be lower. Consider using observe/content with your own parsing logic for edge cases.
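Since the score is heuristic, a common pattern is to gate downstream use on a threshold and fall back to custom parsing below it. A minimal sketch (the 0.8 cutoff and result shape are assumptions for illustration):

```typescript
interface ExtractionResult<T> {
  success: boolean;
  data: T[];
  confidence?: number;
}

// Accept the extraction only when its confidence clears the threshold;
// a null return signals that a fallback (observe/content plus custom
// parsing) should be used instead.
function acceptIfConfident<T>(
  result: ExtractionResult<T>,
  threshold = 0.8,
): T[] | null {
  if (!result.success) return null;
  if ((result.confidence ?? 0) < threshold) return null;
  return result.data;
}

console.log(acceptIfConfident({ success: true, data: [1], confidence: 0.95 }));
console.log(acceptIfConfident({ success: true, data: [1], confidence: 0.6 }));
```

Treating a missing confidence as 0 keeps the check conservative: results without a score are never accepted silently.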