Crawling allows you to discover pages on a website automatically.

How It Works

When you call crawl(), Reader works through the following steps (a simplified sketch follows the list):
  1. Fetches the seed URL and extracts all links
  2. Filters links by domain, patterns, and robots.txt
  3. Queues new URLs using breadth-first search
  4. Continues until depth or page limits are reached
  5. Optionally scrapes content from discovered pages
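The code below is a minimal breadth-first sketch of that flow, not Reader's actual implementation. The fetchAndExtractLinks and isAllowed callbacks are hypothetical stand-ins for link extraction and the domain/pattern/robots.txt filtering described in step 2.

// Minimal breadth-first crawl sketch (illustrative only, not Reader's internals).
type QueueItem = { url: string; depth: number };

async function crawlSketch(
  seedUrl: string,
  maxDepth: number,
  maxPages: number,
  fetchAndExtractLinks: (url: string) => Promise<string[]>, // hypothetical helper
  isAllowed: (url: string) => boolean,                      // hypothetical helper
): Promise<string[]> {
  const seen = new Set<string>([seedUrl]);
  const queue: QueueItem[] = [{ url: seedUrl, depth: 0 }];
  const discovered: string[] = [];

  while (queue.length > 0 && discovered.length < maxPages) {
    const { url, depth } = queue.shift()!; // FIFO queue => breadth-first order
    discovered.push(url);
    if (depth >= maxDepth) continue; // don't follow links past the depth limit

    for (const link of await fetchAndExtractLinks(url)) {
      // Skip disallowed URLs (domain, patterns, robots.txt) and already-seen ones.
      if (!isAllowed(link) || seen.has(link)) continue;
      seen.add(link);
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return discovered;
}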

Basic Usage

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 50,
});

console.log(`Found ${result.urls.length} pages`);
result.urls.forEach((page) => {
  console.log(`- ${page.url}: ${page.title}`);
});

await reader.close();

Crawl with Scraping

To also scrape the content of discovered pages:
const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 50,
  scrape: true,
});

console.log(`Discovered ${result.urls.length} URLs`);
console.log(`Scraped ${result.scraped?.batchMetadata.successfulUrls} pages`);

// Access scraped content
result.scraped?.data.forEach((page) => {
  console.log(page.markdown);
});

Crawl Options

Option            Default    Description
url               required   Seed URL to start crawling
depth             1          Maximum crawl depth
maxPages          20         Maximum pages to discover
scrape            false      Also scrape content
delayMs           1000       Delay between requests (ms)
includePatterns   []         URL patterns to include (regex)
excludePatterns   []         URL patterns to exclude (regex)
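All of these options can be combined in a single call. The values below are illustrative only; the comments note the documented defaults.

const result = await reader.crawl({
  url: "https://example.com",    // required: seed URL
  depth: 2,                      // default: 1
  maxPages: 100,                 // default: 20
  scrape: false,                 // default: false
  delayMs: 1500,                 // default: 1000 (milliseconds)
  includePatterns: ["^/docs/"],  // default: [] (no include filter)
  excludePatterns: ["^/api/"],   // default: [] (no exclude filter)
});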

Depth Explained

Depth controls how far from the seed URL the crawler will go:
  • Depth 0: Only the seed URL
  • Depth 1: Seed URL + pages linked from it
  • Depth 2: Seed URL + linked pages + pages linked from those
Seed URL (depth 0)
├── Page A (depth 1)
│   ├── Page D (depth 2)
│   └── Page E (depth 2)
├── Page B (depth 1)
│   └── Page F (depth 2)
└── Page C (depth 1)
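For example, assuming each page in the tree above links only to its children, a crawl with depth: 1 discovers the seed plus Pages A, B, and C, but never reaches D, E, or F:

const shallow = await reader.crawl({
  url: "https://example.com", // seed (depth 0)
  depth: 1,                   // follow links one hop from the seed
  maxPages: 20,
});
// Pages D, E, and F sit at depth 2 and are not discovered.
console.log(shallow.urls.length); // 4 in the tree above, assuming the seed is counted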

URL Patterns

Filter which URLs are crawled using regex patterns:
const result = await reader.crawl({
  url: "https://example.com",
  depth: 3,
  maxPages: 100,
  includePatterns: ["^/blog/", "^/docs/"],
  excludePatterns: ["^/admin/", "^/api/"],
});
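The anchored patterns above (^/blog/, ^/admin/) suggest that patterns are matched against the URL path rather than the full URL; treat that as an assumption and verify against your Reader version. You can sanity-check your regexes with plain RegExp before starting a long crawl:

// Hypothetical local check, independent of Reader itself.
const include = [/^\/blog\//, /^\/docs\//];
const exclude = [/^\/admin\//, /^\/api\//];

const path = new URL("https://example.com/blog/post-1").pathname; // "/blog/post-1"
const allowed =
  include.some((re) => re.test(path)) && !exclude.some((re) => re.test(path));
console.log(allowed); // true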

Crawl Result Structure

interface CrawlResult {
  urls: CrawlUrl[];
  scraped?: ScrapeResult;
  metadata: CrawlMetadata;
}

interface CrawlUrl {
  url: string;
  title: string;
  description: string | null;
}

interface CrawlMetadata {
  totalUrls: number;
  maxDepth: number;
  totalDuration: number;
  seedUrl: string;
}
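The metadata fields can be read straight off any crawl result, for example to log a run:

const { totalUrls, maxDepth, totalDuration, seedUrl } = result.metadata;
// totalDuration's unit isn't stated above; milliseconds is assumed here.
console.log(`Crawl of ${seedUrl}: ${totalUrls} URLs, max depth ${maxDepth}`);
console.log(`Finished in ${totalDuration} ms`);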

Rate Limiting

Reader automatically adds delays between requests to avoid overwhelming servers:
const result = await reader.crawl({
  url: "https://example.com",
  delayMs: 2000, // 2 seconds between requests
});

Domain Restrictions

By default, crawling is restricted to the same domain as the seed URL. Links to external domains are not followed.
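As an illustration, every discovered URL should therefore share the seed's hostname. Subdomain handling isn't specified here, so the strict hostname comparison below is an assumption:

const seedHost = new URL("https://example.com").hostname;
const offDomain = result.urls.filter(
  (page) => new URL(page.url).hostname !== seedHost,
);
console.log(offDomain.length); // expected: 0 with the default restriction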
