Crawling allows you to discover pages on a website automatically.

How It Works

When you call crawl(), Reader works through the following steps (a simplified sketch follows the list):
  1. Fetches the seed URL and extracts all links
  2. Filters links by domain, patterns, and robots.txt
  3. Queues new URLs using breadth-first search
  4. Continues until depth or page limits are reached
  5. Optionally scrapes content from discovered pages
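The code below is a minimal breadth-first sketch of that flow, not Reader's actual implementation. The fetchAndExtractLinks and isAllowed callbacks are hypothetical stand-ins for link extraction and the domain/pattern/robots.txt filtering described in step 2.

// Minimal breadth-first crawl sketch (illustrative only, not Reader's internals).
type QueueItem = { url: string; depth: number };

async function crawlSketch(
  seedUrl: string,
  maxDepth: number,
  maxPages: number,
  fetchAndExtractLinks: (url: string) => Promise<string[]>, // hypothetical helper
  isAllowed: (url: string) => boolean,                      // hypothetical helper
): Promise<string[]> {
  const seen = new Set<string>([seedUrl]);
  const queue: QueueItem[] = [{ url: seedUrl, depth: 0 }];
  const discovered: string[] = [];

  while (queue.length > 0 && discovered.length < maxPages) {
    const { url, depth } = queue.shift()!; // FIFO queue => breadth-first order
    discovered.push(url);
    if (depth >= maxDepth) continue; // don't follow links past the depth limit

    for (const link of await fetchAndExtractLinks(url)) {
      // Skip disallowed URLs (domain, patterns, robots.txt) and already-seen ones.
      if (!isAllowed(link) || seen.has(link)) continue;
      seen.add(link);
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return discovered;
}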

Basic Usage

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 50,
});

console.log(`Found ${result.urls.length} pages`);
result.urls.forEach((page) => {
  console.log(`- ${page.url}: ${page.title}`);
});

await reader.close();

Crawl with Scraping

To also scrape the content of discovered pages:
const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 50,
  scrape: true,
});

console.log(`Discovered ${result.urls.length} URLs`);
console.log(`Scraped ${result.scraped?.batchMetadata.successfulUrls} pages`);

// Access scraped content
result.scraped?.data.forEach((page) => {
  console.log(page.markdown);
});

Crawl Options

Option            Default    Description
url               required   Seed URL to start crawling
depth             1          Maximum crawl depth
maxPages          20         Maximum pages to discover
scrape            false      Also scrape content
delayMs           1000       Delay between requests (ms)
includePatterns   []         URL patterns to include (regex)
excludePatterns   []         URL patterns to exclude (regex)
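All of these options can be combined in a single call. The values below are illustrative only; the comments note the documented defaults.

const result = await reader.crawl({
  url: "https://example.com",    // required: seed URL
  depth: 2,                      // default: 1
  maxPages: 100,                 // default: 20
  scrape: false,                 // default: false
  delayMs: 1500,                 // default: 1000 (milliseconds)
  includePatterns: ["^/docs/"],  // default: [] (no include filter)
  excludePatterns: ["^/api/"],   // default: [] (no exclude filter)
});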

Depth Explained

Depth controls how far from the seed URL the crawler will go:
  • Depth 0: Only the seed URL
  • Depth 1: Seed URL + pages linked from it
  • Depth 2: Seed URL + linked pages + pages linked from those
Seed URL (depth 0)
├── Page A (depth 1)
│   ├── Page D (depth 2)
│   └── Page E (depth 2)
├── Page B (depth 1)
│   └── Page F (depth 2)
└── Page C (depth 1)
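For example, assuming each page in the tree above links only to its children, a crawl with depth: 1 discovers the seed plus Pages A, B, and C, but never reaches D, E, or F:

const shallow = await reader.crawl({
  url: "https://example.com", // seed (depth 0)
  depth: 1,                   // follow links one hop from the seed
  maxPages: 20,
});
// Pages D, E, and F sit at depth 2 and are not discovered.
console.log(shallow.urls.length); // 4 in the tree above, assuming the seed is counted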

URL Patterns

Filter which URLs are crawled using regex patterns:
const result = await reader.crawl({
  url: "https://example.com",
  depth: 3,
  maxPages: 100,
  includePatterns: ["^/blog/", "^/docs/"],
  excludePatterns: ["^/admin/", "^/api/"],
});
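The anchored patterns above (^/blog/, ^/admin/) suggest that patterns are matched against the URL path rather than the full URL; treat that as an assumption and verify against your Reader version. You can sanity-check your regexes with plain RegExp before starting a long crawl:

// Hypothetical local check, independent of Reader itself.
const include = [/^\/blog\//, /^\/docs\//];
const exclude = [/^\/admin\//, /^\/api\//];

const path = new URL("https://example.com/blog/post-1").pathname; // "/blog/post-1"
const allowed =
  include.some((re) => re.test(path)) && !exclude.some((re) => re.test(path));
console.log(allowed); // true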

Crawl Result Structure

interface CrawlResult {
  urls: CrawlUrl[];
  scraped?: ScrapeResult;
  metadata: CrawlMetadata;
}

interface CrawlUrl {
  url: string;
  title: string;
  description: string | null;
}

interface CrawlMetadata {
  totalUrls: number;
  maxDepth: number;
  totalDuration: number;
  seedUrl: string;
}
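The metadata fields can be read straight off any crawl result, for example to log a run:

const { totalUrls, maxDepth, totalDuration, seedUrl } = result.metadata;
// totalDuration's unit isn't stated above; milliseconds is assumed here.
console.log(`Crawl of ${seedUrl}: ${totalUrls} URLs, max depth ${maxDepth}`);
console.log(`Finished in ${totalDuration} ms`);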

Rate Limiting

Reader automatically adds delays between requests to avoid overwhelming servers:
const result = await reader.crawl({
  url: "https://example.com",
  delayMs: 2000, // 2 seconds between requests
});

Domain Restrictions

By default, crawling is restricted to the same domain as the seed URL. Links to external domains are not followed.
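As an illustration, every discovered URL should therefore share the seed's hostname. Subdomain handling isn't specified here, so the strict hostname comparison below is an assumption:

const seedHost = new URL("https://example.com").hostname;
const offDomain = result.urls.filter(
  (page) => new URL(page.url).hostname !== seedHost,
);
console.log(offDomain.length); // expected: 0 with the default restriction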
