Crawling allows you to discover pages on a website automatically.

How It Works

When you call crawl(), Reader:
  1. Fetches the seed URL and extracts all links
  2. Filters links by domain, patterns, and robots.txt
  3. Queues new URLs using breadth-first search
  4. Continues until depth or page limits are reached
  5. Optionally scrapes content from discovered pages
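
Conceptually, steps 1–4 form an ordinary breadth-first loop. The sketch below is a simplified illustration, not Reader's actual implementation; extractLinks and isAllowed are hypothetical placeholders for link extraction and the domain/pattern/robots.txt filtering:

// Simplified breadth-first crawl loop (illustrative only).
async function crawlSketch(
  seedUrl: string,
  maxDepth: number,
  maxPages: number,
  extractLinks: (url: string) => Promise<string[]>,
  isAllowed: (url: string) => boolean,
): Promise<string[]> {
  const seen = new Set<string>([seedUrl]);
  const queue: { url: string; depth: number }[] = [{ url: seedUrl, depth: 0 }];
  const discovered: string[] = [];

  while (queue.length > 0 && discovered.length < maxPages) {
    const { url, depth } = queue.shift()!; // FIFO queue => breadth-first order
    discovered.push(url);

    if (depth >= maxDepth) continue; // stop expanding beyond the depth limit

    for (const link of await extractLinks(url)) {
      if (!seen.has(link) && isAllowed(link)) {
        seen.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return discovered;
}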

Basic Usage

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 50,
});

console.log(`Found ${result.urls.length} pages`);
result.urls.forEach((page) => {
  console.log(`- ${page.url}: ${page.title}`);
});

await reader.close();

Crawl with Scraping

To also scrape the content of discovered pages:
const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 50,
  scrape: true,
});

console.log(`Discovered ${result.urls.length} URLs`);
console.log(`Scraped ${result.scraped?.batchMetadata.successfulUrls} pages`);

// Access scraped content
result.scraped?.data.forEach((page) => {
  console.log(page.markdown);
});

Crawl Options

Option          | Default  | Description
----------------|----------|---------------------------------
url             | required | Seed URL to start crawling
depth           | 1        | Maximum crawl depth
maxPages        | 20       | Maximum pages to discover
scrape          | false    | Also scrape content
delayMs         | 1000     | Delay between requests
includePatterns | []       | URL patterns to include (regex)
excludePatterns | []       | URL patterns to exclude (regex)
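
These options can be combined in a single call; the values below are purely illustrative:

const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 100,
  scrape: false,
  delayMs: 1500,
  includePatterns: ["^/docs/"],
  excludePatterns: ["^/docs/archive/"],
});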

Depth Explained

Depth controls how far from the seed URL the crawler will go:
  • Depth 0: Only the seed URL
  • Depth 1: Seed URL + pages linked from it
  • Depth 2: Seed URL + linked pages + pages linked from those
Seed URL (depth 0)
├── Page A (depth 1)
│   ├── Page D (depth 2)
│   └── Page E (depth 2)
├── Page B (depth 1)
│   └── Page F (depth 2)
└── Page C (depth 1)
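
For example, if you only want the seed page itself, a depth of 0 should return a single entry (assuming the depth-0 behavior described above):

const result = await reader.crawl({
  url: "https://example.com",
  depth: 0,
});

// Only the seed URL is discovered at depth 0
console.log(result.urls.length); // 1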

URL Patterns

Filter which URLs are crawled using regex patterns:
const result = await reader.crawl({
  url: "https://example.com",
  depth: 3,
  maxPages: 100,
  includePatterns: ["^/blog/", "^/docs/"],
  excludePatterns: ["^/admin/", "^/api/"],
});

Crawl Result Structure

interface CrawlResult {
  urls: CrawlUrl[];
  scraped?: ScrapeResult;
  metadata: CrawlMetadata;
}

interface CrawlUrl {
  url: string;
  title: string;
  description: string | null;
}

interface CrawlMetadata {
  totalUrls: number;
  maxDepth: number;
  totalDuration: number;
  seedUrl: string;
}
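
For example, the metadata can be used to print a quick summary of a finished crawl (assuming totalDuration is reported in milliseconds):

const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
});

console.log(`Crawled ${result.metadata.totalUrls} URLs from ${result.metadata.seedUrl}`);
console.log(`Max depth reached: ${result.metadata.maxDepth}`);
console.log(`Duration: ${result.metadata.totalDuration} ms`); // assuming milliseconds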

Rate Limiting

Reader automatically adds delays between requests to avoid overwhelming servers:
const result = await reader.crawl({
  url: "https://example.com",
  delayMs: 2000, // 2 seconds between requests
});

Domain Restrictions

By default, crawling is restricted to the same domain as the seed URL. Links to external domains are not followed.
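
You can verify this after a crawl; the check below assumes "same domain" means the same hostname as the seed URL (subdomain handling may differ):

const seed = "https://example.com";
const result = await reader.crawl({ url: seed, depth: 2 });

// Every discovered URL should share the seed's hostname
const seedHost = new URL(seed).hostname;
const allSameDomain = result.urls.every((page) => new URL(page.url).hostname === seedHost);
console.log(allSameDomain); // expected: true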

Next Steps

  • Basic Scraping: learn about scraping individual URLs
  • Proxy Configuration: use proxies for crawling