## How It Works
When you call `crawl()`, Reader runs the following loop (sketched in code after the list):
- Fetches the seed URL and extracts all links
- Filters links by domain, patterns, and robots.txt
- Queues new URLs using breadth-first search
- Continues until depth or page limits are reached
- Optionally scrapes content from discovered pages
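A simplified sketch of that breadth-first loop is shown below. This is illustrative only, not Reader's actual implementation; it uses the global `fetch()` and a naive href regex in place of Reader's real link extraction and filtering.

```typescript
// Illustrative BFS crawl loop; not Reader's actual implementation.
async function crawlSketch(seed: string, depth: number, maxPages: number): Promise<string[]> {
  const seen = new Set<string>([seed]);
  const discovered: string[] = [seed];
  let frontier = [seed]; // URLs at the current depth level

  for (let level = 0; level < depth && discovered.length < maxPages; level++) {
    const next: string[] = [];
    for (const url of frontier) {
      const html = await (await fetch(url)).text();
      // Naive link extraction; a real crawler parses the DOM and applies
      // domain, pattern, and robots.txt filters at this point.
      for (const match of html.matchAll(/href="([^"#]+)"/g)) {
        const link = new URL(match[1], url).toString();
        if (!link.startsWith("http")) continue; // skip mailto:, javascript:, etc.
        if (seen.has(link) || discovered.length >= maxPages) continue;
        seen.add(link);
        discovered.push(link);
        next.push(link); // queue for the next depth level
      }
    }
    frontier = next;
  }
  return discovered;
}
```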
## Basic Usage
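A minimal crawl might look like the sketch below; the import path is an assumption, so adjust it to match your installation.

```typescript
import { crawl } from "reader"; // assumed import path

const result = await crawl({
  url: "https://example.com/docs", // seed URL (required)
  depth: 1,                        // follow links one level from the seed
  maxPages: 20,                    // stop discovering after 20 pages
});

console.log(result);
```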
## Crawl with Scraping
To also scrape the content of discovered pages, enable the `scrape` option:
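The sketch below reuses the assumed import from Basic Usage; the result fields in the loop are also assumptions about the result shape.

```typescript
const result = await crawl({
  url: "https://example.com/docs",
  depth: 1,
  scrape: true, // fetch and extract content from every discovered page
});

// Field names below are assumed, not Reader's confirmed API.
for (const page of result.pages) {
  console.log(page.url, page.content?.length);
}
```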
## Crawl Options

| Option | Default | Description |
|---|---|---|
| `url` | required | Seed URL to start crawling |
| `depth` | `1` | Maximum crawl depth |
| `maxPages` | `20` | Maximum pages to discover |
| `scrape` | `false` | Also scrape content from discovered pages |
| `delayMs` | `1000` | Delay between requests, in milliseconds |
| `includePatterns` | `[]` | URL patterns to include (regex) |
| `excludePatterns` | `[]` | URL patterns to exclude (regex) |
## Depth Explained
Depth controls how far from the seed URL the crawler will go (see the example after this list):
- Depth 0: Only the seed URL
- Depth 1: Seed URL + pages linked from it
- Depth 2: Seed URL + linked pages + pages linked from those
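For example, using the `depth` option from the table above (import path assumed as before):

```typescript
// Depth 0: only https://example.com itself is visited.
const seedOnly = await crawl({ url: "https://example.com", depth: 0 });

// Depth 2: the seed, pages it links to, and pages those link to.
const twoHops = await crawl({ url: "https://example.com", depth: 2, maxPages: 50 });
```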
## URL Patterns
Filter which URLs are crawled using regex patterns:
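A sketch, assuming patterns are passed as regex strings:

```typescript
const result = await crawl({
  url: "https://example.com/docs",
  depth: 2,
  includePatterns: ["^https://example\\.com/docs/"], // only follow docs URLs
  excludePatterns: ["\\.pdf$", "/archive/"],         // skip PDFs and the archive
});
```

## Crawl Result Structure

The exact result type depends on your Reader version; the shape below is an illustrative assumption, not the confirmed API.

```typescript
// Illustrative shape only; every field name here is an assumption.
interface CrawlResult {
  pages: Array<{
    url: string;      // discovered URL
    depth: number;    // hops from the seed
    content?: string; // present only when scrape: true
  }>;
}
```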
## Rate Limiting
Reader automatically adds delays between requests to avoid overwhelming servers. For a gentler crawl, raise `delayMs` above its 1000 ms default:
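```typescript
const result = await crawl({
  url: "https://example.com",
  delayMs: 2000, // wait two seconds between requests (default: 1000)
});
```

## Domain Restrictions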
By default, crawling is restricted to the same domain as the seed URL. Links to external domains are not followed.

## Next Steps
- **Basic Scraping**: Learn about scraping individual URLs
- **Proxy Configuration**: Use proxies for crawling

