The crawl() primitive
A plain `crawl()` call returns `{ url, title, description }` entries for every discovered page, but no content.
Crawl and scrape in one call
Set `scrape: true` to scrape every discovered page as it's found.
`result.scraped` is a full `ScrapeResult` with markdown for every page.
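A sketch of the two call shapes. The `crawl` function below is a local stub standing in for the real primitive, so the entry shape and option names are illustrative, not the library's exact API:

```typescript
// Usage sketch only: `crawl` here is a stub, not the real implementation.
interface CrawlEntry {
  url: string;
  title: string;
  description: string;
  scraped?: { markdown: string }; // present only when scrape: true
}

async function crawl(
  seed: string,
  opts: { depth?: number; maxPages?: number; scrape?: boolean } = {},
): Promise<CrawlEntry[]> {
  // Stub: a real crawl would BFS from `seed` (see "How the BFS works").
  const entry: CrawlEntry = { url: seed, title: "Seed", description: "" };
  if (opts.scrape) entry.scraped = { markdown: "# Seed\n" };
  return [entry];
}

// Discovery only: url/title/description, no content.
const discovered = await crawl("https://example.com/docs");

// Crawl and scrape in one call: each entry gains scraped.markdown.
const full = await crawl("https://example.com/docs", { scrape: true });
```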
How the BFS works
Reader uses breadth-first search starting from the seed URL:

- Fetch the seed and extract all links via DOM parsing (linkedom).
- Filter the links: same domain, not already visited, matches `includePatterns`, doesn't match `excludePatterns`, not blocked by robots.txt.
- Enqueue matching links at `depth + 1` if `depth < maxDepth`.
- Rate limit: wait `delayMs` (default 1000 ms) before the next request.
- Stop when the queue is empty or `maxPages` is hit.
Because BFS visits links in a deterministic order, the first `maxPages` pages you get are the same on every run.
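The loop above can be sketched as a self-contained function over an in-memory link graph. `crawlBfs` and `Graph` are illustrative names; robots.txt checks, pattern filters, and `delayMs` are omitted to keep the sketch short:

```typescript
type Graph = Record<string, string[]>;

// BFS from a seed over an in-memory link graph. `depth` and `maxPages`
// mirror the crawl options in the table below.
function crawlBfs(
  graph: Graph,
  seed: string,
  opts: { depth: number; maxPages: number },
): string[] {
  const visited = new Set<string>([seed]);
  const order: string[] = [];
  // Queue holds [url, hops-from-seed]; FIFO order makes runs deterministic.
  const queue: Array<[string, number]> = [[seed, 0]];
  while (queue.length > 0 && order.length < opts.maxPages) {
    const [url, d] = queue.shift()!;
    order.push(url);
    if (d >= opts.depth) continue; // don't enqueue beyond max hops
    for (const link of graph[url] ?? []) {
      if (!visited.has(link)) {
        visited.add(link); // mark at enqueue time so a URL is queued once
        queue.push([link, d + 1]);
      }
    }
  }
  return order;
}
```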
Crawl control
| Option | Default | Purpose |
|---|---|---|
| `depth` | `1` | Max hops from the seed URL |
| `maxPages` | `20` | Stop after N pages regardless of depth |
| `delayMs` | `1000` | Rate limit between requests |
| `includePatterns` | `[]` | Regex; URL must match at least one |
| `excludePatterns` | `[]` | Regex; URL must not match any |
Both pattern options take arrays of regex strings; `includePatterns` is the usual way to scope a crawl to the pages you care about.
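A sketch of how the two filters compose, matching the semantics in the table (at least one include must match, no exclude may match). `passesFilters` and the example URLs are made up for illustration:

```typescript
// A URL passes if it matches at least one includePattern (or the list is
// empty) and matches no excludePattern.
function passesFilters(
  url: string,
  includePatterns: string[],
  excludePatterns: string[],
): boolean {
  const included =
    includePatterns.length === 0 ||
    includePatterns.some((p) => new RegExp(p).test(url));
  const excluded = excludePatterns.some((p) => new RegExp(p).test(url));
  return included && !excluded;
}
```

Note that empty `includePatterns` means "match everything", while `excludePatterns` always wins when both match the same URL.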
Same-domain only
Crawling is constrained to the seed's domain. Reader does not follow external links - that would explode the crawl scope and credits unpredictably. If you need cross-domain discovery, build it yourself by calling `crawl()` for each domain separately.
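A sketch of that workaround: one same-domain crawl per seed, results merged. `crawl` is again a local stub standing in for the real primitive:

```typescript
// Stub: the real crawl() returns discovered entries for the seed's domain.
async function crawl(
  seed: string,
  opts: { maxPages?: number } = {},
): Promise<Array<{ url: string }>> {
  return [{ url: seed }];
}

// Cross-domain discovery: run one same-domain crawl per seed and merge.
const seeds = ["https://docs.example.com", "https://blog.example.com"];
const merged = (
  await Promise.all(seeds.map((s) => crawl(s, { maxPages: 10 })))
).flat();
```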
Sticky proxy per crawl
When you configure proxy pools, Reader picks one proxy at the start of a crawl session and uses it for every request in that crawl. This mimics how a real user would browse a site from a single IP - rotating IPs mid-crawl tends to trip anti-bot systems. See Proxy Tiers.

Where to go next
Website Crawling guide
Patterns for scoping, filtering, and scraping crawls.
CrawlOptions reference
Every option, every default.

