## Required

| Option | Type | Description |
| --- | --- | --- |
| `url` | `string` | Seed URL where crawling starts |

## Crawl control

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `depth` | `number` | `1` | Maximum crawl depth from the seed |
| `maxPages` | `number` | `20` | Maximum number of pages to discover (hard limit) |
| `scrape` | `boolean` | `false` | Also scrape each discovered page |
| `delayMs` | `number` | `1000` | Delay between requests, in milliseconds (rate limiting) |
| `timeoutMs` | `number` | – | Total crawl timeout, in milliseconds |
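The interaction between `depth` and `maxPages` can be sketched as a breadth-first traversal over an in-memory link graph. This is an illustrative sketch of the documented semantics, not Reader's implementation; the `LinkGraph` type and `discover` helper are hypothetical:

```typescript
// Illustrative BFS: `depth` bounds how many hops from the seed are
// explored, while `maxPages` is a hard cap on total pages discovered.
type LinkGraph = Record<string, string[]>;

function discover(
  graph: LinkGraph,
  seed: string,
  depth: number,
  maxPages: number,
): string[] {
  const seen = new Set<string>([seed]);
  let frontier = [seed];
  for (let d = 0; d < depth && frontier.length > 0; d++) {
    const next: string[] = [];
    for (const page of frontier) {
      for (const link of graph[page] ?? []) {
        if (seen.size >= maxPages) return [...seen]; // hard limit hit
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return [...seen];
}

const graph: LinkGraph = {
  "/": ["/a", "/b"],
  "/a": ["/a/1", "/a/2"],
  "/b": ["/b/1"],
};

console.log(discover(graph, "/", 1, 20)); // ["/", "/a", "/b"]
console.log(discover(graph, "/", 2, 4).length); // 4 — capped by maxPages
```

Whichever limit is reached first ends discovery, which is why a deep `depth` with a small `maxPages` still returns quickly.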

## URL filtering

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `includePatterns` | `string[]` | `[]` | Regex patterns; a URL must match at least one |
| `excludePatterns` | `string[]` | `[]` | Regex patterns; a URL must not match any |

Crawling is always same-domain: Reader does not follow external links. Use these patterns to scope the crawl further within a domain.
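The filtering rules above can be expressed as a single predicate. This is a minimal sketch of the documented semantics (same domain, at least one include match, no exclude match); `isAllowed` and the sample patterns are illustrative, not part of the Reader API:

```typescript
// A URL passes when it stays on the seed's domain, matches at least one
// include pattern (when any are set), and matches no exclude pattern.
function isAllowed(
  url: string,
  seed: string,
  includePatterns: string[] = [],
  excludePatterns: string[] = [],
): boolean {
  if (new URL(url).hostname !== new URL(seed).hostname) return false;
  if (
    includePatterns.length > 0 &&
    !includePatterns.some((p) => new RegExp(p).test(url))
  ) {
    return false;
  }
  return !excludePatterns.some((p) => new RegExp(p).test(url));
}

const seed = "https://docs.example.com";
const include = ["^https://docs\\.example\\.com/(api|guides)/"];
const exclude = ["/archive/"];

console.log(isAllowed("https://docs.example.com/api/auth", seed, include, exclude)); // true
console.log(isAllowed("https://docs.example.com/api/archive/old", seed, include, exclude)); // false
console.log(isAllowed("https://blog.example.com/api/x", seed, include, exclude)); // false (different subdomain)
```

Note that patterns are tested against the full URL string, so anchoring with `^` keeps an include pattern from accidentally matching mid-URL.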

## When `scrape: true`

These options apply only when `scrape: true`:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `formats` | `Array<"markdown" \| "html">` | `["markdown"]` | Output formats |
| `scrapeConcurrency` | `number` | `2` | Number of parallel scrapes during the crawl |
| `removeAds` | `boolean` | `true` | Remove elements matching ad selectors |
| `removeBase64Images` | `boolean` | `true` | Strip inline base64 images |
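A bounded-concurrency runner illustrates what `scrapeConcurrency` controls: at most N scrape tasks in flight at once. This is a generic sketch of the pattern, not Reader's internals; `mapWithConcurrency` is a hypothetical helper:

```typescript
// Runs `fn` over `items` with at most `limit` tasks in flight,
// preserving result order — the shape of a scrapeConcurrency setting.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index before awaiting
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}

// Usage: "scrape" three URLs, two at a time.
mapWithConcurrency(["/a", "/b", "/c"], 2, async (u) => u.toUpperCase())
  .then((out) => console.log(out)); // ["/A", "/B", "/C"]
```

Raising the limit speeds up a crawl at the cost of heavier load on the target site, which is why it pairs with `delayMs` for rate limiting.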

## Proxy & misc

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `proxy` | `ProxyConfig` | – | Single proxy for this crawl |
| `proxyTier` | `"datacenter" \| "residential" \| "auto"` | – | Pick a proxy from the configured pool |
| `userAgent` | `string` | Chrome UA | Custom user agent |
| `verbose` | `boolean` | `false` | Enable logging |
| `showChrome` | `boolean` | `false` | Show the browser window |

## Example

```typescript
await reader.crawl({
  url: "https://docs.example.com",
  depth: 3,
  maxPages: 100,
  scrape: true,
  scrapeConcurrency: 3,
  formats: ["markdown"],
  includePatterns: ["^https://docs\\.example\\.com/(api|guides)/"],
  excludePatterns: ["/changelog/", "/archive/"],
  delayMs: 1500,
});
```

## Where to go next

- **CrawlResult**: the return type for every crawl call.
- **Crawling concept**: how BFS link discovery works.