
## Required

| Option | Type | Description |
| --- | --- | --- |
| `urls` | `string[]` | URLs to scrape. Pass a single-element array for a single page. |

## Output

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `formats` | `Array<"markdown" \| "html">` | `["markdown"]` | Output formats to include in results. `rawHtml` is always returned regardless of this setting. |

## Request

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `userAgent` | `string` | Chrome UA | Custom user agent string |
| `headers` | `Record<string, string>` | `{}` | Additional HTTP headers |
| `timeoutMs` | `number` | `30000` | Request timeout per URL, in milliseconds |
| `waitForSelector` | `string` | - | Wait for this CSS selector to appear before extracting |
| `skipTLSVerification` | `boolean` | `true` | Skip TLS certificate verification |

## Content cleaning

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `onlyMainContent` | `boolean` | `true` | Extract only the main content (strips nav/header/footer) |
| `includeTags` | `string[]` | `[]` | CSS selectors to keep; everything else is removed |
| `excludeTags` | `string[]` | `[]` | CSS selectors to remove |
| `removeAds` | `boolean` | `true` | Remove common ad selectors |
| `removeBase64Images` | `boolean` | `true` | Strip inline base64 images |
| `navigationSelectors` | `string[]` | `[]` | Additional CSS selectors to remove when `onlyMainContent` is `true`. Merged with the built-in nav/footer/sidebar selectors. |
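As an illustration of what `removeBase64Images` does, here is a minimal sketch (not the library's implementation) that blanks inline `data:` image sources in an HTML string:

```typescript
// Sketch only: blank out inline base64 image sources so huge data: URIs
// do not end up in the extracted output.
function stripBase64Images(html: string): string {
  // Match src="data:image/..." (single or double quotes) and empty the value.
  return html.replace(/src\s*=\s*(["'])data:image\/[^"']*\1/gi, "src=$1$1");
}

const dirty = '<p>hi</p><img src="data:image/png;base64,iVBORw0KGgo=">';
console.log(stripBase64Images(dirty));
// <p>hi</p><img src="">
```

The real cleaner works on a parsed DOM rather than raw strings; this only shows the intent of the option.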

## Retry & escalation

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `hardDeadlineMs` | `number` | `30000` | Hard deadline for a single URL. After this, the scraper gives up. |
| `datacenterTimeoutMs` | `number` | `10000` | Timeout for the first attempt on a datacenter proxy. If no result arrives in time, the scraper escalates to a residential proxy. |
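The escalation flow can be sketched as follows; `Fetcher`, `escalatingFetch`, and `withTimeout` are hypothetical names for illustration, not the library's internals:

```typescript
// Sketch of the two-tier escalation described above: try the datacenter
// path first, fall back to residential if it produces nothing in time,
// and cap the whole attempt with the hard deadline.
type Fetcher = (url: string) => Promise<string>;

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const t = setTimeout(() => reject(new Error("timeout")), ms);
    p.then(
      (v) => { clearTimeout(t); resolve(v); },
      (e) => { clearTimeout(t); reject(e); },
    );
  });
}

async function escalatingFetch(
  url: string,
  datacenter: Fetcher,
  residential: Fetcher,
  datacenterTimeoutMs = 10_000,
  hardDeadlineMs = 30_000,
): Promise<string> {
  const start = Date.now();
  try {
    return await withTimeout(datacenter(url), datacenterTimeoutMs);
  } catch {
    // Escalate, but only spend whatever budget the hard deadline leaves.
    const remaining = hardDeadlineMs - (Date.now() - start);
    return withTimeout(residential(url), remaining);
  }
}
```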

## Batch processing

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `batchConcurrency` | `number` | `1` | Number of URLs to process in parallel |
| `batchTimeoutMs` | `number` | `300000` | Total timeout for the batch (5 minutes) |
| `maxRetries` | `number` | `2` | Maximum retries per URL before giving up |
| `onProgress` | `(p: ProgressEvent) => void` | - | Progress callback, invoked as each URL completes |
```typescript
interface ProgressEvent {
  completed: number;
  total: number;
  currentUrl: string;
}
```
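A sketch of how `batchConcurrency` and `onProgress` interact, assuming a simple shared-queue worker pool (`runBatch` and `scrapeOne` are hypothetical names, not the library's API):

```typescript
// Sketch: a fixed pool of workers pulls URLs off a shared index and
// reports progress after each one completes. Results keep input order.
async function runBatch(
  urls: string[],
  scrapeOne: (url: string) => Promise<string>,
  batchConcurrency = 1,
  onProgress?: (p: { completed: number; total: number; currentUrl: string }) => void,
): Promise<string[]> {
  const results = new Array<string>(urls.length);
  let next = 0;
  let completed = 0;

  async function worker() {
    while (next < urls.length) {
      const i = next++; // claim the next URL synchronously
      results[i] = await scrapeOne(urls[i]);
      completed++;
      onProgress?.({ completed, total: urls.length, currentUrl: urls[i] });
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(batchConcurrency, urls.length) }, () => worker()),
  );
  return results;
}
```

Note that `completed` counts finished URLs in completion order, while the returned array preserves input order.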

## Proxy

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `proxy` | `ProxyConfig` | - | Single proxy to use for this request |
| `proxyTier` | `"datacenter" \| "residential" \| "auto"` | - | Pick a proxy from the configured pool by tier |

## Pluggable config

These options let the caller inject platform-specific behavior. Reader ships with no built-in domain profiles, block-detection patterns, or URL rewriters.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `domainProfiles` | `Record<string, DomainProfile>` | `{}` | Per-domain overrides (proxy tier, timeout, concurrency), keyed by domain |
| `blockDetection` | `BlockDetectionConfig` | - | Bot-page detection config. Without this, no content-based block detection runs. |
| `urlRewriters` | `UrlRewriteRule[]` | `[]` | URL rewrite rules applied before scraping (e.g. Google Docs to export URL) |
```typescript
interface DomainProfile {
  proxyTier?: "datacenter" | "residential";
  timeoutMs?: number;
  batchConcurrency?: number;
  minDelayMs?: number;
  maxConcurrentPerProxy?: number;
}
```
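A sketch of how the per-domain lookup might behave, assuming profiles are keyed by exact hostname (`profileFor` is a hypothetical helper and only two fields are shown):

```typescript
// Sketch: resolve effective settings for a URL by overlaying the matching
// domain profile (if any) on top of the defaults.
function profileFor(
  rawUrl: string,
  profiles: Record<string, { proxyTier?: "datacenter" | "residential"; timeoutMs?: number }>,
  defaults: { proxyTier: "datacenter" | "residential"; timeoutMs: number },
): { proxyTier: "datacenter" | "residential"; timeoutMs: number } {
  const host = new URL(rawUrl).hostname;
  return { ...defaults, ...(profiles[host] ?? {}) };
}
```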

```typescript
interface BlockDetectionConfig {
  patterns?: Array<RegExp | string>;      // matched against page text
  titlePatterns?: Array<RegExp | string>; // matched against page title
  shortContentThreshold?: number;         // default: 500
  longContentSignalThreshold?: number;    // default: 3
}
```
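The table above does not spell out how these thresholds combine. One plausible reading is: a short page is flagged on a single pattern hit, while a long page needs `longContentSignalThreshold` independent hits. A hypothetical sketch (`looksBlocked` is not a real API, and the case-insensitive compilation of string patterns is an assumption):

```typescript
// Hypothetical heuristic: count pattern hits against text and title,
// then apply a stricter bar for long pages than for short ones.
function looksBlocked(
  text: string,
  title: string,
  cfg: {
    patterns?: Array<RegExp | string>;
    titlePatterns?: Array<RegExp | string>;
    shortContentThreshold?: number;
    longContentSignalThreshold?: number;
  },
): boolean {
  // Assumption: string patterns compile case-insensitively.
  const toRegex = (p: RegExp | string) => (typeof p === "string" ? new RegExp(p, "i") : p);
  const hits =
    (cfg.patterns ?? []).filter((p) => toRegex(p).test(text)).length +
    (cfg.titlePatterns ?? []).filter((p) => toRegex(p).test(title)).length;
  const short = text.length < (cfg.shortContentThreshold ?? 500);
  return short ? hits >= 1 : hits >= (cfg.longContentSignalThreshold ?? 3);
}
```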

```typescript
interface UrlRewriteRule {
  name: string;
  match: (url: URL) => boolean;
  rewrite: (url: URL) => string;
}
```
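An example rule in the `UrlRewriteRule` shape, for the Google Docs case mentioned above. The exact export URL format and the `applyRewriters` helper are illustrative assumptions, not part of Reader:

```typescript
// Example rule: rewrite a Google Docs edit URL to an HTML export URL.
const googleDocsExport = {
  name: "google-docs-export",
  match: (url: URL) =>
    url.hostname === "docs.google.com" && /^\/document\/d\/[^/]+/.test(url.pathname),
  rewrite: (url: URL) => {
    const id = url.pathname.split("/")[3]; // /document/d/<id>/...
    return `https://docs.google.com/document/d/${id}/export?format=html`;
  },
};

// Hypothetical driver: apply the first matching rule, else pass through.
function applyRewriters(
  raw: string,
  rules: Array<{ match: (u: URL) => boolean; rewrite: (u: URL) => string }>,
): string {
  const url = new URL(raw);
  const rule = rules.find((r) => r.match(url));
  return rule ? rule.rewrite(url) : raw;
}
```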

## URL filtering

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `includePatterns` | `string[]` | `[]` | Regex patterns; the URL must match at least one |
| `excludePatterns` | `string[]` | `[]` | Regex patterns; the URL must not match any |
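The combined semantics can be sketched as follows (`passesFilters` is a hypothetical helper showing the documented behavior):

```typescript
// Sketch: a URL survives filtering when it matches at least one include
// pattern (if any are given) and matches no exclude pattern.
function passesFilters(
  url: string,
  includePatterns: string[] = [],
  excludePatterns: string[] = [],
): boolean {
  const matches = (p: string) => new RegExp(p).test(url);
  if (includePatterns.length > 0 && !includePatterns.some(matches)) return false;
  return !excludePatterns.some(matches);
}
```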

## Debugging

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `verbose` | `boolean` | `false` | Enable Pino logging |
| `showChrome` | `boolean` | `false` | Show the browser window |

## Where to go next

- **ScrapeResult**: the return type for every scrape call.
- **Content Extraction**: how the cleaning options behave.