Formats
Pass formats on the request to choose what comes back:
| Value | What you get |
|---|---|
| "markdown" (default) | Clean, structured markdown. Headings, lists, links, code blocks preserved. Best for LLMs and RAG. |
| "html" | The cleaned HTML Reader used as the source for markdown conversion. Useful when you need to run your own DOM parsing. |

rawHtml: the unprocessed HTML exactly as the browser rendered it, before any cleaning or content extraction. This is always returned regardless of formats.
You can request both:
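As a minimal sketch, a request body asking for both formats could be built like this. The url field name and the list shape of formats are assumptions for illustration; only the "markdown" and "html" values come from the table above. The helper name is hypothetical.

```python
import json

# Format values documented in the table above.
ALLOWED_FORMATS = {"markdown", "html"}


def build_scrape_request(url: str, formats: list[str]) -> dict:
    """Build a JSON-serializable request body.

    Assumption: the request takes a "url" field and "formats" as a list.
    """
    unknown = [f for f in formats if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")
    return {"url": url, "formats": formats}


body = build_scrape_request("https://example.com/article", ["markdown", "html"])
print(json.dumps(body, indent=2))
```

Validating the values locally is a design choice for this sketch: it catches a typo such as "htm" before the request is sent.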
Main content extraction
By default, Reader strips away navigation, footers, sidebars, cookie banners, newsletter pop-ups, and other boilerplate, keeping just the article body. This is the onlyMainContent: true default.
Turn it off when you need the whole page:
- You want the nav bar’s links (e.g., to find related pages)
- You’re scraping a landing page where there is no “article”
- You’re debugging why something was stripped
Leave it on for:
- LLM pipelines: boilerplate is noise and wasted tokens
- RAG indexing: you don’t want “Cookie Settings” matching your user’s query
- Clean markdown output for humans
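For the whole-page cases, a request that turns main content extraction off might look like the sketch below. The url field name and the helper are assumptions for illustration; onlyMainContent and its true default are the parts taken from this page.

```python
import json


def build_request(url: str, only_main_content: bool = True) -> dict:
    """Build a request body, sending onlyMainContent only when it is turned off.

    Assumption: omitting the field leaves the documented default (true) in effect.
    """
    body = {"url": url}
    if not only_main_content:
        body["onlyMainContent"] = False
    return body


# Keep navigation, footers, and sidebars, e.g. to collect the nav bar's links.
full_page = build_request("https://example.com/", only_main_content=False)
print(json.dumps(full_page))
```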
Include and exclude selectors
For finer control, give Reader a list of CSS selectors to keep or drop. These compose with onlyMainContent.
- includeTags: keep only content matching these selectors. Everything else is dropped.
- excludeTags: drop anything matching these, keep the rest.
includeTags runs first, then excludeTags trims what remains.
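A request combining both lists could be sketched as follows. The selectors and the url field name are illustrative assumptions; includeTags and excludeTags are the documented names. Following the order described above, the article element is kept first, then the share buttons and asides inside it are trimmed.

```python
import json

request_body = {
    "url": "https://example.com/blog/post",  # assumed field name
    # Applied first: keep only content inside <article>.
    "includeTags": ["article"],
    # Applied second: trim these from what remains.
    "excludeTags": ["article .share-buttons", "article aside"],
}

print(json.dumps(request_body, indent=2))
```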
Wait for a selector
Dynamic pages sometimes render the real content a moment after the initial load, for example a product grid that’s hydrated from JSON. Tell Reader to wait for a specific selector before capturing. Reader waits until the selector appears or timeoutMs is hit, whichever comes first.
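The sketch below shows the shape such a request might take. The name of the wait option is not given on this page, so waitForSelector here is a hypothetical placeholder, as are the url field, the helper name, and the 15000 ms value; only timeoutMs appears in the text above. Check the API reference for the real parameter name.

```python
import json


def build_wait_request(url: str, selector: str, timeout_ms: int = 15000) -> dict:
    """Build a request body that waits for a selector before capture.

    "waitForSelector" is a HYPOTHETICAL field name used for illustration only.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    return {
        "url": url,                    # assumed field name
        "waitForSelector": selector,   # hypothetical field name
        "timeoutMs": timeout_ms,       # upper bound on the wait
    }


body = build_wait_request("https://example.com/shop", ".product-grid .product-card")
print(json.dumps(body, indent=2))
```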
What’s in metadata
Every scrape result includes metadata about the page and the request:
- title, description: extracted from <title>, <meta> tags, or Open Graph data
- statusCode: what the target site returned
- duration: how long Reader spent on the request, in ms
- cached: whether the content was served from Reader’s cache
- proxyMode, proxyEscalated: see Proxy modes
- scrapedAt: when the content was captured
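As a rough illustration of reading these fields, the helper below turns a metadata object into a one-line log entry. The sample values are made up, and the value types (numeric statusCode and duration, boolean cached) are assumptions; only the field names come from the list above.

```python
def summarize(metadata: dict) -> str:
    """Format a short log line from a scrape result's metadata."""
    source = "cache" if metadata.get("cached") else "live fetch"
    title = metadata.get("title", "(untitled)")
    return (
        f"{title}: HTTP {metadata.get('statusCode')} "
        f"in {metadata.get('duration')} ms ({source})"
    )


# Illustrative sample, not real Reader output.
sample = {
    "title": "Example Article",
    "statusCode": 200,
    "duration": 840,
    "cached": False,
}

print(summarize(sample))  # Example Article: HTTP 200 in 840 ms (live fetch)
```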
Next
- Proxy modes: how Reader decides to fetch a page
- Caching: when to reuse a previous result

