Default behavior
With `onlyMainContent: true` (the default), Reader runs a multi-step extraction:
- Find the main content container, in priority order:
  - `<main>` element
  - `[role="main"]` attribute
  - Single `<article>` element
  - Common content IDs and classes: `#content`, `.post-content`, `.article-body`, etc.
  - Largest text block (fallback heuristic)
- Remove navigation chrome (if no main content container was found):
  - `<nav>`, `<header>`, `<footer>`, `<aside>`
  - Sidebars, menus, breadcrumbs
  - Social sharing widgets, comment sections
  - Newsletter forms, cookie banners
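The conditional nature of this step can be sketched as follows (a simplified model in which elements are plain tag names, not the real DOM pass):

```python
NAV_CHROME_TAGS = {"nav", "header", "footer", "aside"}

def strip_chrome(tags: list[str], main_found: bool) -> list[str]:
    """Chrome is stripped only when no main content container was found."""
    if main_found:
        return tags
    return [t for t in tags if t not in NAV_CHROME_TAGS]
```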
- Always remove (regardless of mode):
  - `<script>`, `<style>`, `<noscript>`, `<template>`
  - Hidden elements (`display: none`, `visibility: hidden`)
  - Overlays, modals, popups
  - Fixed and sticky positioned elements
  - Ad selectors and tracking pixels
  - Base64 inline images (unless `removeBase64Images: false`)
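The unconditional rules can be modeled as a predicate over a tag name and its computed style (a sketch, not the actual implementation; overlay and ad detection are omitted):

```python
ALWAYS_REMOVE_TAGS = {"script", "style", "noscript", "template"}

def always_removed(tag: str, style: dict[str, str]) -> bool:
    """True if an element is dropped regardless of onlyMainContent."""
    if tag in ALWAYS_REMOVE_TAGS:
        return True
    if style.get("display") == "none" or style.get("visibility") == "hidden":
        return True
    if style.get("position") in {"fixed", "sticky"}:
        return True
    return False
```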
- Resolve responsive images: `srcset` attributes are parsed and the highest-resolution image is kept in the output.
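A minimal sketch of `srcset` resolution, assuming candidates are compared by their width (`w`) or density (`x`) descriptors; a production parser would treat the two units separately per the HTML spec:

```python
def pick_highest_resolution(srcset: str) -> str:
    """Return the srcset candidate URL with the largest descriptor.
    A bare URL with no descriptor counts as 1x."""
    best_url, best_score = "", -1.0
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue
        url, score = parts[0], 1.0
        if len(parts) > 1 and parts[1][-1] in ("w", "x"):
            score = float(parts[1][:-1])
        if score > best_score:
            best_url, best_score = url, score
    return best_url
```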
Disabling main content extraction
For full-page capture (including nav, header, footer), set `onlyMainContent: false` when:
- You’re scraping a landing page where the “main content” is the whole page
- You need to extract links from navigation
- You’re debugging and want to see what Reader sees before cleaning
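A request sketch for full-page capture. The endpoint URL and payload shape are assumptions; only `onlyMainContent` comes from this doc, so check the ScrapeOptions reference for the real field names:

```python
import json
import urllib.request

# Hypothetical endpoint -- replace with your deployment's real API URL.
payload = {
    "url": "https://example.com/landing",
    "onlyMainContent": False,  # keep nav, header, footer in the output
}
request = urllib.request.Request(
    "https://api.example.com/v1/scrape",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(request)  # enable against a real endpoint
```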
Tag filtering with CSS selectors
For fine-grained control, use `includeTags` and `excludeTags`:
- `includeTags`: keep only elements matching these selectors; everything else is removed.
- `excludeTags`: remove elements matching these selectors; everything else is kept.
You can use `includeTags` and `excludeTags` together. Include runs first, then exclude.
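The include-then-exclude ordering can be sketched as a simplified model where elements are plain selector strings (the real filtering runs against the DOM):

```python
def filter_elements(elements, include_tags=None, exclude_tags=None):
    """Include runs first (whitelist), then exclude (blacklist)."""
    kept = list(elements)
    if include_tags:
        kept = [e for e in kept if e in include_tags]
    if exclude_tags:
        kept = [e for e in kept if e not in exclude_tags]
    return kept
```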
Why Reader is conservative with nav/header/footer
Many modern sites put actual content inside `<nav>` or `<header>` elements. For example, a docs sidebar lives in `<nav>` but contains links you want to crawl. Reader's extraction is deliberately conservative by default: if a `<main>` or `<article>` container is found, the surrounding chrome is kept (because it might contain content you need) and is stripped only if no main container was detected.
If you want aggressive stripping for a site where the `<main>` detection isn't kicking in, use `excludeTags` explicitly:
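For example, with hypothetical option values (only `excludeTags` and `onlyMainContent` are named in this doc; the selector list and payload shape are illustrative):

```python
# Aggressively strip chrome even though a main container wasn't detected.
options = {
    "url": "https://example.com/docs/page",
    "onlyMainContent": True,
    "excludeTags": ["nav", "header", "footer", "aside"],
}
```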
Markdown conversion
After extraction, the cleaned HTML is converted to markdown via supermarkdown, a Rust-backed converter optimized for LLM input. It handles:
- Headings, lists, tables
- Code blocks with language detection
- Links and images (resolved to absolute URLs)
- Blockquotes and inline formatting
- Unicode and emoji preservation
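Resolving links and images to absolute URLs follows standard relative-reference semantics; Python's stdlib shows the behavior (the URLs here are illustrative):

```python
from urllib.parse import urljoin

page_url = "https://example.com/blog/post/"
# A relative href in the page body...
relative = "../images/photo.png"
# ...is resolved against the page URL before the markdown is emitted.
absolute = urljoin(page_url, relative)
```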
Where to go next
- Scraping - back to the full scrape pipeline.
- ScrapeOptions reference - all content extraction options listed.

