Content Extraction

Reader automatically extracts the main content from web pages, removing navigation, headers, footers, ads, and other non-content elements.

How It Works

By default (onlyMainContent: true), Reader uses a multi-step algorithm:

1. Find Main Content Container

Reader looks for main content in this order:

<main> element
[role="main"] attribute
Single <article> element
Common content IDs/classes (#content, .post-content, etc.)
Largest text block (fallback heuristic)
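The lookup order above can be read as a first-match search over a prioritized selector list. The sketch below illustrates that idea only; it is not Reader's actual implementation. The names `findMainSelector`, `QueryFn`, and `Candidate` are hypothetical, `#content` and `.post-content` stand in for the wider "common IDs/classes" group, and the largest-text-block fallback is left to the caller.

```typescript
// Hypothetical sketch of the priority order; not Reader's real code.
// A QueryFn reports how many elements on the page match a selector.
type QueryFn = (selector: string) => number;

interface Candidate {
  selector: string;
  requireSingle?: boolean; // accept only when exactly one element matches
}

// Order mirrors the list above.
const CANDIDATES: Candidate[] = [
  { selector: "main" },
  { selector: '[role="main"]' },
  { selector: "article", requireSingle: true },
  { selector: "#content" },
  { selector: ".post-content" },
];

// Returns the first selector that qualifies, or null so the caller can
// fall back to a largest-text-block heuristic.
function findMainSelector(count: QueryFn): string | null {
  for (const { selector, requireSingle } of CANDIDATES) {
    const matches = count(selector);
    if (requireSingle ? matches === 1 : matches > 0) return selector;
  }
  return null;
}

// Example: a listing page with three <article> cards inside a #content wrapper.
const counts: Record<string, number> = { article: 3, "#content": 1 };
const picked = findMainSelector((s) => counts[s] ?? 0);
console.log(picked); // #content
```

Because the page has more than one `<article>`, the "single article" rule does not apply and the search continues to the next candidate.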

2. Remove Navigation Chrome

If no main content container is found, Reader removes:

<nav>, <header>, <footer>, <aside>
Sidebars, menus, breadcrumbs
Social sharing, comments sections
Newsletter forms, cookie banners

3. Always Remove

Regardless of mode, Reader always removes:

Scripts, styles, noscript, templates
Hidden elements
Overlays, modals, popups
Cookie consent banners
Fixed/sticky positioned elements
Ads and tracking pixels
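The two removal passes can be pictured as a single recursive prune over the page tree. The sketch below is a simplified model under stated assumptions, not Reader's implementation: `PageNode`, `prune`, and the two tag lists are hypothetical, and only tag names and a `hidden` flag are modeled (no overlays, ads, or class-based matching).

```typescript
// Hypothetical, simplified page model: a tag name, an optional hidden flag,
// and child nodes.
interface PageNode {
  tag: string;
  hidden?: boolean;
  children?: PageNode[];
}

// Removed in every mode (a subset of the "Always Remove" list).
const ALWAYS_REMOVE = ["script", "style", "noscript", "template"];
// Removed only when no main content container was found.
const CHROME = ["nav", "header", "footer", "aside"];

// Returns a pruned copy of the tree, or null if the node itself is dropped.
function prune(node: PageNode, removeChrome: boolean): PageNode | null {
  if (ALWAYS_REMOVE.indexOf(node.tag) !== -1 || node.hidden) return null;
  if (removeChrome && CHROME.indexOf(node.tag) !== -1) return null;
  const children = (node.children ?? [])
    .map((child) => prune(child, removeChrome))
    .filter((child): child is PageNode => child !== null);
  return { ...node, children };
}

const samplePage: PageNode = {
  tag: "body",
  children: [
    { tag: "nav" },
    { tag: "script" },
    { tag: "div", children: [{ tag: "p" }, { tag: "span", hidden: true }] },
    { tag: "footer" },
  ],
};

// Lists the tags of the top-level children that survive pruning.
const topLevelTags = (n: PageNode | null): string[] =>
  (n?.children ?? []).map((c) => c.tag);

console.log(JSON.stringify(topLevelTags(prune(samplePage, true))));  // ["div"]
console.log(JSON.stringify(topLevelTags(prune(samplePage, false)))); // ["nav","div","footer"]
```

In both calls the `<script>` and the hidden `<span>` are dropped; only the first call also strips the navigation chrome.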

Controlling Extraction

Disable Main Content Extraction

To capture the full page, including nav, header, and footer, set onlyMainContent to false:

const result = await reader.scrape({
  urls: ["https://example.com"],
  onlyMainContent: false,
});

Include Specific Elements

Keep only specific elements using CSS selectors:

const result = await reader.scrape({
  urls: ["https://example.com"],
  includeTags: [".article-content", "#main"],
});

Exclude Specific Elements

Remove specific elements using CSS selectors:

const result = await reader.scrape({
  urls: ["https://example.com"],
  excludeTags: [".comments", ".related-posts", ".sidebar"],
});

Combine Include and Exclude

const result = await reader.scrape({
  urls: ["https://example.com"],
  includeTags: [".article"],
  excludeTags: [".article-comments", ".article-share"],
});
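One plausible reading of how the two options combine is that `includeTags` first narrows the page to the matching subtrees, and `excludeTags` then prunes within them. The sketch below illustrates that reading on a toy tree. It is an assumption rather than documented behavior: the `El` model and the `matchesAny`, `pickIncluded`, `collectText`, and `extract` helpers are hypothetical, and only simple `.class` selectors are handled.

```typescript
// Hypothetical model and semantics; Reader's real behavior may differ.
interface El {
  classes: string[];
  text?: string;
  children?: El[];
}

// Only simple ".class" selectors are handled in this sketch.
function matchesAny(el: El, selectors: string[]): boolean {
  return selectors.some((s) => el.classes.indexOf(s.replace(/^\./, "")) !== -1);
}

// Collect the outermost subtrees that match an include selector.
function pickIncluded(el: El, include: string[], out: El[] = []): El[] {
  if (matchesAny(el, include)) {
    out.push(el);
  } else {
    for (const child of el.children ?? []) pickIncluded(child, include, out);
  }
  return out;
}

// Gather text in document order, skipping any excluded subtree.
function collectText(el: El, exclude: string[], out: string[] = []): string[] {
  if (matchesAny(el, exclude)) return out;
  if (el.text) out.push(el.text);
  for (const child of el.children ?? []) collectText(child, exclude, out);
  return out;
}

// Include first, then exclude inside what was included.
function extract(root: El, include: string[], exclude: string[]): string[] {
  const out: string[] = [];
  for (const el of pickIncluded(root, include)) collectText(el, exclude, out);
  return out;
}

const doc: El = {
  classes: ["page"],
  children: [
    { classes: ["sidebar"], text: "Menu" },
    {
      classes: ["article"],
      text: "Body",
      children: [
        { classes: ["article-share"], text: "Share" },
        { classes: ["article-comments"], text: "Comments" },
      ],
    },
  ],
};

const kept = extract(doc, [".article"], [".article-comments", ".article-share"]);
console.log(JSON.stringify(kept)); // ["Body"]
```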

CLI Options

# Disable main content extraction
npx reader scrape https://example.com --no-main-content

# Include specific elements
npx reader scrape https://example.com --include-tags ".article,.content"

# Exclude specific elements
npx reader scrape https://example.com --exclude-tags ".comments,.sidebar"
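The CLI flags line up with the `onlyMainContent`, `includeTags`, and `excludeTags` options shown earlier. The sketch below shows one way that mapping could work, assuming the comma-separated values are split into selector arrays. `CliFlags`, `splitSelectors`, and `toExtractionOptions` are hypothetical names for illustration, and the real CLI's parsing may differ (for example, for selectors that themselves contain commas).

```typescript
// Hypothetical helpers; the real CLI's parsing may differ.
interface CliFlags {
  mainContent?: boolean; // false when --no-main-content is passed
  includeTags?: string;  // raw value of --include-tags
  excludeTags?: string;  // raw value of --exclude-tags
}

interface ExtractionOptions {
  onlyMainContent: boolean;
  includeTags: string[];
  excludeTags: string[];
}

// Split "a, b,c" into ["a", "b", "c"], dropping empty entries.
function splitSelectors(value: string | undefined): string[] {
  return (value ?? "")
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

function toExtractionOptions(flags: CliFlags): ExtractionOptions {
  return {
    onlyMainContent: flags.mainContent !== false, // defaults to true
    includeTags: splitSelectors(flags.includeTags),
    excludeTags: splitSelectors(flags.excludeTags),
  };
}

const cliOptions = toExtractionOptions({
  mainContent: false,
  excludeTags: ".comments, .sidebar",
});
console.log(JSON.stringify(cliOptions));
// {"onlyMainContent":false,"includeTags":[],"excludeTags":[".comments",".sidebar"]}
```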

HTML to Markdown

Reader converts HTML to Markdown with supermarkdown, a high-performance Rust library with full GFM support.

Supported Elements

Element	Markdown Output
Headings	`# H1`, `## H2`, etc.
Paragraphs	Plain text separated by blank lines
Lists	`-` or `1.`
Links	`[text](url)`
Images	`![alt](src)`
Code	`inline` or fenced blocks
Tables	GFM table syntax
Blockquotes	`> quoted text`
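To make the "GFM table syntax" row concrete, the sketch below renders a header and rows in that syntax. It only illustrates the target format and is not how supermarkdown works internally; `escapeCell` and `toGfmTable` are hypothetical helpers.

```typescript
// Hypothetical helpers that show the GFM table format; not supermarkdown's code.

// Pipes would end a cell early and newlines would end the row, so neutralize both.
function escapeCell(text: string): string {
  return text.replace(/\|/g, "\\|").replace(/\n/g, " ");
}

function toGfmTable(headers: string[], rows: string[][]): string {
  const line = (cells: string[]): string =>
    "| " + cells.map(escapeCell).join(" | ") + " |";
  // The delimiter row needs one "---" entry per header column.
  const divider = "| " + headers.map(() => "---").join(" | ") + " |";
  return [line(headers), divider].concat(rows.map((r) => line(r))).join("\n");
}

const table = toGfmTable(["Element", "Markdown"], [["Links", "[text](url)"]]);
console.log(table);
// | Element | Markdown |
// | --- | --- |
// | Links | [text](url) |
```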

Examples

Blog Post

// Extract just the article content
const result = await reader.scrape({
  urls: ["https://blog.example.com/post"],
  includeTags: ["article", ".post-content"],
  excludeTags: [".author-bio", ".related-posts"],
});

Documentation

// Keep the sidebar for navigation context; drop the top nav, footer, and banner
const result = await reader.scrape({
  urls: ["https://docs.example.com/guide"],
  onlyMainContent: false,
  excludeTags: ["nav", "footer", ".announcement-banner"],
});

E-commerce Product

// Extract product details only
const result = await reader.scrape({
  urls: ["https://shop.example.com/product"],
  includeTags: [".product-details", ".product-description"],
  excludeTags: [".reviews", ".recommendations"],
});

Next Steps

Basic Scraping

ScrapeOptions
