Why not ask Reader for JSON directly?
We thought about building structured extraction into Reader. We chose not to for three reasons:
- The shape you want is yours, not ours. Every use case has different fields. A product extractor needs price, title, SKU; a recipe extractor needs ingredients, steps, times; a job listing extractor needs title, company, location, salary. A single “structured extraction” feature in Reader would have to cover every shape and still fall short.
- LLMs are already great at this. A one-shot prompt with a JSON schema gets you 95% of the way there on most content.
- It keeps Reader focused. Reader does one thing well: web content to clean markdown.
The minimal extractor
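A minimal sketch of the idea, assuming you already have the page markdown from Reader as a string. The Reader request itself is not shown, and `call_llm` is a hypothetical callable standing in for whichever model client you use; `PRODUCT_SCHEMA`, `build_prompt`, `parse_json_reply`, and `extract` are illustrative names, not part of any API.

```python
import json
from typing import Callable

FENCE = "`" * 3  # markdown code fence marker, built this way to keep the example readable

# Hypothetical schema for illustration; swap in your own fields.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "sku": {"type": "string"},
    },
    "required": ["title", "price"],
}


def build_prompt(markdown: str, schema: dict) -> str:
    """Build a one-shot extraction prompt from page markdown and a JSON schema."""
    return (
        "Extract the fields described by this JSON schema from the page below.\n"
        "Reply with a single JSON object and nothing else.\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"Page:\n{markdown}"
    )


def parse_json_reply(reply: str) -> dict:
    """Parse the model reply, tolerating a markdown code fence around the JSON."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, then everything from the closing fence on.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)


def extract(markdown: str, schema: dict, call_llm: Callable[[str], str]) -> dict:
    """Run one extraction: prompt the model, then parse its JSON reply."""
    return parse_json_reply(call_llm(build_prompt(markdown, schema)))
```

Passing the model call in as a function keeps the sketch independent of any one provider and makes it easy to test with a canned reply.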
Tips for reliable extraction
- Use tool calling / structured output mode if your model supports it. Claude’s tool use and OpenAI’s Structured Outputs let you pass the schema directly and get conforming JSON back instead of parsing markdown-wrapped JSON yourself. OpenAI’s plain JSON mode only guarantees syntactically valid JSON, not conformance to your schema.
- Validate with a schema library. zod, pydantic, or similar catches hallucinated fields before they contaminate your database.
- Keep the field list short. A prompt asking for 30 fields is less reliable than three calls asking for 10 each.
- Include examples if accuracy matters. Two or three “here’s a page, here’s the right extraction” pairs in the prompt dramatically improve consistency.
- Pin the model version. Extraction behavior changes across model releases. Lock the version for stable pipelines.
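To illustrate the validation tip, here is a standard-library stand-in for what a schema library such as pydantic or zod would do for you. The field names and the `validate` helper are hypothetical; in a real pipeline you would more likely declare a model in your schema library and let it reject unknown or mistyped fields.

```python
# Expected fields and their Python types; a hypothetical product shape.
EXPECTED = {"title": str, "price": (int, float), "sku": str}
REQUIRED = {"title", "price"}


def validate(record: dict) -> list[str]:
    """Return a sorted list of problems; an empty list means the record is acceptable."""
    problems = []
    # Fields the model invented that the schema does not know about.
    for key in set(record) - set(EXPECTED):
        problems.append(f"unexpected field: {key}")
    # Required fields the model left out.
    for key in REQUIRED - set(record):
        problems.append(f"missing field: {key}")
    # Type checks; bool is rejected explicitly because it is a subclass of int.
    for key, expected_type in EXPECTED.items():
        if key in record:
            value = record[key]
            if isinstance(value, bool) or not isinstance(value, expected_type):
                problems.append(f"wrong type for {key}")
    return sorted(problems)
```

Rejecting a record with any reported problem, rather than silently dropping the extra keys, makes hallucinated fields visible before they reach storage.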
Extracting from structured metadata
Before reaching for an LLM, check if the page has structured metadata you can read directly. Reader preserves JSON-LD and OpenGraph tags in the page, but they’re not surfaced separately; you’d parse them yourself if present. For many e-commerce and article pages, JSON-LD (<script type="application/ld+json">) contains the exact fields you want with no ambiguity.
Request formats: ["markdown", "html"] and parse JSON-LD out of the HTML for the deterministic path, falling back to LLM extraction when it’s not present.
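The deterministic path with an LLM fallback could look roughly like this, using only Python’s standard library. It assumes you already hold the HTML and markdown strings from the response (how you read them out of the response is not shown), and `llm_extract` is a hypothetical callable for your LLM-based extractor. The `Product` type check is just one example of a JSON-LD type you might look for.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Malformed JSON-LD: skip it so the caller can fall back.


def extract_json_ld(html: str) -> list:
    """Return every JSON-LD value found in the HTML, in document order."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items


def extract_product(html: str, markdown: str, llm_extract):
    """Prefer a JSON-LD Product object; fall back to the LLM path otherwise."""
    for item in extract_json_ld(html):
        if isinstance(item, dict) and item.get("@type") == "Product":
            return item
    return llm_extract(markdown)
```

Real pages sometimes nest entities under `@graph` or use a list for `@type`; this sketch ignores those cases and simply falls through to the LLM.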

