The problem with assumptions
Most website optimization assumes a visitor fetches your page via HTTP, sees the headers, follows redirects, and renders HTML. AI agents break every one of these assumptions.
ChatGPT
ChatGPT's browsing tool fetches pages live via HTTP, but the model never sees the raw response:
- Text extraction only: HTML is stripped to roughly 4,096 tokens of plain text before the model sees it
- No headers: the model never sees the Content-Type, status code, or redirects
- SearchGPT intermediate: a secondary model checks for prompt injection before content reaches the main model
- Agent Mode: uses a fake Chrome UA (`Chrome/138.0.0.0`) and identifies itself via RFC 9421 cryptographic signatures, not the User-Agent header
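To get a feel for what survives that kind of extraction, here is a minimal sketch that strips a page to plain text and truncates it. It is not ChatGPT's actual pipeline: the `TextExtractor` class and `extract_text` function are hypothetical names, and whitespace-separated words stand in for real tokenizer tokens.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips script/style/noscript contents."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, max_tokens=4096):
    """Strip markup and keep at most max_tokens words (a rough token proxy)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.chunks).split()
    return " ".join(words[:max_tokens])
```

Running your own pages through something like this shows quickly whether the important content appears early enough to survive truncation.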
What this means: Content negotiation works silently (the tool layer handles it), but the model only sees the extracted text. Serve clean, structured text and your content will be more useful to ChatGPT.
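On the server side, content negotiation can be as simple as checking whether the client explicitly asks for markdown. The sketch below is one possible approach, not a full implementation of HTTP Accept semantics: `pick_representation` is a hypothetical helper, and it ignores wildcards and relative q-value ordering.

```python
def pick_representation(accept_header):
    """Return "text/markdown" only if the client lists it with q > 0."""
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if fields[0].lower() != "text/markdown":
            continue
        quality = 1.0  # default when no q parameter is given
        for field in fields[1:]:
            if field.lower().startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as a refusal
        if quality > 0:
            return "text/markdown"
    # Browsers and most fetchers fall through to HTML.
    return "text/html"
```

Defaulting to HTML keeps ordinary browsers unaffected, while tool layers that send `Accept: text/markdown` get the cleaner representation.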
Perplexity
Perplexity uses a multi-stage retrieval pipeline:
- Stealth crawlers: 3-6 million requests/day with generic Chrome UAs and rotating IPs, not `PerplexityBot`
- Hybrid ranking: BM25 keyword matching + vector similarity to find relevant passages
- Atomic span retrieval: extracts specific text spans rather than full pages
- Separate index: maintains its own crawled index alongside web search results
What this means: Your robots.txt rules for PerplexityBot may not stop their stealth crawlers. Structured content with clear headings helps their span extraction find the right passages.
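To illustrate why clearly worded passages rank well, here is a toy hybrid ranker that blends BM25 with a similarity score. It is a sketch under stated assumptions, not Perplexity's implementation: the function names and the `alpha` weight are hypothetical, and cosine similarity over term counts stands in for real embedding vectors. It assumes a non-empty passage list.

```python
import math
from collections import Counter

PUNCTUATION = ".,:;!?()"


def tokenize(text):
    """Lowercase words with surrounding punctuation removed."""
    words = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in words if word]


def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Score each passage against the query with the BM25 formula."""
    docs = [tokenize(p) for p in passages]
    n = len(docs)
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query, passages):
    """Stand-in for embedding similarity: cosine over term-count vectors."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(c * c for c in q.values()))
    scores = []
    for passage in passages:
        v = Counter(tokenize(passage))
        v_norm = math.sqrt(sum(c * c for c in v.values()))
        dot = sum(q[t] * v[t] for t in q)
        scores.append(dot / (q_norm * v_norm) if q_norm and v_norm else 0.0)
    return scores


def hybrid_rank(query, passages, alpha=0.5):
    """Return passage indices ordered by a weighted blend of both scores."""
    bm25 = bm25_scores(query, passages)
    top = max(bm25) or 1.0  # scale BM25 into the 0..1 range
    cos = cosine_scores(query, passages)
    blended = [alpha * s / top + (1 - alpha) * c for s, c in zip(bm25, cos)]
    return sorted(range(len(passages)), key=lambda i: blended[i], reverse=True)
```

A passage that states its topic in plain words scores on both signals, which is the practical argument for headings and self-contained paragraphs.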
Gemini
Gemini's most common browsing mode never hits your server at all:
- Index-based: `url_context` reads from Google's internal index, not live HTTP. When tested, no request appeared in server logs
- Screenshot-based: Project Mariner renders the page visually for tasks that need it
- Rejected markdown: Gemini CLI rejected `Accept: text/markdown` responses in early testing
What this means: Your site needs to be indexed by Googlebot for Gemini to see it. Adding in your HTML ensures Google indexes the llms.txt relationship. JSON-LD structured data also survives the indexing pipeline.
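For the JSON-LD part, a small helper can generate the script tag at build time. This is a minimal sketch: `json_ld_script` is a hypothetical function, and the choice of the schema.org `Article` type with these three properties is an assumption to adapt to your own content.

```python
import json


def json_ld_script(headline, description, url):
    """Build a JSON-LD <script> tag describing a page as a schema.org Article."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "description": description,
        "url": url,
    }
    # Escape "</" so user-supplied text cannot close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'
```

The returned string can be placed in the page's `<head>`; the escaped form is still valid JSON, so parsers read the original text back unchanged.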
What to do about it
| Action | Helps with |
|--------|------------|
| Serve llms.txt with clean markdown | ChatGPT, Perplexity |
| Add | Gemini (via Google index) |
| Add JSON-LD structured data | Gemini (via Google index) |
| Don't block Google-Extended in robots.txt | Gemini |
| Use RFC 9421 signatures for bot auth | ChatGPT Agent Mode verification |
| Serve structured content with clear headings | Perplexity span extraction |
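The robots.txt row is easy to verify automatically. The sketch below uses Python's standard `urllib.robotparser` to list which agents a robots.txt body blocks; `blocked_agents` is a hypothetical helper, and the agent names passed in are examples. As noted above, this only tells you about crawlers that honor robots.txt.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt, agents, url="/"):
    """Return the agents that the given robots.txt text disallows for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    rules = "User-agent: Google-Extended\nDisallow: /\n"
    # Flags Google-Extended, which would cut the site off from Gemini.
    print(blocked_agents(rules, ["Google-Extended", "GPTBot"]))
```

Running a check like this in CI catches an accidental `Disallow` before it reaches production.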