How AI Agents Actually Browse the Web

The problem with assumptions

Most website optimization assumes that a visitor fetches your page over HTTP, sees the headers, follows redirects, and renders the HTML. AI agents break at least one of these assumptions, and each agent breaks them differently.

ChatGPT

ChatGPT's browsing tool fetches pages live via HTTP, but the model never sees the raw response:

Text extraction only: HTML is stripped to ~4,096 tokens of plain text before the model sees it
No headers: the model never sees Content-Type, status codes, or redirects
SearchGPT intermediary: a secondary model checks for prompt injection before content reaches the main model
Signed identity in Agent Mode: Agent Mode sends a generic Chrome User-Agent (Chrome/138.0.0.0) and identifies itself with RFC 9421 HTTP Message Signatures rather than through the User-Agent header

What this means: Content negotiation works silently (the tool layer handles it), but the model only sees the extracted text. Serve clean, structured text and your content will be more useful to ChatGPT.
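To get a feel for what survives this kind of extraction, the following is a minimal sketch that strips a page to plain text and truncates it to a budget. It is not ChatGPT's actual extractor: the names `TextExtractor` and `extract_text` are made up for illustration, and whitespace-separated words stand in for tokens, which a real tokenizer would count differently.

```python
from html.parser import HTMLParser

# Tags whose contents are not treated as readable text in this sketch.
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring markup and the tags listed above."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str, max_tokens: int = 4096) -> str:
    """Strip HTML to plain text and cut it off at a rough token budget.

    Assumption: one whitespace-separated word counts as one token.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.chunks).split()
    return " ".join(words[:max_tokens])
```

Running your own pages through something like this shows quickly whether the important content sits near the top of the extracted text or falls past the cutoff behind navigation and boilerplate.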

Perplexity

Perplexity uses a multi-stage retrieval pipeline:

Stealth crawlers: 3-6 million requests/day sent with generic Chrome UAs and rotating IPs rather than the declared PerplexityBot user agent
Hybrid ranking: BM25 keyword matching combined with vector similarity to find relevant passages
Atomic span retrieval: extracts specific text spans rather than full pages
Separate index: maintains its own crawled index alongside web search results

What this means: Your robots.txt rules for PerplexityBot may not stop their stealth crawlers. Structured content with clear headings helps their span extraction find the right passages.
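The hybrid ranking idea can be illustrated with a small sketch that scores passages with BM25 and blends that score with a cosine similarity. This is not Perplexity's implementation. The function names, the `alpha` weight, and the parameter defaults are assumptions, and the cosine similarity here runs over simple word counts as a stand-in for learned embedding vectors.

```python
import math
from collections import Counter

PUNCT = ".,:;!?()\"'"


def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]


def bm25_scores(query: str, passages: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each passage against the query with the BM25 formula."""
    docs = [tokenize(p) for p in passages]
    if not docs:
        return []
    n = len(docs)
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    # Document frequency: in how many passages each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query: str, passages: list[str]) -> list[float]:
    """Cosine similarity over word counts (a stand-in for embeddings)."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    out = []
    for p in passages:
        d = Counter(tokenize(p))
        d_norm = math.sqrt(sum(v * v for v in d.values()))
        dot = sum(q[t] * d[t] for t in q)
        out.append(dot / (q_norm * d_norm) if q_norm and d_norm else 0.0)
    return out


def hybrid_rank(query: str, passages: list[str], alpha: float = 0.5) -> list[int]:
    """Return passage indices, best first, by a weighted blend of both scores."""
    bm25 = bm25_scores(query, passages)
    top = max(bm25, default=0.0) or 1.0  # scale BM25 into the 0..1 range
    cos = cosine_scores(query, passages)
    combined = [alpha * (s / top) + (1 - alpha) * c for s, c in zip(bm25, cos)]
    return sorted(range(len(passages)), key=lambda i: combined[i], reverse=True)
```

Because scoring happens per passage rather than per page, a short, self-contained section under a descriptive heading has a better chance of matching a query than the same facts scattered across a long page.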

Gemini

Gemini's most common browsing mode never hits your server at all:

Index-based: url_context reads from Google's internal index, not live HTTP. When tested, no request appeared in server logs.
Screenshot-based: Project Mariner renders the page visually for tasks that need it
Rejected markdown: in early testing, Gemini CLI rejected markdown responses negotiated via Accept: text/markdown

What this means: Your site needs to be indexed by Googlebot for Gemini to see it. Adding <link rel="alternate" href="/llms.txt"> to your HTML gives Google's indexer an explicit pointer to your llms.txt. JSON-LD structured data also survives the indexing pipeline.
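As a rough illustration of the two tags mentioned above, this sketch builds a snippet for a page's head: the llms.txt link plus a JSON-LD block. The function name `head_snippet`, the choice of the schema.org `Article` type, and the selected fields are assumptions made for the example; pick the schema.org type that actually matches your content.

```python
import json
from html import escape


def head_snippet(title: str, description: str, url: str,
                 llms_path: str = "/llms.txt") -> str:
    """Return HTML for <head>: an llms.txt link plus a JSON-LD block."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",  # assumed type, chosen for illustration
        "headline": title,
        "description": description,
        "url": url,
    }
    # Escape "<" so no value can close the script element early.
    payload = json.dumps(data, indent=2).replace("<", "\\u003c")
    link = f'<link rel="alternate" href="{escape(llms_path, quote=True)}">'
    return f'{link}\n<script type="application/ld+json">\n{payload}\n</script>'
```

Calling `head_snippet("How agents browse", "A short summary", "https://example.com/post")` returns text that can be pasted into a template's head section.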

What to do about it

Action	Helps with
Serve `llms.txt` with clean markdown	ChatGPT, Perplexity
Add `<link rel="alternate" href="/llms.txt">`	Gemini (via Google index)
Add JSON-LD structured data	Gemini (via Google index)
Don't block `Google-Extended` in robots.txt	Gemini
Verify RFC 9421 signatures to authenticate bots	ChatGPT Agent Mode verification
Serve structured content with clear headings	Perplexity span extraction
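
The robots.txt item in the table above is easy to check mechanically. The sketch below uses Python's standard `urllib.robotparser` to test whether a given user agent may fetch a path. The helper name `agent_allowed` and the sample rules are made up for illustration, and the check only reflects what your robots.txt says, not whether a crawler honors it.

```python
from urllib.robotparser import RobotFileParser


def agent_allowed(robots_txt: str, user_agent: str, url: str = "/") -> bool:
    """Report whether robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules that would hide a site from Google-Extended.
SAMPLE_RULES = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    for agent in ("Google-Extended", "PerplexityBot"):
        print(agent, agent_allowed(SAMPLE_RULES, agent))
```

With the sample rules, `Google-Extended` is reported as blocked while other agents fall through to the wildcard entry, which is the configuration the table advises against.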