# Serving Markdown to Bots with Content Negotiation

A web page and the information inside it are not the same thing. The page is a delivery
mechanism (layout, scripts, hydration markers, a tree of wrapper elements) wrapped around
a much smaller payload of actual content. Humans need the delivery mechanism. **An AI
crawler does not.** It wants the content, and everything else is overhead it pays for on
every fetch and then throws away.

This is the core insight behind shipping a Markdown companion for every page. On a
statically prerendered site, like this one, the companions are just files: each route is
rendered to HTML at build time, and a matching `.md` is emitted next to it, advertised
through a `text/markdown` alternate link in the head and indexed in `llms.txt`. There is no
server in the path, so there is nothing to optimize at request time.
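For the static case, the route-to-companion mapping and the head tag that advertises it might look like the sketch below. The helper names (`companionPath`, `alternateLinkTag`) and the convention that the root route is emitted as `/index.md` are assumptions for illustration, not this site's actual build code.

```ts
// Hypothetical helpers: map a prerendered route to its Markdown companion
// and build the <link> tag that advertises it in the page head.
function companionPath(route: string): string {
  // Assumption: the root route is emitted as /index.md.
  const trimmed = route.replace(/\/+$/, "");
  return trimmed === "" ? "/index.md" : `${trimmed}.md`;
}

// Assumes the route contains no characters that need HTML escaping.
function alternateLinkTag(route: string): string {
  return `<link rel="alternate" type="text/markdown" href="${companionPath(route)}">`;
}

console.log(alternateLinkTag("/about"));
```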

The interesting case is **server-side rendering**, where every page view costs you a real
render. There, the Markdown companion stops being a nicety and becomes a load-shedding
mechanism, so long as you serve it through *content negotiation* rather than only at a
separate `.md` URL.

## What content negotiation buys you

Content negotiation is the HTTP mechanism for serving different *representations* of the
same resource at the same URL, chosen from what the client signals. The canonical signal is
the `Accept` header. A browser sends an `Accept` value that leads with `text/html`; a crawler
that understands the arrangement can send `Accept: text/markdown` and get the lightweight
representation back from the identical URL: no separate link to discover, no second URL to
maintain.

The trick is to honour only an *explicit* signal:

```ts
import type { Request } from "express";

// True only when the client explicitly lists text/markdown. A wildcard (*/*)
// or text/html stays HTML, so browsers, which never ask for markdown, are
// completely unaffected.
function acceptsMarkdown(req: Request): boolean {
  // Node normally exposes Accept as a single string; the array guard is defensive.
  const accept = req.headers["accept"];
  const acceptStr = Array.isArray(accept) ? accept[0] : accept;
  return /\btext\/markdown\b/i.test(acceptStr ?? "");
}
```
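One gap in a plain substring test is that `Accept: text/markdown;q=0` also matches, even though a quality value of zero means "not acceptable". If that matters to you, a stricter check could parse the header instead. This is a sketch under that assumption; `explicitlyAcceptsMarkdown` is a hypothetical name, and it takes the raw header string rather than a request object so it has no framework dependency.

```ts
// Hypothetical stricter check: require an explicit text/markdown entry
// whose q-value is greater than zero. Wildcards never count as opt-in.
function explicitlyAcceptsMarkdown(accept: string | undefined): boolean {
  if (!accept) return false;
  for (const part of accept.split(",")) {
    const [type, ...params] = part.split(";").map((s) => s.trim());
    if (type.toLowerCase() !== "text/markdown") continue;
    // A missing q parameter means the default weight of 1.
    const qParam = params.find((p) => /^q\s*=/i.test(p));
    const q = qParam ? Number(qParam.split("=")[1]) : 1;
    return Number.isFinite(q) && q > 0;
  }
  return false;
}
```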

In practice most AI crawlers don't send `Accept: text/markdown` yet, so a second signal does
the heavy lifting: the **User-Agent**. If you recognise the crawler by name, you can serve
it Markdown categorically, no header cooperation required.

## A middleware that intercepts before the renderer

The whole thing fits in one Express middleware mounted *ahead* of both the static handler
and the Angular SSR engine. We run a version of this on a sister site; the shape is:

```ts
server.use(compression());

// Serve markdown to AI bots BEFORE Angular ever renders. The middleware
// short-circuits the request for known crawlers; everyone else falls
// through to the normal SSR path untouched.
server.use(botMarkdownMiddleware(browserDistFolder));

server.use(express.static(browserDistFolder, { maxAge: "1y", index: false }));
server.get("*", /* ... Angular CommonEngine render ... */);
```

The middleware decides, per request, whether this is a Markdown request at all. There are
three independent triggers:

1. **A known bot User-Agent**, matched against an explicit allow-list.
2. **An explicit `.md` suffix**: `/about.md` strips to `/about` and serves it as Markdown
   regardless of who is asking. This is the public, documentable URL.
3. **`Accept: text/markdown`**, the standards-compliant path for any client that opts in.

If none of those fire, the request is a normal page view and falls straight through to SSR.
The crawler path never even reaches the renderer.
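The three triggers can be sketched as a single pure decision function. Everything here is illustrative: `decide`, `Decision`, and the two-name `KNOWN_BOTS` pattern are stand-ins, and checking the `.md` suffix first (so the path is always stripped) is an assumption about ordering, not a description of the real middleware.

```ts
type Trigger = "bot-user-agent" | "md-suffix" | "accept-header";

interface Decision {
  trigger: Trigger;
  path: string; // route whose Markdown should be served
}

// Stand-in for the real allow-list; two names are enough for the sketch.
const KNOWN_BOTS = /\b(GPTBot|ClaudeBot)\b/i;

function decide(path: string, userAgent = "", accept = ""): Decision | null {
  // Explicit .md suffix: strip it and serve Markdown to anyone.
  if (path.endsWith(".md")) {
    return { trigger: "md-suffix", path: path.slice(0, -3) };
  }
  if (KNOWN_BOTS.test(userAgent)) return { trigger: "bot-user-agent", path };
  if (/\btext\/markdown\b/i.test(accept)) return { trigger: "accept-header", path };
  return null; // normal page view: fall through to SSR
}
```

A middleware built on this would call `next()` whenever `decide` returns `null`, which is what keeps the human path untouched.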

### Matching bots without cloaking Google

The User-Agent list is deliberately narrow and deliberately *excludes* Google's search crawler:

```ts
// AI-specific crawlers only. Googlebot and Google-Extended are deliberately
// absent from the list, so Google's search crawler keeps receiving the full
// HTML page; serving it stripped-down markdown would risk cloaking. The \b
// word boundaries keep a name from matching inside a longer token.
const BOT_PATTERN =
  /\b(GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|Claude-SearchBot|claude-web|anthropic-ai|PerplexityBot|CCBot|Bytespider|Applebot|Meta-ExternalAgent|FacebookBot|Amazonbot)\b/i;
```

This is the line worth being careful about. "Serve different content to bots than to users"
is the definition of cloaking, and search engines penalise it. The defence is that **this
isn't different content, it's a different representation of the same content**, the way a
`print.css` or an RSS feed is. To keep that defensible:

- **Google's search crawler is excluded on purpose.** It gets the same HTML a human gets.
  Only generative/AI agents (which are asking for ingestible text, not a rendered page)
  get Markdown.
- The Markdown is generated *from* the page's own content, so the substance matches.
- If no Markdown exists for a route, the middleware **falls through to SSR** rather than
  serving a stub. A bot never gets a worse page than a human; worst case it gets the same
  page.
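The fall-through rule could be sketched as a lookup that either returns a companion file or returns nothing, leaving SSR to handle the request. `findCompanion` is a hypothetical helper; the `index.md` convention for the root route and the injected `exists` callback (in a Node server, something like `fs.existsSync`) are assumptions made to keep the example self-contained.

```ts
// Hypothetical lookup: turn a request pathname into the companion file path,
// or return null so the caller can fall through to SSR.
function findCompanion(
  distFolder: string,
  urlPath: string, // assumed to be an already-decoded pathname, no query string
  exists: (file: string) => boolean,
): string | null {
  const segments = urlPath.split("/").filter((s) => s.length > 0);
  // Refuse anything that could escape the dist folder.
  if (segments.some((s) => s === ".." || s.includes("\\"))) return null;
  const relative = segments.length === 0 ? "index" : segments.join("/");
  const file = `${distFolder}/${relative}.md`;
  return exists(file) ? file : null;
}
```

Returning `null` rather than an error page is the point: a missing companion degrades to the normal HTML response, never to a stub.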

### Don't poison the cache: `Vary`

The one operational hazard of content negotiation is caching. If a CDN caches the Markdown
response under the bare URL, the next human gets Markdown, or vice versa. The fix is the
`Vary` header, which tells caches the response depends on specific request headers:

```
Vary: User-Agent, Accept
Content-Type: text/markdown; charset=utf-8
Cache-Control: public, max-age=1800
```

Apply `Vary: User-Agent, Accept` to the **dynamic** responses (HTML and Markdown both), but
*not* to genuinely static assets: CSS, JS, fonts, images. Keying static files by
User-Agent would shatter the cache into one entry per browser string for no benefit. So the
middleware lets static assets pass through untouched and only annotates the negotiated
routes.
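Other middleware may already have set `Vary` by the time the negotiated response is written (compression, for instance, commonly adds `Accept-Encoding`), so the header should be merged rather than overwritten. Express's `res.vary()` handles this for you; the sketch below only illustrates the merging logic, and `mergeVary` is a hypothetical name.

```ts
// Merge header names into an existing Vary value without duplicates.
// Field names are case-insensitive, so comparison is lowercased.
function mergeVary(existing: string | undefined, additions: string[]): string {
  const names = (existing ?? "")
    .split(",")
    .map((n) => n.trim())
    .filter((n) => n.length > 0);
  if (names.includes("*")) return "*"; // "*" already covers every header
  const seen = new Set(names.map((n) => n.toLowerCase()));
  for (const name of additions) {
    if (!seen.has(name.toLowerCase())) {
      names.push(name);
      seen.add(name.toLowerCase());
    }
  }
  return names.join(", ");
}
```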

## Why this is cheaper, in concrete terms

Under SSR, producing one HTML page means bootstrapping the framework, running the component
tree, fetching whatever data the page needs, serializing the DOM, and adding hydration
state. Apart from the data fetching, that work is CPU-bound, and it is usually the most
expensive thing your server does per request.

The Markdown branch skips all of it. It reads a cached string (or runs a lightweight
HTML-to-Markdown conversion of already-fetched content) and writes a few kilobytes. Roughly:

- **No framework bootstrap, no render tree, no serialization** on the bot path.
- **A fraction of the bytes** over the wire: no markup, scripts, or styles.
- **Cacheable at the edge**, keyed on User-Agent and Accept through `Vary`, with a short TTL.

The payload a crawler ingests is also *better* for it: clean prose with headings and lists,
no navigation chrome or boilerplate to wade through. You spend less and the model gets a
cleaner signal.
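The "reads a cached string" step could be as small as an in-memory map with an expiry. This is a minimal sketch, not the sister site's implementation: `MarkdownCache` is a hypothetical class, and the injectable clock exists only to make the expiry behaviour easy to test.

```ts
// Hypothetical in-memory cache for rendered Markdown, keyed by route.
class MarkdownCache {
  private entries = new Map<string, { body: string; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached body, or undefined if absent or expired.
  get(route: string): string | undefined {
    const hit = this.entries.get(route);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(route);
      return undefined;
    }
    return hit.body;
  }

  set(route: string, body: string): void {
    this.entries.set(route, { body, expiresAt: this.now() + this.ttlMs });
  }
}
```

A TTL of `30 * 60 * 1000` milliseconds would line up with the `max-age=1800` shown in the response headers earlier.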

## The rate-limit dividend

Here is the part that changes how you operate the site. AI crawlers are aggressive: GPTBot,
ClaudeBot, PerplexityBot and friends can hit a site hard, and the usual defensive reflex is
to **throttle them**: rate limits, `Crawl-delay`, outright blocks when they spike. Every one
of those measures trades away coverage. A throttled crawler indexes less of you, and in a
world where citation in AI answers depends on having been ingested in the first place,
throttling is self-defeating.

Content negotiation flips the trade-off. Once a crawler's requests cost you almost nothing
to serve, **you no longer need to throttle to protect the origin.** You can:

- Raise or remove rate limits for recognised AI agents, because their traffic no longer
  competes with human SSR traffic for CPU.
- Let them crawl deeper and more often: fresher content in the models that cite you.
- Stop treating crawler spikes as an incident, because a spike of Markdown requests is a
  rounding error next to a spike of renders.

You convert an adversarial relationship (bots as a load problem to be fended off) into a
cooperative one: cheap to serve, generously crawled, well represented in the answers that
increasingly sit between your content and its audience. That is the same GEO logic behind
shipping Markdown companions and an `llms.txt` index at all; content negotiation is just
what makes it *operationally* free under SSR.
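One way to encode "cheap requests need less throttling" is to budget clients in cost units instead of request counts. The sketch below is an assumption about how that could look, not something the site runs: `CostBudget` is a hypothetical class, and the 50:1 cost ratio is an arbitrary placeholder that should be replaced with your own measurements.

```ts
type Representation = "html" | "markdown";

// Illustrative costs only: the real ratio should come from measurement.
const COST: Record<Representation, number> = { html: 50, markdown: 1 };

// Hypothetical fixed-window budget per client, spent in cost units
// rather than in request counts.
class CostBudget {
  private spent = new Map<string, number>();
  private limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Returns true and records the cost if the client can afford the request.
  tryConsume(client: string, rep: Representation): boolean {
    const used = this.spent.get(client) ?? 0;
    if (used + COST[rep] > this.limit) return false;
    this.spent.set(client, used + COST[rep]);
    return true;
  }

  // Call on a timer to start a new window.
  reset(): void {
    this.spent.clear();
  }
}
```

With these placeholder numbers, a budget of 100 units allows two SSR renders or a hundred Markdown responses per window, which is the trade-off the section describes.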

## When you need it, and when you don't

If your site is statically prerendered, you may not need the middleware at all: ship the
`.md` files as static artifacts and let the CDN serve them for free. The negotiation layer
earns its keep precisely when **rendering is expensive and per-request**. That is where
handing bots a cheap representation pays for itself, in CPU you don't burn and rate limits
you don't have to enforce.