llms.txt and llms-full.txt: A Sitemap, Rewritten for Language Models

For twenty years the machine-readable map of a website has been sitemap.xml: a flat XML list of every URL, each with a last-modified date and a crawl-priority hint. It was built for one consumer, the search-engine crawler, and for one job, telling that crawler which pages exist and how often to come back. It does that job well. It does nothing else.

Language models arrived with a different set of questions, and sitemap.xml answers none of them. A crawler feeding a model does not primarily want an exhaustive URL list to schedule; it wants to know what this site is about, which handful of documents actually matter, and whether it can get the content without rendering a page full of navigation, scripts, and cookie banners. Out of several proposals for filling that gap, two files have surfaced and stuck: llms.txt and llms-full.txt.

What sitemap.xml is, and where it stops

sitemap.xml is an index of locations. Its unit is the URL. It is written in XML for machines, is never meant to be read by a human, and carries no content and no editorial judgement: every URL sits in the same flat list, and <priority> is a crawl hint, not a statement about what a reader should look at first. Crucially, every entry points at an HTML page, the expensive, chrome-wrapped representation a model then has to fetch and strip down before it can use a word of it.

For a search engine that is exactly right. For a model assembling context, it is the wrong shape on every axis: wrong format, no curation, no content, and it points at the heaviest possible representation of each page.
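
To make the "index of locations" point concrete, here is a minimal sketch of what a consumer gets out of a sitemap. The sample XML and the read_sitemap helper are hypothetical, written for illustration; only the namespace comes from the sitemaps.org 0.9 schema.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 schema.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-01-01</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/terms</loc>
    <lastmod>2024-06-15</lastmod>
    <priority>0.3</priority>
  </url>
</urlset>
"""


def read_sitemap(xml_text):
    """Return one dict per <url> entry: location plus crawl hints, nothing else."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries


for entry in read_sitemap(SITEMAP_XML):
    print(entry)
```

Everything the parser returns is a location or a scheduling hint. There is no title, no description, and no content to read without a second fetch of the HTML page.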

What llms.txt actually is

llms.txt is a single Markdown file at the root of the domain, /llms.txt, proposed in late 2024 (the convention lives at llmstxt.org). It is not an exhaustive URL dump. It is a curated index, written in Markdown so that it is legible to a person and trivial for a model to parse. A minimal one looks like this:

# Troll Tech Ltd

> Software studio building retro-flavoured, standards-first web systems.

## Pages
- [Home](https://troll-tech.co.uk/index.md): Company overview, services, and stack.
- [Privacy & Terms](https://troll-tech.co.uk/terms.md): Policy and registration details.

## Field Notes
- [GEO vs SEO](https://troll-tech.co.uk/notes/geo-vs-seo.md): Where the boundary is.
- [Serving Markdown to Bots](https://troll-tech.co.uk/notes/markdown-content-negotiation.md)

Three things make it a different artifact from sitemap.xml, not a reformatting of it:

It is curated, not exhaustive. It lists the documents worth reading, with a one-line description of each, in an order that reflects importance. That editorial layer is the whole point; an indiscriminate list would defeat it.
It is human-first Markdown. The same file a model parses is one a person can open and understand. No XML schema, no tooling required.
It links to clean content, not chrome. The convention is to point at Markdown companions (the .md representations of each page), so a crawler follows the index straight into ingestible prose rather than into a rendered HTML shell.
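
How little tooling the format needs can be shown with a rough parser. This is a sketch under assumptions, not a reference implementation: parse_llms_txt and LINK_RE are hypothetical names, and the parser only understands the subset of the format used in the sample above (one H1 title, one blockquote summary, H2 sections, and "- [title](url): description" list items).

```python
import re

# Matches "- [Title](url)" with an optional ": description" suffix.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<desc>.*))?$"
)

SAMPLE = """# Troll Tech Ltd

> Software studio building retro-flavoured, standards-first web systems.

## Pages
- [Home](https://troll-tech.co.uk/index.md): Company overview, services, and stack.

## Field Notes
- [Serving Markdown to Bots](https://troll-tech.co.uk/notes/markdown-content-negotiation.md)
"""


def parse_llms_txt(text):
    """Parse a simple llms.txt into a title, a summary, and per-section link lists."""
    doc = {"title": None, "summary": None, "sections": {}}
    current = None  # name of the section currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith("> ") and doc["summary"] is None:
            doc["summary"] = line[2:].strip()
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                doc["sections"][current].append({
                    "title": match.group("title"),
                    "url": match.group("url"),
                    "description": match.group("desc"),  # None when omitted
                })
    return doc


parsed = parse_llms_txt(SAMPLE)
for section, links in parsed["sections"].items():
    print(section, [link["url"] for link in links])
```

Because sections keep their file order, a consumer can treat the first links in the first section as the most important ones, which is the editorial signal sitemap.xml has no way to carry.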

What llms-full.txt adds

llms.txt is still an index: a model reading it learns the shape of the site and then has to fetch each linked document. llms-full.txt collapses that second step. It is the entire site, concatenated into one Markdown document: every page and article inlined, in reading order, in a single file.

The reason it exists is the way models consume information. A model does not browse; it fills a context window. Handing it one file that already contains everything means:

One fetch instead of dozens. No crawl, no link-following, no per-page render. The whole corpus arrives in a single request.
No reconstruction. The model does not have to stitch a site together from fragments; it is already assembled in order.
Token-efficient input. Plain Markdown with no HTML tags, scripts, or styling is a fraction of the tokens that the equivalent rendered HTML would cost.

The division of labour is clean: llms.txt is the table of contents, for when a model or agent wants to choose what to read; llms-full.txt is the whole book, for when it just wants everything at once.
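
The "whole book" idea reduces to an ordered concatenation. The sketch below assumes each page is already available as a Markdown string; build_llms_full and the sample pages are hypothetical, and the choice to put every page under its own H2 heading is one possible layout, not something the convention mandates.

```python
def build_llms_full(site_title, pages):
    """Concatenate pages into one Markdown document.

    pages is a list of (title, markdown_body) tuples, already in reading order.
    """
    parts = [f"# {site_title}"]
    for title, body in pages:
        # Each page becomes one section; surrounding whitespace is trimmed
        # so that sections are separated by exactly one blank line.
        parts.append(f"## {title}\n\n{body.strip()}")
    return "\n\n".join(parts) + "\n"


# Hypothetical content, for illustration only.
pages = [
    ("Home", "Welcome.\n"),
    ("Terms", "  Policy text.  "),
]
print(build_llms_full("Example Site", pages))
```

A real build would probably also demote headings inside each page body so they nest under the per-page heading; that step is left out here to keep the sketch short.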

sitemap.xml vs llms.txt, side by side

| | `sitemap.xml` | `llms.txt` / `llms-full.txt` |
| --- | --- | --- |
| Consumer | Search crawler | Language model / AI agent |
| Format | XML | Markdown |
| Human-readable | No | Yes |
| Content | None (URLs only) | Descriptions; full text in `-full` |
| Selection | Exhaustive | Curated by importance |
| Points at | HTML pages | Clean Markdown |
| Job | Crawl scheduling | Comprehension and ingestion |

They are not rivals. A site should ship both: sitemap.xml still does its job for search, and llms.txt does the job search never asked of it.

The honest caveat on adoption

This is a convention that has gained real traction, not a ratified standard with guaranteed consumers. Plenty of sites and documentation platforms now emit llms.txt automatically, and the format is stable enough to rely on. But as of this writing no major model provider has publicly committed to ingesting llms.txt at inference or training time, and some voices in the search industry have been openly sceptical, comparing it to the long-dead keywords meta tag. Treat the upside as real but unguaranteed: you are publishing a clean, cheap, well-structured front door for the crawlers that do choose to use it, at near-zero cost, rather than betting the site on it.

The cost-to-benefit ratio is what makes it an easy call. The files are generated from content you already have, they cannot hurt you, and the moment a crawler decides to prefer a curated Markdown index over scraping rendered HTML, you are already there.

How this site does it

Both files are build artifacts here, regenerated on every build from the same content the site renders. After each locale builds, a script reads that locale's content bundle and Field Notes and writes llms.txt (the curated index) and llms-full.txt (the full corpus) into the site root, alongside per-page .md companions that the index links to. robots.txt points crawlers at llms.txt, exactly as it points search engines at sitemap.xml. The two maps sit side by side at the root of the domain, each addressed to the consumer it was designed for.
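
A build step of the kind described above might look roughly like the following. This is not the site's actual script: the function names, the shape of the SECTIONS data, and the example.com base URL are all assumptions made for illustration. It writes one .md companion per page and an llms.txt index that links to those companions.

```python
import tempfile
from pathlib import Path

# Hypothetical content bundle: section name -> pages in order of importance.
SECTIONS = {
    "Pages": [
        {"slug": "index", "title": "Home",
         "description": "Company overview.", "body": "# Home\n\nWelcome.\n"},
    ],
    "Field Notes": [
        {"slug": "notes/geo-vs-seo", "title": "GEO vs SEO",
         "description": "Where the boundary is.", "body": "# GEO vs SEO\n\nBody text.\n"},
    ],
}


def render_index(site_title, summary, sections, base_url):
    """Render the curated llms.txt index as Markdown text."""
    lines = [f"# {site_title}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        for page in pages:
            url = f"{base_url}/{page['slug']}.md"
            lines.append(f"- [{page['title']}]({url}): {page['description']}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


def write_artifacts(out_dir, site_title, summary, sections, base_url):
    """Write per-page .md companions and llms.txt into out_dir."""
    out = Path(out_dir)
    for pages in sections.values():
        for page in pages:
            companion = out / f"{page['slug']}.md"
            companion.parent.mkdir(parents=True, exist_ok=True)
            companion.write_text(page["body"], encoding="utf-8")
    index = render_index(site_title, summary, sections, base_url)
    (out / "llms.txt").write_text(index, encoding="utf-8")


# Usage: build into a throwaway directory and show the generated index.
with tempfile.TemporaryDirectory() as tmp:
    write_artifacts(tmp, "Example Site", "A hypothetical site.",
                    SECTIONS, "https://example.com")
    print((Path(tmp) / "llms.txt").read_text(encoding="utf-8"))
```

Generating the index and the companions from the same data in one pass is what keeps the links in llms.txt from drifting out of sync with the files they point at.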