# llms.txt and llms-full.txt: A Sitemap, Rewritten for Language Models

For twenty years the machine-readable map of a website has been `sitemap.xml`: a flat XML
list of every URL, each optionally tagged with a last-modified date and a crawl-priority hint.
It was built
for one consumer, the search-engine crawler, and for one job, telling that crawler which
pages exist and how often to come back. It does that job well. It does nothing else.

Language models arrived with a different set of questions, and `sitemap.xml` answers none of
them. A crawler feeding a model does not primarily want an exhaustive URL list to schedule;
it wants to know **what this site is about, which handful of documents actually matter, and
whether it can get the content without rendering a page full of navigation, scripts, and
cookie banners.** Of the several proposals for filling that gap, two files have caught on and
stuck: `llms.txt` and `llms-full.txt`.

## What sitemap.xml is, and where it stops

`sitemap.xml` is an index *of locations*. Its unit is the URL. It is written in XML for
machines, was never meant to be read by a human, and carries no content and no editorial
judgement: every URL sits in the same flat list, and `<priority>` is a crawl hint, not a
statement about what a reader should look at first. Crucially, almost every entry in practice
points at an **HTML page**, the expensive, chrome-wrapped representation a model then has to
fetch and strip down before it can use a word of it.

For a search engine that is exactly right. For a model assembling context, it is the wrong
shape on every axis: wrong format, no curation, no content, and it points at the heaviest
possible representation of each page.
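
To make the "index of locations" point concrete, here is a minimal sketch that parses a small
sitemap with Python's standard library. The sitemap content, the `example.com` URLs, and the
`parse_sitemap` name are invented for illustration; only the XML namespace comes from the
sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical two-entry sitemap, inlined so the example is self-contained.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-01-15</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/notes/first-post</loc>
    <lastmod>2025-01-10</lastmod>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry: the location plus optional crawl hints."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            # lastmod and priority are optional; findtext returns None if absent.
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries


for entry in parse_sitemap(SITEMAP_XML):
    print(entry)
```

Everything the parser recovers is a location or a crawl hint. There is no title, no
description, and no content anywhere in the result.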

## What llms.txt actually is

`llms.txt` is a single Markdown file at the **root of the domain**, `/llms.txt`, proposed in
late 2024 (the convention lives at `llmstxt.org`). It is not an exhaustive URL dump. It is a
**curated index**, written in Markdown so that a person can read it and a model can parse it
trivially. A minimal one looks like this:

```markdown
# Troll Tech Ltd

> Software studio building retro-flavoured, standards-first web systems.

## Pages
- [Home](https://troll-tech.co.uk/index.md): Company overview, services, and stack.
- [Privacy & Terms](https://troll-tech.co.uk/terms.md): Policy and registration details.

## Field Notes
- [GEO vs SEO](https://troll-tech.co.uk/notes/geo-vs-seo.md): Where the boundary is.
- [Serving Markdown to Bots](https://troll-tech.co.uk/notes/markdown-content-negotiation.md)
```

Three things make it a different artifact from `sitemap.xml`, not a reformatting of it:

1. **It is curated, not exhaustive.** It lists the documents worth reading, with a
   one-line description of each, in an order that reflects importance. That editorial layer
   is the whole point; an indiscriminate list would defeat it.
2. **It is human-first Markdown.** The same file a model parses is one a person can open and
   understand. No XML schema, no tooling required.
3. **It links to clean content, not chrome.** The convention is to point at Markdown
   companions (the `.md` representations of each page), so a crawler follows the index
   straight into ingestible prose rather than into a rendered HTML shell.
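
As a rough illustration of how little machinery the format needs, the sketch below reads an
`llms.txt` like the one above into a plain dictionary. It is not a reference parser for the
convention: it assumes only the shapes shown in the example (one `#` title, one `>` summary,
`##` sections, and `- [title](url): description` links), and `parse_llms_txt` is a
hypothetical name.

```python
import re

# Matches "- [Title](url)" with an optional ": description" suffix.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)

# Shortened copy of the example file above.
SAMPLE = """# Troll Tech Ltd

> Software studio building retro-flavoured, standards-first web systems.

## Pages
- [Home](https://troll-tech.co.uk/index.md): Company overview, services, and stack.

## Field Notes
- [GEO vs SEO](https://troll-tech.co.uk/notes/geo-vs-seo.md): Where the boundary is.
- [Serving Markdown to Bots](https://troll-tech.co.uk/notes/markdown-content-negotiation.md)
"""


def parse_llms_txt(text):
    """Parse the title, summary, and sectioned link lists out of an llms.txt."""
    doc = {"title": None, "summary": None, "sections": {}}
    current = None  # name of the section currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith("> ") and doc["summary"] is None:
            doc["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                doc["sections"][current].append({
                    "title": match.group("title"),
                    "url": match.group("url"),
                    "description": match.group("desc"),  # None when omitted
                })
    return doc


doc = parse_llms_txt(SAMPLE)
for section, links in doc["sections"].items():
    print(section, [link["title"] for link in links])
```

Because Python dictionaries preserve insertion order, the sections and links come back in the
order the author wrote them, which is where the editorial ranking lives.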

## What llms-full.txt adds

`llms.txt` is still an *index*: a model reading it learns the shape of the site and then has
to fetch each linked document. `llms-full.txt` collapses that second step. It is **the entire
site, concatenated into one Markdown document**: every page and article inlined, in reading
order, in a single file.
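
The concatenation itself can be very simple. The sketch below joins a list of pages into one
Markdown string; the exact layout is an assumption made for illustration (a `##` heading and a
`Source:` line per page), as are the `build_llms_full` name and the `example.com` URLs. A real
generator would also need to decide what to do with headings inside each page body.

```python
def build_llms_full(site_title, pages):
    """Concatenate pages into a single Markdown document.

    pages is a list of (title, url, markdown_body) tuples, already in reading order.
    """
    parts = [f"# {site_title}", ""]
    for title, url, body in pages:
        parts.append(f"## {title}")
        parts.append(f"Source: {url}")  # assumed convention, not part of any spec
        parts.append("")
        parts.append(body.strip())
        parts.append("")
    # Normalise to exactly one trailing newline.
    return "\n".join(parts).rstrip() + "\n"


# Hypothetical content for illustration.
pages = [
    ("Home", "https://example.com/index.md", "Welcome to the site.\n"),
    ("Notes", "https://example.com/notes.md", "\nFirst note.\n\n"),
]
print(build_llms_full("Example Site", pages))
```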

The reason it exists is the way models consume information. A model does not browse; it fills
a context window. Handing it one file that already contains everything means:

- **One fetch instead of dozens.** No crawl, no link-following, no per-page render. The whole
  corpus arrives in a single request.
- **No reconstruction.** The model does not have to stitch a site together from fragments; it
  is already assembled in order.
- **Token-efficient input.** Plain Markdown with no HTML tags, scripts, or styling costs a
  fraction of the tokens that the equivalent rendered HTML would.

The division of labour is clean: `llms.txt` is the **table of contents**, for when a model or
agent wants to choose what to read; `llms-full.txt` is the **whole book**, for when it just
wants everything at once.

## sitemap.xml vs llms.txt, side by side

| | `sitemap.xml` | `llms.txt` / `llms-full.txt` |
|---|---|---|
| Consumer | Search crawler | Language model / AI agent |
| Format | XML | Markdown |
| Human-readable | No | Yes |
| Content | None (URLs only) | Descriptions; full text in `-full` |
| Selection | Exhaustive | Curated by importance |
| Points at | HTML pages | Clean Markdown |
| Job | Crawl scheduling | Comprehension and ingestion |

They are not rivals. A site should ship both: `sitemap.xml` still does its job for search,
and `llms.txt` does the job search never asked of it.

## The honest caveat on adoption

This is a convention that has gained real traction, not a ratified standard with guaranteed
consumers. Plenty of sites and documentation platforms now emit `llms.txt` automatically, and
the format is stable enough to rely on. But as of this writing **no major model provider has
publicly committed to ingesting `llms.txt` at inference or training time**, and some search
voices have been openly skeptical, comparing it to the long-dead keywords meta tag. Treat the
upside as real but unguaranteed: at near-zero cost you are publishing a clean,
well-structured front door for the crawlers that do choose to use it, not betting the site
on it.

The cost-to-benefit ratio is what makes it an easy call. The files are generated from content
you already have, they carry little risk, and the moment a crawler decides to prefer a curated
Markdown index over scraping rendered HTML, you are already there.

## How this site does it

Both files are build artifacts here, regenerated on every build from the same content the
site renders. After each locale builds, a script reads that locale's content bundle and Field
Notes and writes `llms.txt` (the curated index) and `llms-full.txt` (the full corpus) into
the site root, alongside per-page `.md` companions that the index links to. `robots.txt`
points crawlers at `llms.txt`, exactly as it points search engines at `sitemap.xml`. The two
maps sit side by side at the root of the domain, each addressed to the consumer it was
designed for.
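
A build step of this kind might look roughly like the following. This is a sketch under
stated assumptions, not this site's actual script: `render_llms_txt` and `write_artifacts` are
hypothetical names, the input shapes are invented, and per-locale handling and the `.md`
companions are left out.

```python
from pathlib import Path


def render_llms_txt(site_title, summary, sections):
    """Render the curated index.

    sections maps a section name to a list of (title, url, description) tuples.
    """
    lines = [f"# {site_title}", "", f"> {summary}", ""]
    for name, links in sections.items():
        lines.append(f"## {name}")
        for title, url, description in links:
            entry = f"- [{title}]({url})"
            if description:  # the description is optional
                entry += f": {description}"
            lines.append(entry)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


def write_artifacts(out_dir, llms_txt, llms_full):
    """Write both files into the site root and return the file names written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "llms.txt").write_text(llms_txt, encoding="utf-8")
    (out / "llms-full.txt").write_text(llms_full, encoding="utf-8")
    return sorted(p.name for p in out.iterdir())
```

Keeping rendering separate from writing means the index text can be checked in a test without
touching the filesystem.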
