
llms.txt vs robots.txt: Different Jobs, Common Confusion

llms.txt vs robots.txt: one controls crawler access, the other guides AI systems through your content. How they differ, complement, and get confused.

Jul 5, 2026 · 5 min · Guide

Last updated: July 2026

These two files get confused constantly, and the confusion causes real mistakes: site owners who think llms.txt blocks AI training, and site owners who think robots.txt tells AI assistants what their business does. Neither is true. The files sit next to each other at the root of your domain and do completely different jobs.

The one-line version: robots.txt is a bouncer, llms.txt is a menu. One decides who gets in; the other tells the guests who did get in what is worth ordering.


robots.txt: access control

robots.txt is the web's oldest crawler convention, dating back to 1994. It tells automated crawlers which parts of your site they may and may not visit, per bot:

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/

That is the whole job: access. It says nothing about what your content means; it only draws lines around where bots are allowed to go. It is also worth being honest about its limits: robots.txt is a convention, not a lock. Well-behaved crawlers respect it; badly behaved ones ignore it.
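To see how a compliant crawler would read the rules above, here is a minimal sketch using Python's standard-library urllib.robotparser. The example.com URLs and the is_allowed helper are illustrative assumptions, and real crawlers may resolve edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules shown above, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant bot with this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a real site under test.
    for bot, url in [
        ("GPTBot", "https://example.com/pricing"),
        ("ClaudeBot", "https://example.com/pricing"),
        ("ClaudeBot", "https://example.com/admin/users"),
    ]:
        verdict = "allowed" if is_allowed(ROBOTS_TXT, bot, url) else "blocked"
        print(f"{bot:10} {url} -> {verdict}")
```

The result only describes what a bot that honors robots.txt would do. Nothing in it prevents a non-compliant crawler from fetching the page anyway.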

The AI era stretched robots.txt

Because robots.txt is where crawler control already lives, the AI wave landed there first, and the file grew new AI-specific machinery:

Per-bot AI user agents. The major AI companies now operate multiple bots with separate jobs, each individually blockable. OpenAI runs three main ones: GPTBot (training), OAI-SearchBot (ChatGPT search indexing), and ChatGPT-User (live fetching when a user asks about your site). Anthropic mirrors this with ClaudeBot, Claude-SearchBot, and Claude-User. Blocking one does not block the others — you can opt out of training while staying reachable for live assistant answers.
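One way to express that split (opting out of training while staying reachable for search and live answers) is to generate the blocking groups from a small policy map. The sketch below is an illustration only: the category labels and the build_robots_txt helper are assumptions of this example, and the bot names are the ones listed above.

```python
# Bot names as listed above, grouped by job. The category labels are
# this sketch's own naming, not an official taxonomy.
AI_BOTS = {
    "training": ["GPTBot", "ClaudeBot"],
    "search": ["OAI-SearchBot", "Claude-SearchBot"],
    "live-fetch": ["ChatGPT-User", "Claude-User"],
}


def build_robots_txt(blocked_categories: set[str]) -> str:
    """Emit one 'Disallow: /' group per bot in the blocked categories."""
    groups = []
    for category, bots in AI_BOTS.items():
        if category not in blocked_categories:
            continue
        for bot in bots:
            groups.append(f"User-agent: {bot}\nDisallow: /")
    # Everyone else keeps normal access; an empty Disallow allows everything.
    groups.append("User-agent: *\nDisallow:")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Block training crawlers only; search and live-fetch bots stay allowed.
    print(build_robots_txt({"training"}))
```

Keeping the bots grouped by job makes the policy decision explicit: you choose categories, and the per-bot lines follow from that choice.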

Google-Extended. Google's control token (introduced September 2023) is not a separate crawler — Googlebot does the crawling either way. Disallowing Google-Extended tells Google not to use your content for training Gemini models or for grounding, without touching your presence in Google Search.

Content Signals. In September 2025, Cloudflare launched the Content Signals Policy: machine-readable preferences inside robots.txt that separate three uses of your content — search (classic indexing), ai-input (feeding AI answers at query time), and ai-train (model training):

Content-Signal: search=yes, ai-input=yes, ai-train=no

Cloudflare rolled default signals out to the millions of domains using its managed robots.txt. Like everything else in robots.txt, signals express preferences rather than enforce them, but they make your intent unambiguous.
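Because a Content-Signal line is plain key=value text, it is easy to read programmatically. The following is a minimal sketch, not a reference parser: the parse_content_signal name is hypothetical, and it assumes the comma-separated yes/no syntax shown in the example line above.

```python
def parse_content_signal(line: str) -> dict[str, bool]:
    """Parse 'Content-Signal: search=yes, ai-train=no' into {'search': True, ...}."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError(f"not a Content-Signal line: {line!r}")
    signals = {}
    for pair in value.split(","):
        key, sep, setting = pair.partition("=")
        key, setting = key.strip().lower(), setting.strip().lower()
        if not sep or setting not in ("yes", "no"):
            continue  # skip anything this sketch does not recognize
        signals[key] = setting == "yes"
    return signals


if __name__ == "__main__":
    print(parse_content_signal("Content-Signal: search=yes, ai-input=yes, ai-train=no"))
```

A signal that is missing from the line is simply absent from the result, which this sketch treats as "no preference stated" rather than as a yes or a no.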

All of this evolution is about the same single question: who may take my content, and for what use.


llms.txt: content guidance

llms.txt answers a different question: for the AI systems that do read my site, what should they understand about it?

It is a curated markdown file (a summary of what your organization does, plus an annotated list of your most important pages) designed to be read at inference time, when an AI assistant is actively trying to answer a question about you. We cover the format in detail in "What is llms.txt?"
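To make the shape concrete, here is a small sketch that renders such a file from structured data. The layout it produces (a title heading, a blockquote summary, then sections of annotated links) reflects the community proposal as commonly described, so validate real output against the spec. The Page class, render_llms_txt, and the Example Co content are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    url: str
    note: str  # one-line annotation telling the reader why the page matters


def render_llms_txt(name: str, summary: str, sections: dict[str, list[Page]]) -> str:
    """Render a title, a blockquote summary, and annotated link lists as markdown."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for page in pages:
            lines.append(f"- [{page.title}]({page.url}): {page.note}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder business and URLs, for illustration only.
    print(render_llms_txt(
        "Example Co",
        "Example Co sells refurbished widgets to small workshops.",
        {"Key pages": [
            Page("Pricing", "https://example.com/pricing", "Current plans and prices"),
            Page("Returns", "https://example.com/returns", "30-day return policy"),
        ]},
    ))
```

Note what is absent: no user agents, no allow or disallow rules. The output is content to be read, not instructions to be obeyed.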

The key structural difference: robots.txt is addressed to crawlers and enforced (voluntarily) at fetch time. llms.txt is addressed to language models and consumed as content. It has no directives, no allow/disallow, no user agents. It cannot keep anyone out, and robots.txt cannot explain anything. Each file is powerless at the other's job.

One more honest parallel: both files depend entirely on the reader's cooperation. robots.txt only stops crawlers that choose to respect it, and llms.txt only informs systems that choose to read it. Adoption of the latter is real but uneven, as we document in "Does llms.txt actually work?"


Side by side

|  | robots.txt | llms.txt |
| --- | --- | --- |
| Job | Access control | Content guidance |
| Answers | "Where may bots go?" | "What is this site about?" |
| Audience | Crawlers, per user agent | Language models and AI agents |
| Format | Directives (Allow/Disallow, signals) | Markdown (summary + annotated links) |
| Read at | Crawl time, before fetching pages | Inference time, while answering questions |
| Can block AI training? | Yes: per-bot rules, Google-Extended, Content Signals | No |
| Can describe your business? | No | Yes, that is the entire point |
| Standardized? | De facto since 1994, formalized as RFC 9309 | Community proposal (2024), adoption uneven |
| Enforced? | Voluntary compliance | Voluntary consumption |

Common misconceptions

"llms.txt blocks AI crawlers." It blocks nothing. It has no mechanism to — there are no directives in the format. If you want to keep GPTBot or ClaudeBot out, that is robots.txt work, and nothing else does it.

"robots.txt tells AI what my site is about." It cannot. The only thing an AI system learns from your robots.txt is which paths it may fetch. Everything about who you are and what you offer has to come from your content — or from a file designed to summarize it.

"I have robots.txt, so I don't need llms.txt" (or vice versa). They are not substitutes; neither covers the other's job. A site can sensibly have both, either, or neither depending on what it wants.

"Blocking AI bots in robots.txt makes llms.txt pointless." Only if you block everything. Many sites block training crawlers (GPTBot, ClaudeBot) but allow the live fetchers (ChatGPT-User, Claude-User) that answer user questions in real time — and those fetchers are precisely the readers llms.txt is for. Blanket-blocking every AI user agent while publishing an llms.txt is working against yourself.


Practical setup

A sane default for a business website that wants AI assistants to represent it accurately, without donating everything to training datasets:

  1. Decide your access policy in robots.txt. Allow the on-demand fetchers and search bots; block training bots if that is your position; consider adding Content Signals to make intent explicit. Then check your robots.txt with our AI crawler checker — it shows exactly which AI bots you currently allow or block, grouped by what each bot does. Accidentally blocking assistant fetchers with a blanket rule is the most common mistake we see.
  2. Publish an llms.txt that actually describes your business. Generate one, review it, and validate it against the spec.
  3. Get it live on your platform. Placement differs per platform — we have honest, tested guides for WordPress, Webflow, Squarespace, and Shopify (where the platform now serves a default file itself).
  4. Keep the two consistent. Do not list pages in llms.txt that your robots.txt forbids bots from fetching — you would be recommending a door you have locked.
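Step 4 is easy to automate. The sketch below extracts the links from an llms.txt and reports any that robots.txt forbids for a given user agent. It assumes every link points at the same host the robots.txt belongs to; blocked_links, the sample files, and the example.com URLs are hypothetical.

```python
import re
from urllib.robotparser import RobotFileParser

# Matches markdown links of the form [title](https://...).
LINK_RE = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")

LLMS_TXT = """\
# Example Co

> Example Co sells refurbished widgets.

## Key pages

- [Pricing](https://example.com/pricing): Plans and prices
- [Admin guide](https://example.com/admin/guide): Internal setup notes
"""

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def blocked_links(llms_txt: str, robots_txt: str, user_agent: str) -> list[str]:
    """Return llms.txt URLs that robots.txt forbids this user agent from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    urls = LINK_RE.findall(llms_txt)
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    for url in blocked_links(LLMS_TXT, ROBOTS_TXT, "ChatGPT-User"):
        print(f"listed in llms.txt but blocked by robots.txt: {url}")
```

Running the check once per user agent you care about (for example the live fetchers) catches the "recommended but locked" pages before an assistant does.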

Frequently Asked Questions

Can llms.txt override robots.txt?

No. If robots.txt disallows a path, a compliant bot will not fetch it, no matter what llms.txt says. Access rules always win; llms.txt operates entirely within whatever access you have granted.

Should llms.txt be listed in robots.txt?

There is no formal mechanism for it (robots.txt has a Sitemap: directive, but nothing for llms.txt). Just make sure robots.txt does not disallow /llms.txt itself — that is the failure mode to avoid.

Do I need a cookie-cutter AI block in robots.txt?

Copy-pasted "block all AI bots" lists are popular but blunt. They usually lump training crawlers together with the live fetchers that answer customer questions about you. Decide per category — training, search indexing, on-demand fetching — rather than per headline. Our checker shows the effect of your current rules per category.
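A per-category view makes the bluntness visible. This is a rough sketch of that kind of report, not a description of how any particular checker works: the category labels, access_by_category, and the sample blocklist are assumptions, and the bot names are the ones used earlier in this guide.

```python
from urllib.robotparser import RobotFileParser

# Category labels are this sketch's own naming, not an official taxonomy.
AI_BOTS = {
    "training": ["GPTBot", "ClaudeBot"],
    "search": ["OAI-SearchBot", "Claude-SearchBot"],
    "live-fetch": ["ChatGPT-User", "Claude-User"],
}

# A hypothetical copy-pasted list that lumps live fetchers in with training bots.
BLUNT_BLOCKLIST = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: ChatGPT-User
User-agent: Claude-User
Disallow: /
"""


def access_by_category(robots_txt: str, path: str = "/") -> dict[str, dict[str, bool]]:
    """Map each category to {bot: may_fetch} for the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        category: {bot: parser.can_fetch(bot, path) for bot in bots}
        for category, bots in AI_BOTS.items()
    }


if __name__ == "__main__":
    for category, bots in access_by_category(BLUNT_BLOCKLIST).items():
        for bot, allowed in bots.items():
            print(f"{category:10} {bot:18} {'allowed' if allowed else 'blocked'}")
```

With this sample list, the training bots are blocked as intended, but so are the live fetchers that answer customer questions, which is the side effect worth deciding on deliberately.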

Which file matters more?

For control over how your content is used: robots.txt, without question. For how accurately AI assistants describe you when they do have access: that is what llms.txt is for — with the adoption caveats laid out in our honest look at the data.
