CCBot
CCBot builds Common Crawl's open web dataset, widely reused to train AI models. Here is what it does, its user-agent string, and how to opt out.
What is CCBot?
CCBot is the crawler of the Common Crawl Foundation, a non-profit that builds “an open repository of web crawl data that is universally accessible.” The dataset is used for research, but it is also one of the most widely reused sources of training data for AI models.
Crawlers in this category collect web content that ends up in AI training data. Blocking them keeps your content out of future crawls, and it costs you no traffic, because training crawlers never send visitors.
The CCBot user-agent string
This is the user-agent string Common Crawl documents for CCBot. You will see it in your server logs when the bot visits.
CCBot/2.0 (https://commoncrawl.org/faq/)
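To find these visits in your own logs, a minimal sketch along the following lines may help. It assumes each log line contains the raw user-agent string somewhere in it; the function name and the sample lines are hypothetical, and a match only shows that a request claims to be CCBot, not that it is genuine.

```python
import re

# Matches the product token "CCBot" followed by a version number, e.g. "CCBot/2.0".
# The word boundary keeps longer tokens such as "NotCCBot/2.0" from matching.
CCBOT_PATTERN = re.compile(r"\bCCBot/\d+(?:\.\d+)*", re.IGNORECASE)


def claims_ccbot(log_line: str) -> bool:
    """Return True if the log line contains a CCBot user-agent token."""
    return CCBOT_PATTERN.search(log_line) is not None


# Hypothetical access-log lines, for illustration only.
sample_lines = [
    '192.0.2.10 - - "GET / HTTP/1.1" 200 "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '192.0.2.11 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (X11; Linux x86_64)"',
]

ccbot_hits = [line for line in sample_lines if claims_ccbot(line)]
print(len(ccbot_hits), "request(s) claim to be CCBot")
```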
Common Crawl warns that some crawlers falsely identify themselves as CCBot. To confirm that a visit is genuine, it offers reverse-DNS verification: the real bot crawls from dedicated IP ranges that you can check the requesting address against.
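A reverse-DNS check of this kind is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it returns the same IP. The sketch below shows that general technique under stated assumptions. The function name is hypothetical, and the hostname suffix is a required parameter because this page does not state which domain Common Crawl uses; take the real value from Common Crawl's documentation.

```python
import socket
from typing import Callable, List


def _reverse(ip: str) -> str:
    # gethostbyaddr returns (hostname, aliases, addresses).
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    # gethostbyname_ex returns (hostname, aliases, IPv4 addresses); IPv4 only.
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(
    ip: str,
    expected_suffix: str,  # ASSUMPTION: supplied by you from the operator's docs
    reverse_lookup: Callable[[str], str] = _reverse,
    forward_lookup: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS: IP -> hostname -> same IP."""
    try:
        hostname = reverse_lookup(ip).lower().rstrip(".")
    except OSError:
        return False  # no reverse DNS record
    if not hostname.endswith(expected_suffix.lower()):
        return False  # hostname is outside the expected domain
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False  # hostname does not resolve
```

The lookup functions are injectable so the logic can be exercised without network access; in production the defaults call the system resolver.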
How do I block CCBot in robots.txt?
Add one of these snippets to the robots.txt file at the root of your domain. An explicit group for CCBot overrides your User-agent: * rules for this bot.
Block CCBot
Tells CCBot it may not access any page on your site.
User-agent: CCBot
Disallow: /
Allow CCBot
Explicitly allows CCBot, even when a broad Disallow rule blocks other bots.
User-agent: CCBot
Allow: /
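To check how groups like these interact with a User-agent: * group, a minimal sketch using Python's standard urllib.robotparser is shown here. The helper name, the sample robots.txt contents, and the example.com URL are illustrative assumptions, and Python's parser only approximates how CCBot itself reads the file.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Everyone is allowed except CCBot.
BLOCK_CCBOT = """\
User-agent: *
Allow: /

User-agent: CCBot
Disallow: /
"""

# Everyone is blocked except CCBot.
ALLOW_ONLY_CCBOT = """\
User-agent: *
Disallow: /

User-agent: CCBot
Allow: /
"""

page = "https://example.com/page"
print(can_crawl(BLOCK_CCBOT, "CCBot", page))
print(can_crawl(ALLOW_ONLY_CCBOT, "CCBot", page))
```

In both cases the CCBot group takes precedence over the wildcard group for CCBot, while other bots fall back to the wildcard rules.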
Does CCBot respect robots.txt?
Compliance is implied rather than stated as policy. Common Crawl's opt-out instructions tell you to add a User-agent: CCBot / Disallow: / group to your robots.txt, which only works if CCBot honors it. There is no separate formal “we comply” sentence.
Should you block CCBot?
Blocking CCBot is worth considering because its open dataset is reused far beyond Common Crawl's own research mission — many AI models are trained on Common Crawl data. If keeping your content out of AI training corpora matters to you, blocking CCBot is one of the highest-leverage single rules you can add.
Official documentation
The facts on this page come from Common Crawl's CCBot page. Bot behavior changes — when in doubt, the operator's page is the source of truth:
https://commoncrawl.org/ccbot
What does your robots.txt say about CCBot?
Run your domain through our free checker to see whether CCBot — and 13 other AI crawlers — may access your site right now.