Question 1

What is CCBot?

Accepted Answer

CCBot is the crawler of the Common Crawl Foundation, a non-profit that builds “an open repository of web crawl data that is universally accessible.” The dataset is used for research, and it is also one of the most widely reused sources of training data for AI models.

Question 2

What is the CCBot user-agent string?

Accepted Answer

The documented user-agent string is: CCBot/2.0 (https://commoncrawl.org/faq/)
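
If you want to spot CCBot in request headers or access logs, a minimal Python sketch follows. It assumes that any User-Agent value containing the product token "CCBot" (with or without a version) should count as CCBot. Only the string above is documented, and the helper name is_ccbot is hypothetical. User-Agent values can be spoofed, so treat a match as a hint, not proof.

```python
import re

# Assumption: match the "CCBot" product token with an optional version,
# not only the exact documented string "CCBot/2.0 (https://commoncrawl.org/faq/)".
CCBOT_PATTERN = re.compile(r"\bCCBot(?:/\d+(?:\.\d+)*)?\b", re.IGNORECASE)


def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent header value names CCBot."""
    return bool(CCBOT_PATTERN.search(user_agent or ""))


if __name__ == "__main__":
    print(is_ccbot("CCBot/2.0 (https://commoncrawl.org/faq/)"))
    print(is_ccbot("Mozilla/5.0 (X11; Linux x86_64)"))
```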

Question 3

Does CCBot respect robots.txt?

Accepted Answer

Compliance is implied rather than stated as policy. Common Crawl's opt-out instructions tell site owners to add a "User-agent: CCBot" / "Disallow: /" group to robots.txt, and that opt-out only works if CCBot honors the file. There is no separate formal “we comply” statement.
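
To see how a compliant crawler would read such a group, you can evaluate a robots.txt with Python's standard urllib.robotparser. This is only a sketch: it shows how Python's parser interprets the rules, not how CCBot itself is implemented. The function name ccbot_allowed, the EXAMPLE_ROBOTS text, and the example.com URLs are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: CCBot is disallowed, every other crawler is allowed.
EXAMPLE_ROBOTS = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets the CCBot token fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    print(ccbot_allowed(EXAMPLE_ROBOTS, "https://example.com/page"))
```

With the example file above, the check reports that CCBot may not fetch the page, while a file that contains only a wildcard Allow group reports that it may.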

Question 4

How do I block CCBot in robots.txt?

Accepted Answer

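Add this two-line group to your robots.txt: "User-agent: CCBot" on the first line, followed by "Disallow: /" on the next. To explicitly allow CCBot instead, keep the "User-agent: CCBot" line and replace the second line with "Allow: /".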
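
If you generate robots.txt from a script, the two variants can be sketched in Python as below. The function names ccbot_group and append_ccbot_group are hypothetical. The sketch also assumes the existing file has no CCBot group yet, because it does not check for duplicates.

```python
def ccbot_group(block: bool = True) -> str:
    """Build the CCBot group: Disallow everything, or Allow everything."""
    rule = "Disallow: /" if block else "Allow: /"
    return f"User-agent: CCBot\n{rule}\n"


def append_ccbot_group(robots_txt: str, block: bool = True) -> str:
    """Append the CCBot group to existing robots.txt text, separated by a blank line."""
    existing = robots_txt.rstrip("\n")
    group = ccbot_group(block)
    if not existing:
        return group
    return f"{existing}\n\n{group}"


if __name__ == "__main__":
    print(append_ccbot_group("User-agent: *\nAllow: /\n"))
```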
