I was wrong about robots.txt
-
Googlebot, if allowed, won't just list you for search, but will also scrape your content for Google's AI.
False.
See here: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ If you have a source that says it's false, I'd be curious.
-
So, if I can add something here for everyone's benefit:
No search engine really obeys robots.txt
Their publicly acknowledged crawlers do, but they also run unacknowledged crawlers that ignore the file.
Google knows every inch of your site, allowed or not.
See, just because a search engine says it doesn't know, doesn't mean it hasn't crawled.
It just doesn't display the results, based on your settings. And allowing the public crawler might also have it feed their AI: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/
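On the "unacknowledged crawlers" point: a lot of the mystery traffic claiming to be Googlebot in server logs is actually impostors. Google does document a way to verify its acknowledged crawlers: reverse-DNS the IP, check the hostname is under googlebot.com or google.com, then forward-resolve and confirm it maps back to the same IP. A minimal sketch of that procedure (function names are my own, not from any library):

```python
import socket

# Suffixes Google documents for its crawler hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_host(hostname: str) -> bool:
    """Pure string check: does a reverse-DNS name fall under Google's crawler domains?"""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip: str) -> bool:
    """Full check (needs network): reverse lookup, suffix check, forward confirmation."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no reverse record at all
    if not is_google_host(hostname):
        return False  # hostname isn't Google's
    try:
        # Forward-resolve the claimed hostname; the original IP must be among its addresses,
        # otherwise the reverse record was spoofed.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

This only tells you a request genuinely came from one of Google's acknowledged crawlers; it says nothing about crawlers that don't identify themselves at all, which is the whole point of the comment above.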
-
What did he think a crawler is? Why was he surprised that not allowing companies to use his data led to them not using his data? Looks like he has another surprise coming when he notices that search engines no longer index his blog.
I feel like most casual users would not make the connection from "crawlers" to the link previews that they talk about in the article.
Sure, if you understand that robots.txt covers all robots, then it follows. But that is not how the general news media has been talking about robots.txt.
-
See here: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ If you have a source that says it's false, I'd be curious.
Ok. That quotes a tweet by Cloudflare's CEO. IDK what his qualifications are, but his conflict of interest is obvious enough. Real quality journalism there.
Here's Google technical documentation on its crawlers: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers
-
I feel like most casual users would not make the connection from "crawlers" to the link previews that they talk about in the article.
Sure, if you understand that robots.txt covers all robots, then it follows. But that is not how the general news media has been talking about robots.txt.
that is not how general news media has been talking about robots.txt.
Ahh, yes. I think there is a lesson there.
-
Ok. That quotes a tweet by Cloudflare's CEO. IDK what his qualifications are, but his conflict of interest is obvious enough. Real quality journalism there.
Here's Google technical documentation on its crawlers: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers
So what's the quote from the documentation that backs up your claim? The line "perform other product specific crawls" seems extremely vague by design.
-
So what's the quote from the documentation that backs up your claim? The line "perform other product specific crawls" seems extremely vague by design.
I'm not really sure what you are asking here. Did you notice that you can scroll down and see a list of their crawlers?
-
I'm not really sure what you are asking here. Did you notice that you can scroll down and see a list of their crawlers?
Nothing on this page seems to contradict the article. But if I simply missed the part that does, I'd be happy to learn.
-
Nothing on this page seems to contradict the article. But if I simply missed the part that does, I'd be happy to learn.
Look up what Googlebot does: no AI.
Want to know which crawlers do AI? Just search for "AI", or "training", or some such, or skim through. It's not long. Google-Extended collects training data. Note that Google-Extended is explicitly not used to rank pages.
Did that help?
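For anyone wanting to act on that distinction, the opt-out lives in robots.txt. Google-Extended is a control token you can target with a rule; a minimal sketch (the catch-all paths here are an assumption, adjust to taste):

```
# Opt out of AI training data collection via the Google-Extended token,
# while leaving search crawling alone.
User-agent: Google-Extended
Disallow: /

# Googlebot (search indexing) stays allowed.
User-agent: Googlebot
Allow: /
```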
-
Look up what Googlebot does: no AI.
Want to know which crawlers do AI? Just search for "AI", or "training", or some such, or skim through. It's not long. Google-Extended collects training data. Note that Google-Extended is explicitly not used to rank pages.
Did that help?
Look up what Googlebot does: no AI.
The page seems written to suggest as much, but it never explicitly says the other bots can't feed into some other sort of AI training. It would be in Google's interest to mislead users here.
Edit: I found a quote saying Googlebot does both in one: "Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent [...]" and I guess Cloudflare doesn't trust Google to abide by its own access controls. That seems sensible to me.