I was wrong about robots.txt
-
Googlebot, if allowed, won't just list you for search, but will also scrape your content for Google's AI.
False.
See here: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ If you have a source that says it's false, I'd be curious.
-
So, if I can add something here for everyone's benefit:
No search engine really obeys robots.txt
Their publicly acknowledged crawlers do, but they also run unacknowledged crawlers that ignore the file.
Google knows every inch of your site, allowed or not.
See, just because a search engine says it doesn't know, doesn't mean it hasn't crawled.
It just doesn't display the results, based on your settings. And allowing the public crawler might also have it feed their AI: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/
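On the "unacknowledged crawlers" point: a lot of the mystery traffic claiming to be Googlebot in server logs is actually impostors. Google does document a way to verify its acknowledged crawlers: reverse-DNS the IP, check the hostname is under googlebot.com or google.com, then forward-resolve and confirm it maps back to the same IP. A minimal sketch of that procedure (function names are my own, not from any library):

```python
import socket

# Suffixes Google documents for its crawler hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_host(hostname: str) -> bool:
    """Pure string check: does a reverse-DNS name fall under Google's crawler domains?"""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip: str) -> bool:
    """Full check (needs network): reverse lookup, suffix check, forward confirmation."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no reverse record at all
    if not is_google_host(hostname):
        return False  # hostname isn't Google's
    try:
        # Forward-resolve the claimed hostname; the original IP must be among its addresses,
        # otherwise the reverse record was spoofed.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

This only tells you a request genuinely came from one of Google's acknowledged crawlers; it says nothing about crawlers that don't identify themselves at all, which is the whole point of the comment above.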
-
What did he think a crawler is? Why was he surprised that not allowing companies to use his data led to them not using his data? Looks like he has another surprise coming when he notices that search engines no longer index his blog.
I feel like most casual users would not make the connection from "crawlers" to the link previews that they talk about in the article.
Sure, if you understand that robots.txt covers all robots, then it follows. But that is not how the general news media has been talking about robots.txt.
-
See here: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ If you have a source that says it's false, I'd be curious.
Ok. That quotes a tweet by Cloudflare's CEO. IDK what his qualifications are, but his conflict of interest is obvious enough. Real quality journalism there.
Here's Google technical documentation on its crawlers: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers
-
I feel like most casual users would not make the connection from "crawlers" to the link previews that they talk about in the article.
Sure, if you understand that robots.txt covers all robots, then it follows. But that is not how the general news media has been talking about robots.txt.
that is not how general news media has been talking about robots.txt.
Ahh, yes. I think there is a lesson there.
-
Ok. That quotes a tweet by Cloudflare's CEO. IDK what his qualifications are, but his conflict of interest is obvious enough. Real quality journalism there.
Here's Google technical documentation on its crawlers: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers
So what's the quote from the documentation that backs up your claim? The line "perform other product specific crawls" seems extremely vague by design.
-
So what's the quote from the documentation that backs up your claim? The line "perform other product specific crawls" seems extremely vague by design.
I'm not really sure what you are asking here. Did you notice that you can scroll down and see a list of their crawlers?
-
I'm not really sure what you are asking here. Did you notice that you can scroll down and see a list of their crawlers?
Nothing on this page seems to contradict the article. But if I simply missed the part that does, I'd be happy to learn.
-
Nothing on this page seems to contradict the article. But if I simply missed the part that does, I'd be happy to learn.
Look up what Googlebot does: no AI.
Want to know which crawlers do AI? Just search for "AI", or "training", or some such, or skim through. It's not long. Google-Extended collects training data. Note that Google-Extended is explicitly not used to rank pages.
Did that help?
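For anyone wanting to act on that distinction, the opt-out lives in robots.txt. Google-Extended is a control token you can target with a rule; a minimal sketch (the catch-all paths here are an assumption, adjust to taste):

```
# Opt out of AI training data collection via the Google-Extended token,
# while leaving search crawling alone.
User-agent: Google-Extended
Disallow: /

# Googlebot (search indexing) stays allowed.
User-agent: Googlebot
Allow: /
```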
-
Look up what Googlebot does: no AI.
Want to know which crawlers do AI? Just search for "AI", or "training", or some such, or skim through. It's not long. Google-Extended collects training data. Note that Google-Extended is explicitly not used to rank pages.
Did that help?
Look up what Googlebot does: no AI.
The page seems written to suggest as much, but it never explicitly says the other bots can't feed into some other sort of AI training. It would be in Google's interest to mislead users here.
Edit: I found a quote saying Googlebot does both in one: "Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent [...]" and I guess Cloudflare doesn't trust Google to abide by its own access controls. That seems sensible to me.