linux-nerds.org

I was wrong about robots.txt

Technology

17 posts 7 commenters 0 views

E ell1e@leminal.space

Often it is respected, but the resulting problem is that platforms lump their regular crawlers together with the questionable AI scraping, to blackmail websites into feeding AI.

For example, Googlebot if enabled won't just list you for search, but will also scrape your contents for Google's AI. Edit: see https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ as source. I imagine LinkedInBot, given it's Microsoft, will feed some other AI of theirs as well, on top of the previews.

Until regulation steps in to require AI bots to separately ask for crawling permission, or to actually get a proper license for reuse of the contents, this situation isn't going to improve.
general_effort@lemmy.world

wrote

#6

Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI.

False.
2 replies

0
cecilkorik@lemmy.ca

wrote

#7

Absolutely true. They'll buy the data they want from some shitty crawler running from some data broker in some far-flung and lawless part of the world, hallucinate the actual source, and pretend they had no idea their "data partner" wasn't respecting robots.txt if they have to, which they won't ever have to do because it's literally impossible to detect and prove and realistically unenforceable.

This is a company that removed its company motto of "Don't be evil" because it found it too "limiting". Don't be naive.
1 reply

2
general_effort@lemmy.world

wrote

#8

That's very different from what I called false.

What you describe may happen, but probably not as much as you think. Much of that stuff is just not that valuable. Some personal, colloquial writing is necessary, but Google already pays Reddit. Other stuff is better obtained from torrents or shadow libraries like Anna's Archive.
1 reply

1
ell1e@leminal.space

wrote, last edited by ell1e@leminal.space

#9

See here: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ If you have a source that says it's false, I'd be curious.
1 reply

3
T thedruid@lemmy.world

So, if I can add something here for everyone's benefit:

No search engine really obeys robots.txt.

Their publicly acknowledged crawlers do, but they have other crawlers that aren't publicly known and that ignore the file.

Google knows every inch of your site, allowed or not.

See, just because a search engine says it doesn't know a page doesn't mean it hasn't crawled it. It just doesn't display the results, based on your settings.
ell1e@leminal.space

wrote

#10

And allowing the public crawler might also have it feed their AI: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/
1 reply

2
G general_effort@lemmy.world

What did he think a crawler is? Why was he surprised that not allowing companies to use his data led to them not using his data? Looks like he has another surprise coming when he notices that search engines no longer index his blog.
archr@lemmy.world

wrote, last edited by archr@lemmy.world

#11

I feel like most casual users would not make the connection between "crawlers" and the link previews that they talk about in the article.

Sure, if you understand that robots.txt applies to all robots, then it's obvious. But that is not how general news media has been talking about robots.txt.
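
To make the "all robots" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.org` URL, the `SomeOtherBot` name and the exact `LinkedInBot` token spelling are assumptions for illustration. The check only shows what the file asks for, not whether a given crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every robot to stay out.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical variant: same blanket rule, plus an explicit exception
# for one link-preview fetcher (token spelling is an assumption).
BLOCK_ALL_EXCEPT_PREVIEWS = """\
User-agent: LinkedInBot
Allow: /

User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str,
              url: str = "https://example.org/article") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for name, rules in (("block all", BLOCK_ALL),
                    ("block all except previews", BLOCK_ALL_EXCEPT_PREVIEWS)):
    for agent in ("LinkedInBot", "SomeOtherBot"):
        print(f"{name}: {agent} allowed = {may_fetch(rules, agent)}")
```

Under these assumptions, the blanket rule also turns away the preview fetcher, while the second file lets it through and still asks every other bot to stay out.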
1 reply

2
general_effort@lemmy.world

wrote

#12

Ok. That quotes a tweet by Cloudflare's CEO. IDK what his qualifications are, but his conflict of interest is obvious enough. Real quality journalism there.

Here's Google's technical documentation on its crawlers: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers
1 reply

1
general_effort@lemmy.world

wrote

#13

that is not how general news media has been talking about robots.txt.

Ahh, yes. I think there is a lesson there.
1 reply

0
ell1e@leminal.space

wrote, last edited by ell1e@leminal.space

#14

So what's the quote from the documentation that backs up your claim? The line "perform other product specific crawls" seems extremely vague by design.
1 reply

1
general_effort@lemmy.world

wrote

#15

I'm not really sure what you are asking here. Did you notice that you can scroll down and see a list of their crawlers?
1 reply

1
ell1e@leminal.space

wrote, last edited by ell1e@leminal.space

#16

Nothing on this page seems to contradict the article. But if I simply missed the part that does, I'd be happy to learn.
1 reply

0
general_effort@lemmy.world

wrote

#17

You look up what Googlebot does. No AI.

You want to know which crawlers do AI? Just search for "AI", or "training", or some such, or skim through. It's not long. Google-Extended collects training data. Note that Google-Extended is explicitly not used to rank pages.

Did that help?
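
A sketch of what that separation looks like in a robots.txt, checked with Python's standard `urllib.robotparser`. `Googlebot` and `Google-Extended` are the tokens named above; the file contents and the `example.org` URL are made up for illustration, and Python's matching rules may differ in detail from Google's own parser. This only demonstrates what the file expresses per token. It says nothing about how crawled data is used afterwards, which is the point disputed in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: stay in search, opt out via the AI-related token.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""

URL = "https://example.org/blog/post"  # placeholder URL


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("Googlebot", "Google-Extended"):
    print(agent, "allowed:", allowed(ROBOTS_TXT, agent, URL))
```

With this file, the two tokens get different answers for the same URL, so a site can state the two preferences independently.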
1 reply

0
