linux-nerds.org

I was wrong about robots.txt

Technology

23 posts, 8 commenters, 4 views

karlheinzschwuke@feddit.org

wrote

#1

This post did not contain any content.

I was wrong about robots.txt

Recently, I wrote an article about my journey in learning about robots.txt and its implications on the data rights in regards to what I write in my blog. I was confident that I wanted to ban all the crawlers from my website. Turned out there was an unintended consequence that I did not account for.

My LinkedIn posts became broken

Ever since I changed my robots.txt file, I started seeing that my LinkedIn posts no longer had the preview of the article available. I was not sure what the issue was initially, since before then it used to work just fine. In addition to that, I have noticed that LinkedIn’s algorithm has started serving my posts to fewer and fewer connections. I was a bit confused by the issue, thinking that it might have been a temporary problem. But over the next two weeks the missing post previews did not appear.

Evgenii Pendragon (evgeniipendragon.com)
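
For readers who want to reproduce the effect described in the excerpt, here is a minimal sketch using Python's standard `urllib.robotparser`. It rests on several assumptions: the blog's robots.txt was a blanket `Disallow: /` (the excerpt does not show the actual file), LinkedIn's preview fetcher identifies itself as `LinkedInBot`, and it honors robots.txt. The domain `example.com` and the helper `is_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed "ban everything" file; the article excerpt does not show the real one.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Same ban, plus an explicit exception for the (assumed) preview bot token.
ALLOW_PREVIEWS = """\
User-agent: LinkedInBot
Allow: /

User-agent: *
Disallow: /
"""

URL = "https://example.com/blog/post"  # hypothetical page


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print("blanket ban, LinkedInBot:", is_allowed(BLOCK_ALL, "LinkedInBot", URL))
print("with exception, LinkedInBot:", is_allowed(ALLOW_PREVIEWS, "LinkedInBot", URL))
print("with exception, SomeOtherBot:", is_allowed(ALLOW_PREVIEWS, "SomeOtherBot", URL))
```

The sketch only models how robots.txt rules are matched. It says nothing about what a platform does with a page once it is allowed to fetch it.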
3 replies

82
ineedmana@piefed.zip

wrote in reply to #1

#2

Huh. So in this case, the file actually is respected. Refreshing.
2 replies

18
ell1e@leminal.space

wrote in reply to #2 (last edited by ell1e@leminal.space)

#3

Often it is respected, but the resulting problem is that platforms bundle useful crawling with questionable AI scraping, which effectively blackmails websites into feeding AI.

For example, Googlebot, if enabled, won't just list you for search, but will also scrape your contents for Google's AI. Edit: see https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ as source. I imagine LinkedInBot, given it's Microsoft, will feed some other AI of theirs as well on top of the previews.

Until regulation steps in to require AI bots to separately ask for crawling permission, or to actually get a proper license for reuse of the contents, this situation isn't going to improve.
1 reply

25
thedruid@lemmy.world

wrote in reply to #1

#4

So, if I can add something here for everyone's benefit:

No search engine really obeys robots.txt.

Their publicly acknowledged crawlers do, but they have other crawlers that aren't publicly known and that ignore the file.

Google knows every inch of your site, allowed or not.

See, just because a search engine says it doesn't know doesn't mean it hasn't crawled.
It just doesn't display the results, based on your settings.
1 reply

34
general_effort@lemmy.world

wrote in reply to #1

#5

What did he think a crawler is? Why was he surprised that not allowing companies to use his data led to them not using his data? Looks like he has another surprise coming when he notices that search engines no longer index his blog.
1 reply

73
general_effort@lemmy.world

wrote in reply to #3

#6

Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI.

False.
2 replies

0
cecilkorik@lemmy.ca

wrote in reply to #6

#7

Absolutely true. They'll buy the data they want from some shitty crawler run by some data broker in some far-flung and lawless part of the world, hallucinate the actual source, and, if they have to, pretend they had no idea their "data partner" wasn't respecting robots.txt. They won't ever have to, because it's practically impossible to detect and prove, and realistically unenforceable.

This is a company that removed its company motto of "Don't be evil" because it found it too "limiting". Don't be naive.
1 reply

2
general_effort@lemmy.world

wrote in reply to #7

#8

That's very different from what I called false.

What you describe may happen, but probably not as much as you think. Much of that stuff is just not that valuable. Some personal, colloquial writing is necessary, but Google already pays Reddit. Other stuff is better obtained from torrents or shadow libraries like Anna's Archive.
1 reply

1
ell1e@leminal.space

wrote in reply to #6 (last edited by ell1e@leminal.space)

#9

See here: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ If you have a source that says it's false, I'd be curious.
1 reply

6
ell1e@leminal.space

wrote in reply to #4

#10

And allowing the public crawler might also have it feed their AI: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/
1 reply

6
archr@lemmy.world

wrote in reply to #5 (last edited by archr@lemmy.world)

#11

I feel like most casual users would not make the connection between "crawlers" and the link previews that the article talks about.

Sure, if you understand that robots.txt covers all robots, then it's obvious. But that is not how general news media has been talking about robots.txt.
1 reply

14
general_effort@lemmy.world

wrote in reply to #9 (last edited by general_effort@lemmy.world)

#12

Ok. That quotes a tweet by Cloudflare's CEO. IDK what his qualifications are, but his conflict of interest is obvious enough. Real quality journalism there.

ETA: I looked at what the Cloudflare CEO said again. To be fair to him, he is not actually claiming that Googlebot collects AI training data. He's talking about the AI overview, which is a search feature. The data for search features is collected by Googlebot. I'm not sure why someone would want their link listed in search but not appear much more prominently in the AI overview.

Here's Google's technical documentation on its crawlers: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers
1 reply

1
general_effort@lemmy.world

wrote in reply to #11

#13

that is not how general news media has been talking about robots.txt.

Ahh, yes. I think there is a lesson there.
1 reply

4
ell1e@leminal.space

wrote in reply to #12 (last edited by ell1e@leminal.space)

#14

So what's the quote from the documentation that backs up your claim? The line "perform other product specific crawls" seems extremely vague by design.
1 reply

2
general_effort@lemmy.world

wrote in reply to #14

#15

I'm not really sure what you are asking here. Did you notice that you can scroll down and see a list of their crawlers?
1 reply

1
ell1e@leminal.space

wrote in reply to #15 (last edited by ell1e@leminal.space)

#16

Nothing on this page seems to contradict the article. But if I simply missed the part that does, I'd be happy to learn.
1 reply

1
general_effort@lemmy.world

wrote in reply to #16

#17

You look up what Googlebot does. No AI.

You want to know which crawlers do AI? Just search the page for "AI", or "training", or some such, or skim through. It's not long. Google-Extended collects training data. Note that Google-Extended is explicitly not used to rank pages.

Did that help?
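
The distinction drawn in this post can be written down as a robots.txt file. The sketch below is only an illustration under stated assumptions: `Google-Extended` is treated as a robots.txt token, as the Google documentation linked in this thread describes it, while `example.com` and the helper `report` are made up. Python's `urllib.robotparser` models rule matching only, not what Google actually does with fetched pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: refuse the training token, allow everything else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def report(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each agent token to whether the rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = report(
    ROBOTS_TXT,
    ["Googlebot", "Google-Extended"],
    "https://example.com/blog/post",  # hypothetical page
)
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Under these rules search crawling stays allowed while the training token is refused. Whether that is enough depends on how much one trusts the crawler operator to honor the token, which is exactly what the rest of the thread argues about.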
1 reply

1
ell1e@leminal.space

wrote in reply to #17 (last edited by ell1e@leminal.space)

#18

You look up what Googlebot does. No AI.

The page seems written to suggest that, but it doesn't explicitly say the other bots can't feed into some other sort of AI training. It would be in Google's interest to mislead users here.

Edit: I found a quote where it says Googlebot does both in one: "Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent [...]" and I guess Cloudflare doesn't trust Google to abide by the access controls. That seems sensible to me. Edit 2: What exactly the CEO believes was, perhaps rightfully, disputed below; it was just my guess.
1 reply

1
general_effort@lemmy.world

wrote in reply to #18

#19

It would be a lot to write, if you had to say what something does not do rather than what it does.

I looked at what the Cloudflare CEO said again. To be fair to him, he is not actually backing you up. He's saying that Google makes no distinction between the AI overview and the other search results. That is true. The AI overview is a search feature. I'm not sure why someone would want their link listed in search but not appear much more prominently in the AI overview.
1 reply

0
ell1e@leminal.space

wrote in reply to #19 (last edited by ell1e@leminal.space)

#20

But the article later does back it up: "Although Cloudflare singled out Google, other search engines that view AI search features as part of their search products also use the same bots for training as they do for search indexing."

In any case, I'm okay with admitting that neither you nor I can look inside Google to see what they're doing. But the claims are out there; I didn't make them up, whether they're true or not. Thank you for the certainly interesting Google crawler info link.
1 reply

0
