linux-nerds.org

Codeberg: an army of AI crawlers is slowing us down severely; AI crawlers have learned how to solve the Anubis challenges.

Technology

77 posts, 57 commenters, 0 views

pro@programming.dev

#1

cross-posted from: https://programming.dev/post/35852706

Source.

0x0@lemmy.zip

#2 (in reply to #1)

It's always a cat-n-mouse game.

philipthebucket@piefed.social

#3 (in reply to #1)

I feel like at some point it needs to be active response. Phase 1 is a teergrube type of slowness to muck up the crawlers, with warnings in the headers and response body, and then phase 2 is a DDoS in response or maybe just a drone strike and cut out the middleman. Once you're actively evading Anubis, fuckin' game on.
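
A minimal sketch of the phase 1 idea only, under stated assumptions: the `drip` helper, the `X-Crawler-Warning` header name, and the warning text are all made up for illustration, and nothing here reflects how Anubis or Codeberg actually work. It shows how a tarpit can trickle a response out a few bytes at a time.

```python
import time
from typing import Iterator

# Hypothetical header name and warning text, invented for this sketch.
WARNING_HEADERS = {"X-Crawler-Warning": "Automated crawling is not permitted here."}
WARNING_BODY = b"Automated crawling is not permitted here. " * 20


def drip(body: bytes, chunk_size: int = 8, delay_seconds: float = 2.0) -> Iterator[bytes]:
    """Yield the body in small chunks, sleeping before each chunk is sent."""
    for start in range(0, len(body), chunk_size):
        time.sleep(delay_seconds)
        yield body[start:start + chunk_size]
```

A WSGI application may return any iterable of byte strings, so a handler for suspected crawlers could return `drip(WARNING_BODY)` as its response. Each connection then costs the client roughly `len(body) / chunk_size * delay_seconds` seconds, while the server does almost no work.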

traches@sh.itjust.works

#4 (in reply to #3)

These crawlers come from random people’s devices via shady apps. Each request comes from a different IP.

underpantsweevil@lemmy.world

#5 (in reply to #1)

I mean, we really have to ask ourselves - as a civilization - whether human collaboration is more important than AI data harvesting.

kyrgizion@lemmy.world

#6 (in reply to #1)

Eventually we'll have "defensive" and "offensive" LLMs managing all kinds of electronic warfare automatically, effectively nullifying each other.

xxce2aab@feddit.dk

#7 (in reply to #1)

If this isn't fertile ground for a massive class-action lawsuit, I don't know what would be.

turbowafflz@lemmy.world

#8 (in reply to #3)

I think the best thing to do is not to block them when they're detected but to poison them instead. Feed them tons of text generated by tiny old language models; it's harder to detect, and it also messes up their training and makes the models less reliable. Of course, you would want to do that on a separate server so it doesn't slow down real users, but you probably don't need much power, since the scrapers probably don't really care about the speed.

sufferingsteve@feddit.nu

#9 (in reply to #1)

There once was a dream of the semantic web, also known as web2. The semantic web could have made the information on webpages easy to ingest, removing so much of the computation required to get at it and thus preventing much of the CPU overhead of AI crawling.

What we got as web2 instead was social media, destroying facts and making people depressed at a never-before-seen rate.

Web3 was about enabling us to securely transfer value between people digitally and without middlemen.

What crypto gave us was fraud, expensive JPEGs and scams. The term web is now so eroded that it has lost much of its meaning. The information age gave way to the misinformation age, where everything is fake.

devfuuu@lemmy.world

#10 (in reply to #5)

I think every company in the world has been telling everyone for a few months now that what matters is AI data harvesting. There's not even a hint of it being a question. You either accept the AI overlords or get off the internet. Our ONLY purpose is to feed the machine; anything else is irrelevant. Play along or you shall be removed.

philipthebucket@piefed.social

#11 (in reply to #4)

Is that really true? I guess I have no reason to doubt it, I just hadn't heard it before.

sailorzoop@lemmy.librebun.com

#12 (in reply to #1)

I'm ashamed to say that I switched my DNS nameservers to CF just for their anti-crawler service.
Knowing Cloudflare, god knows how much longer it'll be free.

prodigalfrog@slrpnk.net

#13 (in reply to #6)

That's actually a major plot point in Cyberpunk 2077. There are thousands of rogue AIs on the net that are constantly bombarding a giant firewall protecting the main net and everything connected to it from being taken over by the AIs.

nuxcom_90percent@lemmy.zip

#14 (in reply to #3)

Yes. A nonprofit organization in Germany is going to be launching drone strikes globally. That is totally a better world.

It's also important to understand that a significant chunk of these botnets are just normal people with viruses/compromised machines. And the fastest way to launch a DDoS attack is to... rent the same botnet from the same blackhat org to attack itself. And while that would be funny, I would also rather the orgs I donate to not give that money to blackhat orgs. But that is just me.

tin@feddit.uk

#15 (in reply to #3)

Wasn't this called black ice in Neuromancer? Security systems that actively tried to harm the hacker?

xthexder@l.sw0.com

#16 (in reply to #8)

I love catching bots in tarpits; it's actually quite fun.

ambitiousprocess@piefed.social

#17 (in reply to #4)

Most of these AI crawlers are run by major corporations operating out of datacenters with known IP ranges, which is why sites like Codeberg block by IP range. That's why Codeberg's response mentions that the crawling stopped after they fixed a configuration issue that had applied those IP range blocks only on non-Anubis routes.

For example, OpenAI publishes a list of IP ranges that its crawlers can come from, and also lists the user agent for each bot.

Perplexity also publishes IP ranges, but Cloudflare later found it bypassing no-crawl directives with undeclared crawlers. Those crawlers did use different IPs, but not from "shady apps." Instead, they would simply rotate ASNs and request a new IP.

They do this because it is still legal for them to do so. Rotating ASNs, and IPs within an ASN, is not a crime. However, maliciously using apps installed on people's devices to route network traffic the owners are unaware of is. It would also carry much higher latency, and could even allow for man-in-the-middle attacks, which they clearly don't want.
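
A rough sketch of what an IP range block amounts to, using Python's standard `ipaddress` module. The ranges below are placeholders taken from the address blocks reserved for documentation; they are not real crawler ranges, and `BLOCKED_RANGES` and `is_blocked` are names invented for this example.

```python
import ipaddress

# Placeholder ranges from the documentation-reserved blocks (RFC 5737, RFC 3849).
# These are NOT real crawler ranges; a real list would come from what the operator publishes.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any of the blocked ranges."""
    address = ipaddress.ip_address(client_ip)
    # Membership tests across IP versions simply evaluate to False.
    return any(address in network for network in BLOCKED_RANGES)
```

A check like this only helps if it runs on every route. As described above, the configuration issue was that the range blocks were not applied to the routes behind Anubis.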

dreadbeef@lemmy.dbzer0.com

#18 (in reply to #7)

Who's the defendant, specifically?

philipthebucket@piefed.social

#19 (in reply to #17)

Honestly, man, I get what you're saying, but also at some point all that stuff just becomes someone else's problem.

This is what people forget about the social contract: It goes both ways, it was an agreement for the benefit of all. The old way was that if you had a problem with someone, you showed up at their house with a bat / with some friends. That wasn't really the way, and so we arrived at this deal where no one had to do that, but then people always start to fuck over other people involved in the system thinking that that "no one will show up at my place with a bat, whatever I do" arrangement is a law of nature. It's not.

marshezezz@lemmy.blahaj.zone

#20 (in reply to #9)

Capitalism is grand, innit. Wait, not grand, I meant to say cancer