Codeberg: an army of AI crawlers is slowing us down severely; the AI crawlers have learned how to solve the Anubis challenges.
-
cross-posted from: https://programming.dev/post/35852706
If this isn't fertile ground for a massive class-action lawsuit, I don't know what would be.
-
I feel like at some point it needs to be an active response. Phase 1 is teergrube-style slowness to muck up the crawlers, with warnings in the headers and response body, and then phase 2 is a DDoS in response, or maybe just a drone strike to cut out the middleman. Once you're actively evading Anubis, fuckin' game on.
I think the best thing to do is not to block them when they're detected but to poison them instead. Feed them tons of text generated by tiny old language models; it's harder to detect, and it also messes up their training and makes the models less reliable. Of course, you'd want to do that on a separate server so it doesn't slow down real users, but you probably don't need much power, since the scrapers probably don't really care about speed.
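A minimal sketch of that poisoning idea in Python, with a simple Markov chain standing in for the "tiny old language model" (the seed text and all names here are purely illustrative, not anyone's actual setup):

```python
import random

# A Markov chain stands in here for the "tiny old language model";
# the seed text is purely illustrative.
SEED_TEXT = (
    "the quick brown fox jumps over the lazy dog while the dog "
    "watches the fox jump over the quick brown log near the river"
)

def build_chain(text):
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    chain = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

def babble(chain, length=50, seed=None):
    """Generate `length` words of statistically plausible nonsense
    to feed a detected crawler instead of real content."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        # Fall back to a random word when the current one has no successors.
        word = rng.choice(chain.get(word) or list(chain))
        out.append(word)
    return " ".join(out)
```

You'd route only flagged bot traffic to an endpoint serving `babble()` output, ideally from a cheap separate box as described above, so real users never touch it.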
-
cross-posted from: https://programming.dev/post/35852706
There once was a dream of the semantic web, also known as web2. The semantic web could have enabled easy-to-ingest information on webpages, removing so much of the computation required to get at the information, and thus preventing much of the AI-crawling CPU overhead.
What we got as web2 instead was social media, destroying facts and making people depressed at a never-before-seen rate.
Web3 was about enabling us to securely transfer value between people digitally and without middlemen.
What crypto gave us was fraud, expensive JPEGs, and scams. The term "web" is now so eroded that it has lost much of its meaning. The information age gave way to the misinformation age, where everything is fake.
-
I mean, we really have to ask ourselves - as a civilization - whether human collaboration is more important than AI data harvesting.
I think every company in the world has been telling everyone for a few months now that what matters is AI data harvesting. There's not even a hint of it being a question. You either accept the AI overlords or get off the internet. Our ONLY purpose is to feed the machine; anything else is irrelevant. Play along or you shall be removed.
-
These crawlers come from random people’s devices via shady apps. Each request comes from a different IP
Is that really true? I guess I have no reason to doubt it, I just hadn't heard it before.
-
cross-posted from: https://programming.dev/post/35852706
I'm ashamed to say that I switched my DNS nameservers to CF just for their anti-crawler service.
Knowing Cloudflare, god knows how much longer it'll stay free.
-
Eventually we'll have "defensive" and "offensive" LLMs managing all kinds of electronic warfare automatically, effectively nullifying each other.
That's actually a major plot point in Cyberpunk 2077. There are thousands of rogue AIs on the net that are constantly bombarding a giant firewall protecting the main net, and everything connected to it, from being taken over by the AIs.
-
I feel like at some point it needs to be an active response. Phase 1 is teergrube-style slowness to muck up the crawlers, with warnings in the headers and response body, and then phase 2 is a DDoS in response, or maybe just a drone strike to cut out the middleman. Once you're actively evading Anubis, fuckin' game on.
Yes. A nonprofit organization in Germany is going to be launching drone strikes globally. That is totally a better world.
It's also important to understand that a significant chunk of these botnets is just normal people with viruses/compromised machines. And the fastest way to launch a DDoS attack is to... rent the same botnet from the same blackhat org to attack itself. And while that would be funny, I'd also rather the orgs I donate to not give that money to blackhat orgs. But that's just me.
-
I feel like at some point it needs to be an active response. Phase 1 is teergrube-style slowness to muck up the crawlers, with warnings in the headers and response body, and then phase 2 is a DDoS in response, or maybe just a drone strike to cut out the middleman. Once you're actively evading Anubis, fuckin' game on.
Wasn't this called black ice in Neuromancer? Security systems that actively tried to harm the hacker?
-
I think the best thing to do is not to block them when they're detected but to poison them instead. Feed them tons of text generated by tiny old language models; it's harder to detect, and it also messes up their training and makes the models less reliable. Of course, you'd want to do that on a separate server so it doesn't slow down real users, but you probably don't need much power, since the scrapers probably don't really care about speed.
I love catching bots in tarpits; it's actually quite fun.
-
These crawlers come from random people’s devices via shady apps. Each request comes from a different IP
Most of these AI crawlers are from major corporations operating out of datacenters with known IP ranges, which is why sites do IP-range blocks. That's why, in Codeberg's response, they mention that the crawling stopped once they fixed the configuration issue that had only been blocking those IP ranges on non-Anubis routes.
For example, OpenAI publishes a list of IP ranges its crawlers can come from, and also documents the user agent for each bot.
Perplexity also publishes IP ranges, but Cloudflare later caught them bypassing no-crawl directives with undeclared crawlers. They did use different IPs, but not from "shady apps": they would simply rotate ASNs and request new IPs.
The reason they do it this way is that it's still legal. Rotating ASNs, and IPs within an ASN, is not a crime. Maliciously using apps installed on people's devices to route network traffic they're unaware of, however, is. It also carries much higher latency and could even expose the traffic to man-in-the-middle attacks, which they clearly don't want.
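Blocking declared crawler ranges is straightforward once you have a published list. A sketch using Python's stdlib `ipaddress` module, with placeholder TEST-NET ranges rather than any vendor's real ones:

```python
import ipaddress

# Placeholder TEST-NET ranges for illustration only; a real deployment
# would load the ranges each vendor publishes for its declared crawlers.
CRAWLER_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_declared_crawler(ip_str):
    """True if the request IP falls inside any published crawler range."""
    ip = ipaddress.ip_address(ip_str)
    return any(ip in net for net in CRAWLER_RANGES)
```

As the comment above notes, this only catches crawlers that stay inside their declared ranges; ASN-rotating undeclared crawlers sail straight past it.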
-
If this isn't fertile ground for a massive class-action lawsuit, I don't know what would be.
Who's the defendant, specifically?
-
Most of these AI crawlers are from major corporations operating out of datacenters with known IP ranges, which is why sites do IP-range blocks. That's why, in Codeberg's response, they mention that the crawling stopped once they fixed the configuration issue that had only been blocking those IP ranges on non-Anubis routes.
For example, OpenAI publishes a list of IP ranges its crawlers can come from, and also documents the user agent for each bot.
Perplexity also publishes IP ranges, but Cloudflare later caught them bypassing no-crawl directives with undeclared crawlers. They did use different IPs, but not from "shady apps": they would simply rotate ASNs and request new IPs.
The reason they do it this way is that it's still legal. Rotating ASNs, and IPs within an ASN, is not a crime. Maliciously using apps installed on people's devices to route network traffic they're unaware of, however, is. It also carries much higher latency and could even expose the traffic to man-in-the-middle attacks, which they clearly don't want.
Honestly, man, I get what you're saying, but also at some point all that stuff just becomes someone else's problem.
This is what people forget about the social contract: it goes both ways; it was an agreement for the benefit of all. The old way was that if you had a problem with someone, you showed up at their house with a bat and some friends. That wasn't really the way to live, so we arrived at this deal where no one had to do that. But then people start to fuck over everyone else in the system, thinking that the "no one will show up at my place with a bat, whatever I do" arrangement is a law of nature. It's not.
-
There once was a dream of the semantic web, also known as web2. The semantic web could have enabled easy-to-ingest information on webpages, removing so much of the computation required to get at the information, and thus preventing much of the AI-crawling CPU overhead.
What we got as web2 instead was social media, destroying facts and making people depressed at a never-before-seen rate.
Web3 was about enabling us to securely transfer value between people digitally and without middlemen.
What crypto gave us was fraud, expensive JPEGs, and scams. The term "web" is now so eroded that it has lost much of its meaning. The information age gave way to the misinformation age, where everything is fake.
Capitalism is grand, innit. Wait, not grand, I meant to say cancer
-
Yes. A nonprofit organization in Germany is going to be launching drone strikes globally. That is totally a better world.
It's also important to understand that a significant chunk of these botnets is just normal people with viruses/compromised machines. And the fastest way to launch a DDoS attack is to... rent the same botnet from the same blackhat org to attack itself. And while that would be funny, I'd also rather the orgs I donate to not give that money to blackhat orgs. But that's just me.
-
I think the best thing to do is not to block them when they're detected but to poison them instead. Feed them tons of text generated by tiny old language models; it's harder to detect, and it also messes up their training and makes the models less reliable. Of course, you'd want to do that on a separate server so it doesn't slow down real users, but you probably don't need much power, since the scrapers probably don't really care about speed.
Some guy also used zip bombs against AI crawlers; I don't know if it still works. Link to the lemmy post
-
cross-posted from: https://programming.dev/post/35852706
Anubis isn't supposed to be hard to avoid, but expensive to avoid. Not really surprised that a big company might be willing to throw a bunch of cash at it.
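For context, challenges of this kind are hashcash-style proof-of-work: the client must burn CPU to find a nonce, and the server verifies it with a single hash. A toy sketch of the general scheme (not Anubis's actual implementation):

```python
import hashlib

def solve(challenge, difficulty=4):
    """Client side: brute-force a nonce so sha256(challenge + nonce)
    starts with `difficulty` hex zeroes. Cheap for one human page view,
    ruinous at millions-of-pages crawler scale."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

def verify(challenge, nonce, difficulty=4):
    """Server side: a single hash checks what cost the client thousands."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the design: each extra leading zero multiplies the client's expected work by 16 while the server's verification cost stays one hash, which is exactly why the defense is "expensive to avoid" rather than "hard to avoid".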
-
Who's the defendant, specifically?
No, that's a good point. We all bloody well know there isn't a single LLM provider that isn't sucking the entire Internet dry while gleefully ignoring robots.txt and expecting everybody else to pay the bill on their behalf, and the AI providers are getting really good at using other people's IPs both to mask their identity and to evade blacklists, which is yet another abusive behavior.
But that's beside your point. So forget the class-action lawsuit in favor of the relevant ombudsman.
Either way, this cannot go on. Donation-driven open-source projects are being driven into the ground by exploding bandwidth and hosting costs, and people are being forced to deploy tools like Anubis that eat additional resources, including the resources of every legitimate user. The cumulative damage this is doing is no joke.
-
That's actually a major plot point in Cyberpunk 2077. There are thousands of rogue AIs on the net that are constantly bombarding a giant firewall protecting the main net, and everything connected to it, from being taken over by the AIs.
The game is an excellent documentary.
-
That's actually a major plot point in Cyberpunk 2077. There are thousands of rogue AIs on the net that are constantly bombarding a giant firewall protecting the main net, and everything connected to it, from being taken over by the AIs.
Unrelated, but I saw this headline, and could hear both you and squidward swearing from here.