The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall
-
This post did not contain any content.
I've developed my own agent for assisting me with researching a topic I'm passionate about, and I ran into the exact same barrier: Cloudflare intercepts my request and is clearly checking if I'm a human using a web browser. (For my network requests, I've defined my own user agent.)
So I use that as a signal that the website doesn't want automated tools scraping their data. That's fine with me: my agent just tells me that there might be interesting content on the site and gives me a deep link. I can extract the data and carry on my research on my own.
I completely understand where Perplexity is coming from, but at scale, implementations like Perplexity's are awful for the web. (Edited for clarity)
-
They can use web.archive.org as a CDN (I do that with Cloudflare-fronted websites). But honestly, Cloudflare or not, the internet is broken.
Can you explain, please? How can I use archive.org as a CDN for my website?
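For what it's worth, the trick usually works like this (a sketch, not an endorsement): the Wayback Machine serves captures at predictable URLs, so you can point heavy static assets at the archived copy instead of your origin. Caveats: the archive must actually have a capture of the asset, and you're shifting bandwidth costs onto the Internet Archive, which they may not appreciate.

```python
def wayback_url(original_url: str, timestamp: str = "2") -> str:
    """Build a Wayback Machine URL for a cached copy of a resource.

    Using "2" as the timestamp is a commonly observed shortcut that makes
    the archive redirect to the most recent capture it holds; it is a
    convention, not a documented API guarantee.
    """
    return f"https://web.archive.org/web/{timestamp}/{original_url}"

print(wayback_url("https://example.com/assets/big-image.png"))
# https://web.archive.org/web/2/https://example.com/assets/big-image.png
```

You would then rewrite your HTML to reference those URLs for large assets, keeping only the pages themselves on your origin.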
-
yeah. still not worth dealing with fucking cloudflare. fuck cloudflare.
I'm out of the loop, what's wrong with Cloudflare?
-
Or find a more efficient way to manage data, since their current approach is basically DDoSing the internet for training data and also for responding to user interactions.
This is not about training data, though.
Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.
I think the solution is quite clear, though: either use the user's identity to waltz through the blocks, or even use the user's browser to do it. Once a captcha appears, let the user solve it.
Technically, though, making all of this happen flawlessly is quite a big task.
-
This post did not contain any content.
Uh, are they admitting they are trying to circumvent technological protections set up to restrict access to a system?
Isn’t that a literal computer crime?
-
I've developed my own agent for assisting me with researching a topic I'm passionate about, and I ran into the exact same barrier: Cloudflare intercepts my request and is clearly checking if I'm a human using a web browser. (For my network requests, I've defined my own user agent.)
So I use that as a signal that the website doesn't want automated tools scraping their data. That's fine with me: my agent just tells me that there might be interesting content on the site and gives me a deep link. I can extract the data and carry on my research on my own.
I completely understand where Perplexity is coming from, but at scale, implementations like Perplexity's are awful for the web. (Edited for clarity)
I hate to break it to you but not only does Cloudflare do this sort of thing, but so does Akamai, AWS, and virtually every other CDN provider out there. And far from being awful, it’s actually protecting the web.
We use Akamai where I work, and they inform us in real time when a request comes from a bot, and they further classify it as one of a dozen or so bots (search engine crawlers, analytics bots, advertising bots, social networks, AI bots, etc). It also informs us if it’s somebody impersonating a well known bot like Google, etc. So we can easily allow search engines to crawl our site while blocking AI bots, bots impersonating Google, and so on.
-
This is not about training data, though.
Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.
I think the solution is quite clear, though: either use the user's identity to waltz through the blocks, or even use the user's browser to do it. Once a captcha appears, let the user solve it.
Technically, though, making all of this happen flawlessly is quite a big task.
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.
They are one of the sources!
The AI scraping that happens when a user enters a prompt is DDoSing sites, in addition to the scraping for training data that is DDoSing sites. These shitty companies are repeatedly slamming the same sites over and over again in the least efficient way, because they don't use the scraped training data when they process a user prompt that does a web search.
Scraping once extensively and scraping a bit less but far more frequently have similar impacts.
-
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.
They are one of the sources!
The AI scraping that happens when a user enters a prompt is DDoSing sites, in addition to the scraping for training data that is DDoSing sites. These shitty companies are repeatedly slamming the same sites over and over again in the least efficient way, because they don't use the scraped training data when they process a user prompt that does a web search.
Scraping once extensively and scraping a bit less but far more frequently have similar impacts.
When a user enters a prompt, the backend may retrieve a handful of pages to serve that prompt. It won't retrieve all the pages of a site. That's hardly different from a user running a search and opening the five topmost links in tabs. If that is not a DoS attack, then an agent doing the same isn't a DDoS attack.
Constructing the training material in the first place is a different matter, but if you're asking about fresh events or new APIs, the training data just doesn't cut it. The training, and subsequently the material retrieval, was done a long time ago.
-
This post did not contain any content.
It's insane that anyone would side with Cloudflare here. To this day I can't visit many websites, like nexusmods, just because I run Firefox on Linux. The Cloudflare turnstile just refreshes infinitely, and has for months now.
Cloudflare is the biggest cancer on the web, fucking burn it.
-
They can't get their AI to check a box that says "I am not a robot"? I'd think that'd be a first-year comp sci student task. And robots.txt files were basically always voluntary compliance anyway.
Cloudflare actually fully fingerprints your browser and even sells that data: your IP, TLS fingerprint, operating system, full browser environment, installed extensions, GPU capabilities, etc. It's all tracked before the box even shows up; in fact, the box is there to give the runtime more time to fingerprint you.
-
Perplexity argues that a platform’s inability to differentiate between helpful AI assistants and harmful bots causes misclassification of legitimate web traffic.
So, I assume Perplexity uses appropriately identifiable User-Agent headers, to allow hosters to decide whether to serve them one way or another?
It's not up to the hoster to decide whom to serve content to. The web is intended to be user-agent agnostic.
-
It's insane that anyone would side with Cloudflare here. To this day I can't visit many websites, like nexusmods, just because I run Firefox on Linux. The Cloudflare turnstile just refreshes infinitely, and has for months now.
Cloudflare is the biggest cancer on the web, fucking burn it.
omg ur a hacker
Did you mean Edge on Windows? 'Cause if so, welcome in!
-
It's insane that anyone would side with Cloudflare here. To this day I can't visit many websites, like nexusmods, just because I run Firefox on Linux. The Cloudflare turnstile just refreshes infinitely, and has for months now.
Cloudflare is the biggest cancer on the web, fucking burn it.
I'm on Linux with Firefox and have never had that issue (particularly with nexusmods, which I use regularly). Something else is probably wrong with your setup.
-
Perplexity argues that a platform’s inability to differentiate between helpful AI assistants and harmful bots causes misclassification of legitimate web traffic.
So, I assume Perplexity uses appropriately identifiable User-Agent headers, to allow hosters to decide whether to serve them one way or another?
And I'm assuming that if the robots.txt states their user agent isn't allowed to crawl, it obeys, right?
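Honoring robots.txt is the mechanical part; Python's standard library even ships a parser. A minimal sketch, using a hypothetical `ExampleBot` token (not Perplexity's real one) and an inline robots.txt instead of fetching one over the network:

```python
from urllib.robotparser import RobotFileParser

UA = "ExampleBot"  # hypothetical token; a real bot would use its published one

rp = RobotFileParser()
# A real crawler would do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse an inline example instead of fetching.
rp.parse([
    "User-agent: ExampleBot",
    "Disallow: /private/",
])

print(rp.can_fetch(UA, "https://example.com/private/page"))  # False
print(rp.can_fetch(UA, "https://example.com/public/page"))   # True
```

Whether a given bot actually makes this check before fetching is, of course, the whole dispute.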
-
I'm out of the loop, what's wrong with Cloudflare?
Centralization, mostly, but also their hands-off approach to most fascist content.
-
DoS attacks are already a crime, so of course the need for some kind of solution is clear. But any proposal that gatekeeps the internet and restricts the freedoms with which users can interact with it is no solution at all. To me, the openness of the web shouldn't be something that people merely consider, or are amenable to. It should be the foundation that every reasonable proposal treats as a first principle.
How "open" a website is, is up to the owner, and that's all. Unless we're talking about de-privatizing the internet as a whole, here.
-
I'm on Linux with Firefox and have never had that issue (particularly with nexusmods, which I use regularly). Something else is probably wrong with your setup.
In my case, it's usually the VPN.
-
They can't get their AI to check a box that says "I am not a robot"? I'd think that'd be a first-year comp sci student task. And robots.txt files were basically always voluntary compliance anyway.
reCAPTCHA v2 does way more than check whether the box was checked.
How does Google reCAPTCHA v2 work behind the scenes? (Stack Overflow)
-
Is there some simple, deployable PHP honeytrap for AI crawlers?
You could probably route their requests to your site back at themselves, so they DDoS themselves, and on top of it, cost them more because their endpoint needs to process everything through their LLM.
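Not PHP, but the core honeytrap idea is small enough to sketch in a few lines of Python: link a URL that your robots.txt disallows, and have it stream endless deterministic filler to whatever follows the link anyway. A real deployment would stream slowly and rate-limit; this only shows the generator.

```python
import itertools
import random

# Vocabulary for the filler text; arbitrary choice for the sketch.
WORDS = ["data", "model", "token", "crawl", "index", "cache", "vector"]

def endless_filler(seed: int = 0):
    """Yield an infinite stream of pseudo-sentences for a tarpit page.

    Deterministic per seed, so it costs almost nothing to generate,
    yet never terminates for a crawler that keeps reading.
    """
    rng = random.Random(seed)
    for _ in itertools.count():
        yield " ".join(rng.choice(WORDS) for _ in range(8)) + ".\n"

gen = endless_filler()
sample = [next(gen) for _ in range(3)]  # a trapped crawler would never stop
print("".join(sample), end="")
```

Hooking this up to a web framework (or porting it to PHP with an infinite `echo` loop and `flush()`) is straightforward; the point is that the content is cheap for you and worthless-but-endless for the scraper.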
-
First we complain that AI steals and trains on our data. Then we complain when it doesn't train. Cool.
I think it boils down to "consent" and "remuneration".
I run a website that I do not consent to being accessed by LLMs. Should LLMs use my content anyway, I should be compensated for that use.
So, these LLM startups ignore both consent and the idea of remuneration.
Most of these concepts have already been figured out in law, if we consider websites akin to real estate: the typical trespass laws, compensatory usage, and hell, even eminent domain if needed (i.e., a city government could "take over" the boosted-post feature to make sure alerts get pushed as widely and quickly as possible).