The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall
-
Step 1, SOMEHOW find a more punchable face than Altman
Put META-android Zuckerberg on it, or MechaHitler Musk.
-
On the flip side, most websites are so ad-ridden these days a reader mode or other summary tool is almost required for normal browsing. Not saying that AI is the right move, but I can understand not wanting to visit the actual page any more.
I put uBlock Origin, or another ad blocker, on all my browsers, including phone ones and forks.
-
Search engines have been doing relatively fine for decades now. But the crawlers from AI companies basically DDoS hosts in comparison, sending so many requests in such a short interval. They also crawl dynamic links that are expensive to render compared to a static page, ignore robots.txt entirely, or even use it to discover unlinked pages.
Servers have finite resources, especially self-hosted sites, while AI companies have disproportionately more at their disposal, easily grinding other systems to a halt by overwhelming them with requests.
That explains why Cloudflare keeps asking whether you're a bot or not, making you do that captcha.
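For contrast, here's a minimal sketch of what polite crawling looks like, using Python's standard library; the site, crawler name, and delay are placeholders, not any real crawler's behaviour:

```python
import time
import urllib.request
import urllib.robotparser

SITE = "https://example.org"        # placeholder site
USER_AGENT = "ExampleCrawler/1.0"   # hypothetical, clearly identifiable crawler name
DEFAULT_DELAY = 5                   # conservative fallback if the site sets no crawl-delay

# A polite crawler asks robots.txt first and honours the answer.
robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{SITE}/robots.txt")
robots.read()
delay = robots.crawl_delay(USER_AGENT) or DEFAULT_DELAY

def fetch(path: str) -> bytes | None:
    url = f"{SITE}{path}"
    if not robots.can_fetch(USER_AGENT, url):
        return None  # the site opted out; respect it
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    time.sleep(delay)  # throttle so the host isn't hammered
    return body
```

The complaint in the comment above is essentially that AI crawlers skip both of these steps: the robots.txt check and the delay between requests.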
-
A few weeks ago Cloudflare announced they were going to block AI crawling (good, in my opinion). However, they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.
I think it's also worth pointing out that all of the big AI companies are currently burning through cash at an absolutely astonishing rate, and none of them are anywhere close to being profitable. So pay-walling the data they use is probably gonna be pretty painful for their already-tortured bottom line (good).
They already said they weren't profitable; they're trying to stay on life support till the VC funds run out.
-
That would be terrible for a lot of people as they are the only company providing such services that doesn't charge for traffic.
They can use web.archive.org as a CDN (I do that with Cloudflare websites). But honestly, Cloudflare or not, the internet is broken.
-
…and Perplexity's scraping is unnecessarily traffic intensive since they don't cache the scraped data.
That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.
First we complain that AI steals and trains on our data. Then we complain when it doesn't train. Cool.
-
This post did not contain any content.
They can't get their AI to check a box that says "I am not a robot"? I'd think that'd be a first-year comp sci student level task. And robots.txt files were basically always voluntary compliance anyway.
-
Ehhhh, you are gaining access to content under the assumption that you are going to interact with ads and thus bring revenue to the person and/or company producing said content. If you block ads, you remove the authorisation granted to you by ads.
Careful, by that logic even not looking at an ad positioned at the bottom of the page (or otherwise not visible without scrolling) would mean removing the authorisation granted to you by ads.
-
This post did not contain any content.
I've developed my own agent for assisting me with researching a topic I'm passionate about, and I ran into the exact same barrier: Cloudflare intercepts my request and is clearly checking if I'm a human using a web browser. (For my network requests, I've defined my own user agent.)
So I use that as a signal that the website doesn't want automated tools scraping their data. That's fine with me: my agent just tells me that there might be interesting content on the site and gives me a deep link. I can extract the data and carry on my research on my own.
I completely understand where Perplexity is coming from, but at scale, implementations like Perplexity's are awful for the web. (Edited for clarity)
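For what it's worth, that "treat the block as a signal" fallback is easy to sketch. Below is a rough Python approximation of the behaviour described above, using the requests library; the user-agent string and the challenge heuristic are illustrative guesses, not Cloudflare's documented interface:

```python
import requests

USER_AGENT = "MyResearchAgent/0.1 (personal project)"  # hypothetical custom user agent

def looks_like_challenge(resp: requests.Response) -> bool:
    # Heuristic only: challenge pages tend to return 403/503 and mention a
    # browser check; this is not an official Cloudflare signal.
    return resp.status_code in (403, 503) and "challenge" in resp.text.lower()

def fetch_or_defer(url: str) -> str:
    resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    if looks_like_challenge(resp):
        # Treat the block as "no bots, please" and hand the link back to the human.
        return f"This site may have relevant content, but open it yourself: {url}"
    return resp.text
```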
-
They can use web.archive.org as a CDN (I do that with Cloudflare websites). But honestly, Cloudflare or not, the internet is broken.
Can you explain, please? How can I use archive.org as a CDN for my website?
-
yeah. still not worth dealing with fucking cloudflare. fuck cloudflare.
I'm out of the loop, what's wrong with Cloudflare?
-
Or find a more efficient way to manage data, since their current approach is basically DDOSing the internet for training data and also for responding to user interactions.
This is not about training data, though.
Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on websites.
I think the solution is quite clear, though: either make use of the user's identity to waltz through the blocks, or even make use of the user's browser to do it. Once a captcha appears, let the user solve it.
Though technically making all this happen flawlessly is quite a big task.
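One way that hand-off could be structured, as a rough sketch; none of this reflects how Perplexity actually works, and the consent-based cookie reuse is purely hypothetical:

```python
import requests

class NeedsUserAction(Exception):
    """Raised when the site challenges the request and a human has to step in."""
    def __init__(self, url: str):
        super().__init__(url)
        self.url = url

def fetch_for_user(url: str, user_cookies: dict | None = None) -> str:
    # If the user's browser already passed the challenge, reuse its cookies
    # (hypothetical, consent-based flow).
    resp = requests.get(url, cookies=user_cookies or {}, timeout=10)
    if resp.status_code in (403, 503):
        # Bounce the URL back to the client so the user can open it in their
        # own browser, solve the captcha, and hand the resulting session back.
        raise NeedsUserAction(url)
    return resp.text
```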
-
This post did not contain any content.
Uh, are they admitting they are trying to circumvent technological protections set up to restrict access to a system?
Isn’t that a literal computer crime?
-
I've developed my own agent for assisting me with researching a topic I'm passionate about, and I ran into the exact same barrier: Cloudflare intercepts my request and is clearly checking if I'm a human using a web browser. (For my network requests, I've defined my own user agent.)
So I use that as a signal that the website doesn't want automated tools scraping their data. That's fine with me: my agent just tells me that there might be interesting content on the site and gives me a deep link. I can extract the data and carry on my research on my own.
I completely understand where Perplexity is coming from, but at scale, implementations like Perplexity's are awful for the web. (Edited for clarity)
I hate to break it to you but not only does Cloudflare do this sort of thing, but so does Akamai, AWS, and virtually every other CDN provider out there. And far from being awful, it’s actually protecting the web.
We use Akamai where I work, and they inform us in real time when a request comes from a bot, and they further classify it as one of a dozen or so bots (search engine crawlers, analytics bots, advertising bots, social networks, AI bots, etc). It also informs us if it’s somebody impersonating a well known bot like Google, etc. So we can easily allow search engines to crawl our site while blocking AI bots, bots impersonating Google, and so on.
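The allow/deny decision sitting on top of a classification like that is simple; here's a sketch of what the edge logic might look like, with a made-up header name standing in for whatever field the CDN actually provides (not Akamai's real API):

```python
# "X-Bot-Category" is a placeholder header name, not Akamai's or Cloudflare's
# actual field; the categories are illustrative as well.
ALLOWED_BOTS = {"search_engine", "monitoring"}
BLOCKED_BOTS = {"ai_crawler", "impersonator", "scraper"}

def decide(headers: dict[str, str]) -> str:
    category = headers.get("X-Bot-Category", "human")
    if category in BLOCKED_BOTS:
        return "block"       # e.g. AI bots and fake Googlebots
    if category in ALLOWED_BOTS or category == "human":
        return "allow"       # search crawlers and real visitors pass
    return "challenge"       # unknown bot types get a challenge page

print(decide({"X-Bot-Category": "ai_crawler"}))     # -> block
print(decide({"X-Bot-Category": "search_engine"}))  # -> allow
```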
-
This is not about training data, though.
Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on websites.
I think the solution is quite clear, though: either make use of the user's identity to waltz through the blocks, or even make use of the user's browser to do it. Once a captcha appears, let the user solve it.
Though technically making all this happen flawlessly is quite a big task.
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on websites.
They are one of the sources!
The AI scraping when a user enters a prompt is DDOSing sites in addition to the scraping for training data that is DDOSing sites. These shitty companies are repeatedly slamming the same sites over and over again in the least efficient way because they are not using the scraped data from training when they process a user prompt that does a web search.
Scraping once extensively and scraping a bit less but far more frequently have similar impacts.
-
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on websites.
They are one of the sources!
The AI scraping when a user enters a prompt is DDOSing sites in addition to the scraping for training data that is DDOSing sites. These shitty companies are repeatedly slamming the same sites over and over again in the least efficient way because they are not using the scraped data from training when they process a user prompt that does a web search.
Scraping once extensively and scraping a bit less but far more frequently have similar impacts.
When a user enters a prompt, the backend may retrieve a handful of pages to serve that prompt. It won't retrieve all the pages of a site. That's hardly different from a user using a search engine and opening the five topmost links in tabs. If that is not a DoS attack, then an agent doing the same isn't a DDoS attack.
Constructing the training material in the first place is a different matter, but if you're asking about fresh events or new APIs, the training data just doesn't cut it. The training, and subsequently the material retrieval, was done a long time ago.
-
This post did not contain any content.
It's insane that anyone would side with Cloudflare here. To this day I can't visit many websites like nexusmods just because I run Firefox on Linux. The Cloudflare turnstile just refreshes infinitely and has been for months now.
Cloudflare is the biggest cancer on the web, fucking burn it.
-
They can't get their AI to check a box that says "I am not a robot"? I'd think that'd be a first-year comp sci student level task. And robots.txt files were basically always voluntary compliance anyway.
Cloudflare actually fully fingerprints your browser and even sells that data. That's your IP, TLS, operating system, full browser environment, installed extensions, GPU capabilities, etc. It's all tracked before the box even shows up; in fact, the box is there to give the runtime more time to fingerprint you.
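As a toy illustration of how much can be derived passively, before any checkbox renders: the signals and hashing below are a heavy simplification of my own, not Cloudflare's actual pipeline.

```python
import hashlib

def passive_fingerprint(request_info: dict[str, str]) -> str:
    # Combine signals that arrive with every request before any JavaScript runs.
    # Real systems also use TLS (JA3-style) fingerprints, header ordering, and
    # client-side probes like canvas/WebGL rendering.
    signals = [
        request_info.get("ip", ""),
        request_info.get("user_agent", ""),
        request_info.get("accept_language", ""),
        request_info.get("tls_fingerprint", ""),  # hypothetical value from the TLS terminator
    ]
    return hashlib.sha256("|".join(signals).encode()).hexdigest()[:16]

print(passive_fingerprint({
    "ip": "203.0.113.7",
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0",
    "accept_language": "en-US,en;q=0.5",
    "tls_fingerprint": "ja3-example",
}))  # a stable-ish identifier for "this looks like the same client"
```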
-
Perplexity argues that a platform’s inability to differentiate between helpful AI assistants and harmful bots causes misclassification of legitimate web traffic.
So, I assume Perplexity uses appropriate identifiable user-agent headers, to allow hosters to decide whether to serve them one way or another?
It's not up to the hoster to decide whom to serve content to. The web is intended to be user-agent agnostic.
-
It's insane that anyone would side with Cloudflare here. To this day I can't visit many websites like nexusmods just because I run Firefox on Linux. The Cloudflare turnstile just refreshes infinitely and has been for months now.
Cloudflare is the biggest cancer on the web, fucking burn it.
omg ur a hacker
Did you mean Edge on Windows? 'Cause if so, welcome in!