The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall
-
I think in Cloudflare’s case the free tier website owners are more an example of just giving the users a limited product in hopes of enticing them to upgrade to the paid product with more features and better performance. Cloudflare might get some benefit in the ability to track end-users across more websites as part of their efforts to determine who is a real human versus a potentially-malicious bot, but I don’t think that really delivers the same ROI that Facebook and other services extract from their “free” services, where the users are the actual product.
Cloudflare might get some benefit in the ability to track end-users across more websites as part of their efforts to determine who is a real human versus a potentially-malicious bot
It lets them get a very wide base to test products against, which, in and of itself, is a huge benefit. They can test out far more edge cases than anyone else in the industry at the moment.
-
I for one am rooting for perplexity to go out of business forever.
Yeah, I know.
You’re engaging in motivated reasoning. That’s why you’re saying irrational things, because you’re working backwards from a conclusion (AI bad).
I don't see how categorically blocking non-human traffic is irrational given the current environment of AI scanning. And what's rational about demanding Cloudflare distinguish between the 'good guy' AI and 'bad guy' AI without proposing any methodology for doing so?
-
It's more than simply astonishing, it's mind-blowingly bonkers how much money they have to burn to see ANY amount of return. You think a normal company is bad, blowing a few thousand bucks on materials, equipment, and labor per day in order to make a few bucks revenue (not profit)? AI companies have to blow HUNDREDS OF BILLIONS on massive data center complexes in order to train their bots, and then the energy cost and water cost of running them adds a couple more million a day. ALL so they can make negative hundreds of dollars on every prompt you can dream of.
The ONLY reason AI firms are still a thing in the current tech tree is because Techbros everywhere have convinced the uberwealthy VC firms that AGI is RIGHT AROUND THE CORNER, and will save them SO much money on labor and efficiency that it'll all be worth it in permanent, pure, infinite profit. If that sounds like too much of a pipe dream to be realistic, congratulations, you're a sane and rational human being.
It’s more than simply astonishing, it’s mind-blowingly bonkers how much money they have to burn to see ANY amount of return
See, that's the trick, and it's used by LOADS of startups:
You don't actually have to see a return... You just have to have a good story showing there MAY be a GIANT return. The founders collect enormous salaries (funded by VC dollars, not their own), they burn through the money to create more illusion, then ask for more, then burn through that, foretelling the day when the money finally starts rolling in!
Meanwhile, just before it's "projected" to become insanely profitable, they sell out to someone, walk away with a giant check, and the product evaporates.
-
This post did not contain any content.
I hate that these bots ruin my read it later app.
-
The AI doesn't just do a web search and display a page, it grabs the search results and scrapes multiple pages far faster than a person could.
It doesn't matter whether a human initiated it when the load on the website is far, far higher and more intrusive in a shorter period of time with AI compared to a human doing a web search and reading the content themselves.
It creates web requests faster than a human could. It does not create web requests as fast as possible like a crawler does.
Websites can handle a lot of human user traffic, even if some human users are making 5x the requests of other users due to using automation tools (like LLM summarization).
A website cannot handle a single bot which, by itself, can generate tens of millions of times as much traffic as a human.
Cloudflare’s method of detecting bots is to attempt to fingerprint the browser and user behavior to detect automations which are usually run in environments that can’t render the content. They did this because, until now, users did not use automation tools so detecting and blocking automation tools was a way to get most of the bots.
Now, users do use automation tools, so this method of classification is dated and misclassifies human-generated traffic.
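To make that concrete, here's a rough sketch of what a render-based heuristic looks like and why it now misfires. This is purely illustrative Python; the signal names and thresholds are assumptions, not Cloudflare's actual rules:

```python
# Illustrative only: a render-based "is this a bot?" heuristic.
# Signal names and thresholds are invented for the example.

from dataclasses import dataclass

@dataclass
class Session:
    requests_per_minute: float   # how quickly the client pulls pages
    fetched_subresources: bool   # did it load CSS/JS/images like a full browser?
    passed_js_challenge: bool    # did client-side JavaScript actually run?

def classify(session: Session) -> str:
    # Old assumption: anything that doesn't render like a browser is a scraper.
    if not session.fetched_subresources or not session.passed_js_challenge:
        return "bot"
    return "human"

# A user-initiated summarizer uses a stripped-down fetcher: human-like request
# rate, but no CSS/JS, so this heuristic lumps it in with mass crawlers.
summarizer = Session(requests_per_minute=3, fetched_subresources=False, passed_js_challenge=False)
crawler = Session(requests_per_minute=90_000, fetched_subresources=False, passed_js_challenge=False)
print(classify(summarizer), classify(crawler))  # both come back "bot"
```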
-
It isn’t opt in.
You can block all bot page scraping along with user-initiated AI tools, or you can block no traffic at all.
There isn’t an option to block bot page scraping but allow user-initiated AI tools.
Because, as the article points out, Cloudflare is not able to distinguish between the two
That's not true, I just viewed my panel in CF, and Perplexity is an optional block, which by default is off.
-
The number of people just reacting to the headline in the comments on these kinds of articles is always surprising.
Your browser acts as an agent too, you don’t manually visit every script link, image source and CSS file. Everyone has experienced how annoying it is to have your browser be targeted by Cloudflare.
There’s a pretty major difference between a human user loading a page and having it summarized and a bot that is scraping 1500 pages/second.
Cheering for Cloudflare to be the arbiter of what technologies are allowed is incredibly short sighted. They exist to provide their clients with services, including bot mitigation. But a user initiated operation isn’t the same as a bot.
Which is the point of the article and the article’s title.
It isn’t clear why OP had to alter the headline to bait the anti-AI crowd.
Thank you for trying to fight the irrational anti-AI brainrot on lemmy! It’s probably a lost cause, but your efforts are appreciated
-
When sites put up challenges like Anubis or other measures to authenticate that the viewer isn't a robot, and scrapers then employ measures to thwart that authentication (via spoofing or other means), I think it's reasonable to treat that as a violation of the CFAA in spirit, especially since these mass-scraping activities are getting attention for the damage they are causing to site operators (another factor in the CFAA, and one that would promote this to felony activity).
The fact is these laws are already on the books; we may as well use them to shut down this objectively harmful activity AI scrapers are doing.
Nah, that would also mean using Newpipe, YoutubeDL, Revanced, and Tachiyomi would be a crime, and it would only take the re-introduction of WEI to extend that criminalization to the rest of the web ecosystem. It would be extremely shortsighted and foolish of me to cheer on the criminalization of user spoofing and browser automation because of this.
-
This post did not contain any content.
-
…and Perplexity's scraping is unnecessarily traffic intensive since they don't cache the scraped data.
That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.
They do it this way in case the data changed, similar to how a person would be viewing the current site. The training was for the basic understanding, the real time scraping is to account for changes.
It is also horribly inefficient and works like a small-scale DDoS attack.
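For what it's worth, re-checking whether a page changed doesn't have to mean re-downloading it every time: a scraper could use HTTP conditional requests so the server can answer with a tiny 304 when nothing changed. Here's a minimal sketch with Python's requests library; this is hypothetical scraper code, not a description of how Perplexity actually operates:

```python
# Hypothetical scraper-side cache using HTTP conditional requests (ETags).
# Sketch only; not how Perplexity actually does it.

import requests

etag_cache: dict[str, tuple[str, str]] = {}  # url -> (etag, cached body)

def fetch(url: str) -> str:
    headers = {}
    cached = etag_cache.get(url)
    if cached:
        headers["If-None-Match"] = cached[0]   # ask "has this changed since last time?"
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304 and cached:
        return cached[1]                       # unchanged: reuse stored copy, near-zero transfer
    etag = resp.headers.get("ETag")
    if etag:
        etag_cache[url] = (etag, resp.text)
    return resp.text
```

Servers that don't send ETags could be handled the same way with Last-Modified/If-Modified-Since headers.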
-
I think in Cloudflare’s case the free tier website owners are more an example of just giving the users a limited product in hopes of enticing them to upgrade to the paid product with more features and better performance. Cloudflare might get some benefit in the ability to track end-users across more websites as part of their efforts to determine who is a real human versus a potentially-malicious bot, but I don’t think that really delivers the same ROI that Facebook and other services extract from their “free” services, where the users are the actual product.
It's a spectrum, and Cloudflare has snuffed out or gobbled up pretty much everyone they need to before ending the honeymoon phase.
-
I don't see how categorically blocking non-human traffic is irrational given the current environment of AI scanning. And what's rational about demanding Cloudflare distinguish between the 'good guy' AI and 'bad guy' AI without proposing any methodology for doing so?
It is blocking human traffic, that’s the entire premise of the article.
Attempting to say that this is non-human traffic makes no sense if you understand how a browser works. When you load a website your browser, acting as an agent, does a lot of tasks for you and generates a bunch of web requests across multiple hosts.
Your browser downloads the HTML from the website; it parses the contents of that file for image, script, and CSS links; it retrieves them from the various websites which host them; and it interprets the JavaScript and makes further web requests based on that. Often the scripting keeps the browser constantly sending requests to a website in order to update the content (like web-based email).
All of this is automated and done on your behalf. But you wouldn’t classify this traffic as non-human because a person told the browser to do that task and the task resulted in a flurry of web requests and processing on behalf of the user.
Summarization is just another task, which is requested by a human.
The primary difference, and why it is incorrectly classified, is that the summarization tools use a stripped-down browser. It doesn’t need to execute JavaScript or render CSS to change the background color, so it doesn’t waste resources on that stuff.
Cloudflare detects this kind of environment, one that doesn’t fully render a page, and assumes that it is a web scraper. This used to be a good way to detect scraping because the average user didn’t use web automation tools and scrapers did.
Regular users do use automation tools now, so detecting automation doesn’t guarantee that the agent is a scraper bot.
The point of the article is that this heuristic doesn’t work anymore: users now use automation tools in a manner that doesn’t generate tens of millions of requests per second or overwhelm servers, so they shouldn’t be classified the same way.
The point of Cloudflare’s bot blocking is to prevent a single user from overwhelming a site’s resources. These tools don’t do that. Go use any search summarization tool and see for yourself: it usually grabs one page from each source. That kind of traffic uses fewer resources than a human user (because it only grabs static content).
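If the worry is distinguishing that traffic from a crawler, a volume-based check is one way to sketch it. The thresholds and signal names below are invented for illustration and are not Cloudflare's implementation:

```python
# Illustrative only: separating "a few pages per question" automation from
# mass crawling by volume, rather than by whether the client renders CSS/JS.
# Thresholds are invented for the example.

def looks_like_mass_crawler(pages_last_minute: int, distinct_paths_last_hour: int) -> bool:
    # A user-driven summarizer typically pulls a handful of pages per query;
    # a crawler walks as much of the site as it can, as fast as it can.
    return pages_last_minute > 60 or distinct_paths_last_hour > 5_000

print(looks_like_mass_crawler(pages_last_minute=4, distinct_paths_last_hour=12))        # False (summarizer-like)
print(looks_like_mass_crawler(pages_last_minute=1500, distinct_paths_last_hour=90_000)) # True (crawler-like)
```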
-
Thank you for trying to fight the irrational anti-AI brainrot on lemmy! It’s probably a lost cause, but your efforts are appreciated
It’s an uphill battle. Lots of motivated reasoning and bad faith arguments
e: looks like Cloudflare is adding this distinction in their control panel. So it seems like they, too, disagree with the brain rot. Source: https://lemmy.world/post/34677771/18880370
-
That's not true, I just viewed my panel in CF, and Perplexity is an optional block, which by default is off.
They must be A/B testing a new feature then, it’s not on mine
-
This post did not contain any content.
good, that means it’s working
I’m gonna be frustrated (though not surprised) if the response is anything other than this.
-
That logic would not extend to ad blockers, as the point of concern is gaining unauthorized access to a computer system or asset. Blocking ads would not be considered gaining unauthorized access to anything. In fact it would be the opposite of that.
You say, just as news breaks that the top German court has overturned a decision that declared "ad blocking isn't piracy"
-
You say, just as news breaks that the top German court has overturned a decision that declared "ad blocking isn't piracy"
Unauthorized access into a computer system and “Piracy” are two very different things.
-
Unauthorized access into a computer system and “Piracy” are two very different things.
Please instruct me on how I go to the timeline where the legal system always makes decisions based on logic, reasoning, evidence and fairness and not...the opposite...of all those things
You have a lot of trust placed in the courts to actually do the right thing
-
It is blocking human traffic, that’s the entire premise of the article.
Attempting to say that this is non-human traffic makes no sense if you understand how a browser works. When you load a website your browser, acting as an agent, does a lot of tasks for you and generates a bunch of web requests across multiple hosts.
Your browser downloads the HTML from the website; it parses the contents of that file for image, script, and CSS links; it retrieves them from the various websites which host them; and it interprets the JavaScript and makes further web requests based on that. Often the scripting keeps the browser constantly sending requests to a website in order to update the content (like web-based email).
All of this is automated and done on your behalf. But you wouldn’t classify this traffic as non-human because a person told the browser to do that task and the task resulted in a flurry of web requests and processing on behalf of the user.
Summarization is just another task, which is requested by a human.
The primary difference, and why it is incorrectly classified, is that the summarization tools use a stripped-down browser. It doesn’t need to execute JavaScript or render CSS to change the background color, so it doesn’t waste resources on that stuff.
Cloudflare detects this kind of environment, one that doesn’t fully render a page, and assumes that it is a web scraper. This used to be a good way to detect scraping because the average user didn’t use web automation tools and scrapers did.
Regular users do use automation tools now, so detecting automation doesn’t guarantee that the agent is a scraper bot.
The point of the article is that this heuristic doesn’t work anymore: users now use automation tools in a manner that doesn’t generate tens of millions of requests per second or overwhelm servers, so they shouldn’t be classified the same way.
The point of Cloudflare’s bot blocking is to prevent a single user from overwhelming a site’s resources. These tools don’t do that. Go use any search summarization tool and see for yourself: it usually grabs one page from each source. That kind of traffic uses fewer resources than a human user (because it only grabs static content).
So how would Cloudflare tell the difference between the good 'stripped down' queries and the bad? Still not hearing how that is supposed to work. If there's no way to tell the difference, the baby will be thrown out with the bathwater, and I can't blame them.
-
I can’t get over their CEO that looks like a nine year old. Not sure what it is about him
I think it's the beard, it makes his cheeks look puffed up a bit. His whole expression kinda looks like a grouchy toddler.