The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall
-
I think in Cloudflare’s case the free tier website owners are more an example of just giving the users a limited product in hopes of enticing them to upgrade to the paid product with more features and better performance. Cloudflare might get some benefit in the ability to track end-users across more websites as part of their efforts to determine who is a real human versus a potentially-malicious bot, but I don’t think that really delivers the same ROI that Facebook and other services extract from their “free” services, where the users are the actual product.
Cloudflare might get some benefit in the ability to track end-users across more websites as part of their efforts to determine who is a real human versus a potentially-malicious bot
It lets them get a very wide base to test products against, which, in and of itself, is a huge benefit. They can test out far more edge cases than anyone else in the industry at the moment.
-
I for one am rooting for perplexity to go out of business forever.
Yeah, I know.
You’re engaging in motivated reasoning. That’s why you’re saying irrational things, because you’re working backwards from a conclusion (AI bad).
I don't see how categorically blocking non-human traffic is irrational given the current environment of AI scanning. And what's rational about demanding Cloudflare distinguish between the 'good guy' AI and 'bad guy' AI without proposing any methodology for doing so?
-
It's more than simply astonishing, it's mind-blowingly bonkers how much money they have to burn to see ANY amount of return. You think a normal company is bad, blowing a few thousand bucks on materials, equipment, and labor per day in order to make a few bucks revenue (not profit)? AI companies have to blow HUNDREDS OF BILLIONS on massive data center complexes in order to train their bots, and then the energy cost and water cost of running them adds a couple more million a day. ALL so they can make negative hundreds of dollars on every prompt you can dream of.
The ONLY reason AI firms are still a thing in the current tech tree is because Techbros everywhere have convinced the uberwealthy VC firms that AGI is RIGHT AROUND THE CORNER, and will save them SO much money on labor and efficiency that it'll all be worth it in permanent, pure, infinite profit. If that sounds like too much of a pipe dream to be realistic, congratulations, you're a sane and rational human being.
It’s more than simply astonishing, it’s mind-blowingly bonkers how much money they have to burn to see ANY amount of return
See, that's the trick, and it's used by LOADS of startups:
You don't actually have to see a return... You just have to have a good story showing there MAY be a GIANT return. The founders collect enormous salaries (funded by VC dollars, not their own), they burn through the money to create more illusion, then ask for more, then burn through that, foretelling the day when the money finally starts rolling in!
Meanwhile, just before it's "projected" to become insanely profitable, they sell out to someone, walk away with a giant check, and the product evaporates.
-
This post did not contain any content.
I hate that these bots ruin my read it later app.
-
The AI doesn't just do a web search and display a page, it grabs the search results and scrapes multiple pages far faster than a person could.
It doesn't matter whether a human initiated it when the load on the website is far, far higher and more intrusive in a shorter period of time with AI compared to a human doing a web search and reading the content themselves.
It creates web requests faster than a human could. It does not create web requests as fast as possible like a crawler does.
Websites can handle a lot of human user traffic, even if some human users are making 5x the requests of other users due to using automation tools (like LLM summarization).
A website cannot handle a single bot which, by itself, can generate tens of millions of times as much traffic as a human.
Cloudflare’s method of detecting bots is to attempt to fingerprint the browser and user behavior to detect automations which are usually run in environments that can’t render the content. They did this because, until now, users did not use automation tools so detecting and blocking automation tools was a way to get most of the bots.
Now, users do use automation tools, so this method of classification is dated and misclassifies human-generated traffic.
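To make that concrete, here's a rough sketch of what a render-based heuristic looks like and why it now misfires. This is purely illustrative Python; the signal names and thresholds are assumptions, not Cloudflare's actual rules:

```python
# Illustrative only: a render-based "is this a bot?" heuristic.
# Signal names and thresholds are invented for the example.

from dataclasses import dataclass

@dataclass
class Session:
    requests_per_minute: float   # how quickly the client pulls pages
    fetched_subresources: bool   # did it load CSS/JS/images like a full browser?
    passed_js_challenge: bool    # did client-side JavaScript actually run?

def classify(session: Session) -> str:
    # Old assumption: anything that doesn't render like a browser is a scraper.
    if not session.fetched_subresources or not session.passed_js_challenge:
        return "bot"
    return "human"

# A user-initiated summarizer uses a stripped-down fetcher: human-like request
# rate, but no CSS/JS, so this heuristic lumps it in with mass crawlers.
summarizer = Session(requests_per_minute=3, fetched_subresources=False, passed_js_challenge=False)
crawler = Session(requests_per_minute=90_000, fetched_subresources=False, passed_js_challenge=False)
print(classify(summarizer), classify(crawler))  # both come back "bot"
```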
-
It isn’t opt in.
You can block all bot page scraping along with user-initiated AI tools, or you can block no traffic at all.
There isn’t an option to block bot page scraping but allow user-initiated AI tools.
Because, as the article points out, Cloudflare is not able to distinguish between the two
That's not true, I just viewed my panel in CF, and Perplexity is an optional block, which by default is off.
-
The number of people just reacting to the headline in the comments on these kinds of articles is always surprising.
Your browser acts as an agent too, you don’t manually visit every script link, image source and CSS file. Everyone has experienced how annoying it is to have your browser be targeted by Cloudflare.
There’s a pretty major difference between a human user loading a page and having it summarized and a bot that is scraping 1500 pages/second.
Cheering for Cloudflare to be the arbiter of what technologies are allowed is incredibly short sighted. They exist to provide their clients with services, including bot mitigation. But a user initiated operation isn’t the same as a bot.
Which is the point of the article and the article’s title.
It isn’t clear why OP had to alter the headline to bait the anti-AI crowd.
Thank you for trying to fight the irrational anti-AI brainrot on lemmy! It’s probably a lost cause, but your efforts are appreciated
-
When sites put up challenges like Anubis or other measures to authenticate that the viewer isn't a robot, and scrapers then employ measures to thwart that authentication (via spoofing or other means), I think it's reasonable to treat that as a violation of the CFAA in spirit, especially since these mass-scraping activities are getting attention for the damage they are causing to site operators (another factor in the CFAA, and one that would promote this to felony activity).
The fact is these laws are already on the books; we may as well use them to shut down this objectively harmful activity AI scrapers are doing.
Nah, that would also mean using Newpipe, YoutubeDL, Revanced, and Tachiyomi would be a crime, and it would only take the re-introduction of WEI to extend that criminalization to the rest of the web ecosystem. It would be extremely shortsighted and foolish of me to cheer on the criminalization of user spoofing and browser automation because of this.
-
This post did not contain any content.
-
…and Perplexity's scraping is unnecessarily traffic intensive since they don't cache the scraped data.
That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.
They do it this way in case the data changed, similar to how a person would be viewing the current site. The training was for the basic understanding, the real time scraping is to account for changes.
It is also horribly inefficient and works like a small-scale DDoS attack.
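For what it's worth, re-checking whether a page changed doesn't have to mean re-downloading it every time: a scraper could use HTTP conditional requests so the server can answer with a tiny 304 when nothing changed. Here's a minimal sketch with Python's requests library; this is hypothetical scraper code, not a description of how Perplexity actually operates:

```python
# Hypothetical scraper-side cache using HTTP conditional requests (ETags).
# Sketch only; not how Perplexity actually does it.

import requests

etag_cache: dict[str, tuple[str, str]] = {}  # url -> (etag, cached body)

def fetch(url: str) -> str:
    headers = {}
    cached = etag_cache.get(url)
    if cached:
        headers["If-None-Match"] = cached[0]   # ask "has this changed since last time?"
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304 and cached:
        return cached[1]                       # unchanged: reuse stored copy, near-zero transfer
    etag = resp.headers.get("ETag")
    if etag:
        etag_cache[url] = (etag, resp.text)
    return resp.text
```

Servers that don't send ETags could be handled the same way with Last-Modified/If-Modified-Since headers.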
-
I think in Cloudflare’s case the free tier website owners are more an example of just giving the users a limited product in hopes of enticing them to upgrade to the paid product with more features and better performance. Cloudflare might get some benefit in the ability to track end-users across more websites as part of their efforts to determine who is a real human versus a potentially-malicious bot, but I don’t think that really delivers the same ROI that Facebook and other services extract from their “free” services, where the users are the actual product.
It's a spectrum, and Cloudflare has snuffed out or gobbled up pretty much everyone they need to before ending the honeymoon phase.
-
I don't see how categorically blocking non-human traffic is irrational given the current environment of AI scanning. And what's rational about demanding Cloudflare distinguish between the 'good guy' AI and 'bad guy' AI without proposing any methodology for doing so?
It is blocking human traffic, that’s the entire premise of the article.
Attempting to say that this is non-human traffic makes no sense if you understand how a browser works. When you load a website your browser, acting as an agent, does a lot of tasks for you and generates a bunch of web requests across multiple hosts.
Your browser downloads the HTML from the website; it parses the contents of that file for image, script, and CSS links; it retrieves them from the various websites which host them; and it interprets the JavaScript and makes further web requests based on that. Often the scripting keeps the browser constantly sending requests to a website in order to update the content (like web-based email).
All of this is automated and done on your behalf. But you wouldn’t classify this traffic as non-human because a person told the browser to do that task and the task resulted in a flurry of web requests and processing on behalf of the user.
Summarization is just another task, which is requested by a human.
The primary difference, and why it is incorrectly classified, is that the summarization tools use a stripped-down browser. It doesn’t need to execute JavaScript or render CSS to change the background color, so it doesn’t waste resources on that stuff.
Cloudflare detects this kind of environment, one that doesn’t fully render a page, and assumes that it is a web scraper. This used to be a good way to detect scraping because the average user didn’t use web automation tools and scrapers did.
Regular users do use automation tools now, so detecting automation doesn’t guarantee that the agent is a scraper bot.
The point of the article is that this heuristic doesn’t work anymore: users now use automation tools in a manner that doesn’t generate tens of millions of requests per second or overwhelm servers, so they shouldn’t be classified the same way.
The point of Cloudflare’s bot blocking is to prevent a single user from overwhelming a site’s resources. These tools don’t do that. Go use any search summarization tool and see for yourself: it usually grabs one page from each source. That kind of traffic uses fewer resources than a human user (because it only grabs static content).
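If the worry is distinguishing that traffic from a crawler, a volume-based check is one way to sketch it. The thresholds and signal names below are invented for illustration and are not Cloudflare's implementation:

```python
# Illustrative only: separating "a few pages per question" automation from
# mass crawling by volume, rather than by whether the client renders CSS/JS.
# Thresholds are invented for the example.

def looks_like_mass_crawler(pages_last_minute: int, distinct_paths_last_hour: int) -> bool:
    # A user-driven summarizer typically pulls a handful of pages per query;
    # a crawler walks as much of the site as it can, as fast as it can.
    return pages_last_minute > 60 or distinct_paths_last_hour > 5_000

print(looks_like_mass_crawler(pages_last_minute=4, distinct_paths_last_hour=12))        # False (summarizer-like)
print(looks_like_mass_crawler(pages_last_minute=1500, distinct_paths_last_hour=90_000)) # True (crawler-like)
```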
-
Thank you for trying to fight the irrational anti-AI brainrot on lemmy! It’s probably a lost cause, but your efforts are appreciated
It’s an uphill battle. Lots of motivated reasoning and bad faith arguments
e: looks like Cloudflare is adding this distinction in their control panel. So it seems like they, too, disagree with the brain rot. Source: https://lemmy.world/post/34677771/18880370
-
That's not true, I just viewed my panel in CF, and Perplexity is an optional block, which by default is off.
They must be A/B testing a new feature then, it’s not on mine
-
This post did not contain any content.
good, that means it’s working
I’m gonna be frustrated (though not surprised) if the response is anything other than this.
-
That logic would not extend to ad blockers, as the point of concern is gaining unauthorized access to a computer system or asset. Blocking ads would not be considered gaining unauthorized access to anything. In fact it would be the opposite of that.
You say, just as news breaks that the top German court has overturned a decision that declared "ad blocking isn't piracy"
-
You say, just as news breaks that the top German court has overturned a decision that declared "ad blocking isn't piracy"
Unauthorized access into a computer system and “Piracy” are two very different things.
-
Unauthorized access into a computer system and “Piracy” are two very different things.
Please instruct me on how I go to the timeline where the legal system always makes decisions based on logic, reasoning, evidence and fairness and not...the opposite...of all those things
You have a lot of trust placed in the courts to actually do the right thing
-
It is blocking human traffic, that’s the entire premise of the article.
Attempting to say that this is non-human traffic makes no sense if you understand how a browser works. When you load a website your browser, acting as an agent, does a lot of tasks for you and generates a bunch of web requests across multiple hosts.
Your browser downloads the HTML from the website; it parses the contents of that file for image, script, and CSS links; it retrieves them from the various websites which host them; and it interprets the JavaScript and makes further web requests based on that. Often the scripting keeps the browser constantly sending requests to a website in order to update the content (like web-based email).
All of this is automated and done on your behalf. But you wouldn’t classify this traffic as non-human because a person told the browser to do that task and the task resulted in a flurry of web requests and processing on behalf of the user.
Summarization is just another task, which is requested by a human.
The primary difference, and why it is incorrectly classified, is that the summarization tools use a stripped-down browser. It doesn’t need to execute JavaScript or render CSS to change the background color, so it doesn’t waste resources on that stuff.
Cloudflare detects this kind of environment, one that doesn’t fully render a page, and assumes that it is a web scraper. This used to be a good way to detect scraping because the average user didn’t use web automation tools and scrapers did.
Regular users do use automation tools now, so detecting automation doesn’t guarantee that the agent is a scraper bot.
The point of the article is that this heuristic doesn’t work anymore: users now use automation tools in a manner that doesn’t generate tens of millions of requests per second or overwhelm servers, so they shouldn’t be classified the same way.
The point of Cloudflare’s bot blocking is to prevent a single user from overwhelming a site’s resources. These tools don’t do that. Go use any search summarization tool and see for yourself: it usually grabs one page from each source. That kind of traffic uses fewer resources than a human user (because it only grabs static content).
So how would Cloudflare tell the difference between the good 'stripped down' queries and the bad? Still not hearing how that is supposed to work. If there's no way to tell the difference, the baby will be thrown out with the bathwater, and I can't blame them.
-
I can’t get over their CEO that looks like a nine year old. Not sure what it is about him
I think it's the beard, it makes his cheeks look puffed up a bit. His whole expression kinda looks like a grouchy toddler.