The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall
-
This post did not contain any content.
This is why companies like Perplexity and OpenAI are creating browsers.
-
gaining unauthorized access to a computer system
And my point is that defining "unauthorized" to include visitors using unauthorized tools/methods to access a publicly visible resource would be a policy disaster.
If I put a banner on my site that says "by visiting my site you agree not to modify the scripts or ads displayed on the site," does that make my visit with an ad blocker "unauthorized" under the CFAA? I think the answer should obviously be "no," and that the way to define "authorization" is whether the website puts up some kind of login/authentication mechanism to block or allow specific users, not to put a simple request to the visiting public to please respect the rules of the site.
To me, a robots.txt is more like a friendly request to unauthenticated visitors than it is a technical implementation of some kind of authentication mechanism.
Scraping isn't hacking. I agree with the Third Circuit and the EFF: If the website owner makes a resource available to visitors without authentication, then accessing those resources isn't a crime, even if the website owner didn't intend for site visitors to use that specific method.
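For what it's worth, robots.txt is literally just a plain-text file of polite directives with nothing enforcing it, which is the "friendly request" point above. A hypothetical entry asking known AI crawlers to stay away might look like this (the user-agent tokens are examples of bots that identify themselves):

```
# robots.txt: a request, not an access control
User-agent: PerplexityBot
Disallow: /

User-agent: GPTBot
Disallow: /

# Everyone else may crawl everything
User-agent: *
Disallow:
```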
When sites put up challenges like Anubis or other measures to verify that the visitor isn't a robot, and scrapers then employ spoofing or other means to thwart that verification, I think that reasonably counts as a violation of the CFAA, at least in spirit, especially since these mass scraping operations are drawing attention for the damage they cause to site operators (another factor in the CFAA, and one that would elevate this to felony activity).
The fact is these laws are already on the books; we may as well use them to shut down the objectively harmful activity these AI scrapers are engaged in.
-
Yes, your "leave the internet any time you want" straw man is not a good argument.
If allowing Perplexity while blocking the bad guys is so easy, why not find a service that does that for you?
The topic is that Cloudflare is classifying human sourced traffic as bot sourced traffic.
Saying “Just don’t use it” is a straw man. It doesn’t change the fact that Cloudflare, one of the largest CDNs representing a significant portion of the websites and services in the US, is misclassifying traffic.
I used mine intentionally while knowing it was a straw man, did you?
The same with “if it’s so easy, just don’t use it” hopefully for obvious reasons.
This affects both the customers of Cloudflare (the web service owners) as well as the users of the web services. A single site/user opting out doesn’t change the fact that a large portion of the Internet is classifying human sourced traffic as bot sourced traffic.
-
Skill issue. Cope and seethe
this made me lol
-
There’s no difference in server load between a user looking at a page and a user using an AI tool to summarize the page.
The AI companies already overwhelmed sites to get training data and are repeating their shitty scraping practices when users interact with their AI. It’s the same fucking thing.
You either didn’t read the article or are deliberately making bad faith arguments. The entire point of the article is that the traffic that they’re referring to is initiated by a user, just like when you type an address into your browser’s address bar.
This traffic, initiated by a user, creates the same server load as that same user loading the page in a browser.
Yes, mass scraping of web pages creates a bunch of server load. This was the case before AI was even a thing.
This situation is like Cloudflare presenting a CAPTCHA in order to load each individual image, CSS, or JavaScript asset into a web browser, because bot traffic pretends to be a browser.
I don’t think it’s too hard to understand that a bot pretending to be a browser and a human operated browser are two completely different things and classifying them as the same (and captchaing them) would be a classification error.
This is exactly the same kind of error. Even if you personally believe that users using AI tools should be blocked, not everyone has the same opinion. If Cloudflare can’t distinguish between bot requests and human requests then their customers can’t opt out and allow their users to use AI tools even if they want to.
There is no difference between emptying a glass of water and draining swimming pool either if you ignore the total volume of water.
-
It isn’t opt in.
You can block all bot page scraping along with user-initiated AI tools, or you can block no traffic at all.
There isn't an option to block bot page scraping but allow user-initiated AI tools.
Because, as the article points out, Cloudflare is not able to distinguish between the two.
There's no appreciable difference between the two in how they affect site owners' systems.
-
The topic is that Cloudflare is classifying human sourced traffic as bot sourced traffic.
Saying “Just don’t use it” is a straw man. It doesn’t change the fact that Cloudflare, one of the largest CDNs representing a significant portion of the websites and services in the US, is misclassifying traffic.
I used mine intentionally while knowing it was a straw man, did you?
The same with “if it’s so easy, just don’t use it” hopefully for obvious reasons.
This affects both the customers of Cloudflare (the web service owners) as well as the users of the web services. A single site/user opting out doesn’t change the fact that a large portion of the Internet is classifying human sourced traffic as bot sourced traffic.
LOL "human sourced traffic" oh the tragedy. I for one am rooting for perplexity to go out of business forever.
-
There is no difference between emptying a glass of water and draining swimming pool either if you ignore the total volume of water.
I, too, can make any argument sound silly if I want to argue in bad faith.
A user cannot physically generate as much traffic as a bot.
Just like a glass of water cannot physically contain as much water as a swimming pool. Pretending the two are equal is ignorant in both cases.
-
This post did not contain any content.
Well... Good.
-
I, too, can make any argument sound silly if I want to argue in bad faith.
A user cannot physically generate as much traffic as a bot.
Just like a glass of water cannot physically contain as much water as a swimming pool. Pretending the two are equal is ignorant in both cases.
A user cannot physically generate as much traffic as a bot.
You are so close to getting it!
-
rare cloudflare w
As far as security is concerned, their w's are pretty common tbh. It's just the whole centralization issue.
-
A user cannot physically generate as much traffic as a bot.
You are so close to getting it!
And you’re not even close.
-
…and Perplexity's scraping is unnecessarily traffic intensive since they don't cache the scraped data.
That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.
It's worth giving the article a read. It seems that they're not using the data for training, but for real-time results.
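For illustration, here's a minimal sketch (hypothetical code, not Perplexity's actual fetcher) of the kind of short-lived caching people are saying is missing. Even something this naive would avoid re-downloading the same page for every query:

```python
import time
import urllib.request

# Hypothetical illustration: keep fetched pages around for a while instead of
# re-downloading them on every query. Not anyone's real fetcher.
_cache: dict[str, tuple[float, bytes]] = {}
CACHE_TTL_SECONDS = 15 * 60  # serve a cached copy for up to 15 minutes

def fetch(url: str) -> bytes:
    now = time.time()
    hit = _cache.get(url)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]  # cached copy: zero load on the origin server
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    _cache[url] = (now, body)
    return body
```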
-
There's no appreciable difference between the two in how they affect site owners' systems.
There’s a pretty significant difference in request rate. A tool trying to search and summarize will hit a search engine once, and each website maybe 5 times (if every search engine link points to the site).
A bot trying to scrape content from a website can generate thousands or tens of thousands of requests per second.
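Rough numbers: a person clicking through pages at one request every several seconds would need about a day to produce what a scraper doing 10,000 requests per second generates in one second. That gap is also why a simple per-IP rate limit catches bulk scrapers without ever bothering a human. A hypothetical nginx sketch (names and limits made up):

```nginx
# Hypothetical sketch: per-IP rate limit that no human browsing pattern
# will hit, but a bulk scraper exceeds almost immediately.
events {}

http {
    limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

    server {
        listen 80;
        root /var/www/html;

        location / {
            limit_req zone=per_ip burst=50 nodelay;
            # Requests beyond the burst are rejected (HTTP 503 by default).
        }
    }
}
```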
-
No, I'm telling Perplexity they can just buy their obstacle.
People who use the things you have described, for free, are themselves the products being sold; this is implied in the price.
I think in Cloudflare's case the free-tier website owners are more an example of giving users a limited product in hopes of enticing them to upgrade to the paid product, with more features and better performance. Cloudflare might get some benefit from the ability to track end users across more websites as part of its efforts to determine who is a real human versus a potentially malicious bot, but I don't think that really gives the same ROI that Facebook or other services extract from their "free" offerings, where the users are the actual product.
-
LOL "human sourced traffic" oh the tragedy. I for one am rooting for perplexity to go out of business forever.
I for one am rooting for perplexity to go out of business forever.
Yeah, I know.
You’re engaging in motivated reasoning. That’s why you’re saying irrational things: you’re working backwards from a conclusion (AI bad).
-
Ehhhh, you are gaining access to content on the assumption that you are going to interact with ads and thus bring revenue to the person and/or company producing said content. If you block ads, you remove the authorisation the ads grant you.
There was no header on the request saying "I want ads", though.
-
And you’re not even close.
The AI doesn't just do a web search and display a page; it grabs the search results and scrapes multiple pages far faster than a person could.
It doesn't matter whether a human initiated it when the load on the website is far, far higher and more intrusive in a shorter period of time with AI compared to a human doing a web search and reading the content themselves.
-
This post did not contain any content.
Good. I went through my CF panel and blocked some of those "AI Assistants" that were open by default, including Perplexity's.
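If the built-in toggles don't cover a bot you care about, a custom WAF rule keyed on the announced user agent does roughly the same thing. A hypothetical rule expression (example user-agent strings, and it only catches bots that identify themselves):

```
# Hypothetical Cloudflare custom rule expression, with the action set to Block:
(http.user_agent contains "PerplexityBot") or (http.user_agent contains "GPTBot")
```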
-
When sites put up challenges like Anubis or other measures to verify that the visitor isn't a robot, and scrapers then employ spoofing or other means to thwart that verification, I think that reasonably counts as a violation of the CFAA, at least in spirit, especially since these mass scraping operations are drawing attention for the damage they cause to site operators (another factor in the CFAA, and one that would elevate this to felony activity).
The fact is these laws are already on the books; we may as well use them to shut down the objectively harmful activity these AI scrapers are engaged in.
The fact is these laws are already on the books; we may as well use them to shut down the objectively harmful activity these AI scrapers are engaged in.
Silly plebe! Those laws are there to target the working class, not to be used against corporations. See: Copyright.