The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall
-
This post did not contain any content.
I've developed my own agent for assisting me with researching a topic I'm passionate about, and I ran into the exact same barrier: Cloudflare intercepts my request and is clearly checking if I'm a human using a web browser. (For my network requests, I've defined my own user agent.)
So I use that as a signal that the website doesn't want automated tools scraping their data. That's fine with me: my agent just tells me that there might be interesting content on the site and gives me a deep link. I can extract the data and carry on my research on my own.
I completely understand where Perplexity is coming from, but at scale, implementations like Perplexity's are awful for the web. (Edited for clarity)
-
They can use web.archive.org as a CDN (I do that with Cloudflare-fronted websites). But honestly, Cloudflare or not, the internet is broken.
Can you explain, please? How can I use archive.org as a CDN for my website?
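For what it's worth, the trick usually works like this (a sketch, not an endorsement): the Wayback Machine serves captures at predictable URLs, so you can point heavy static assets at the archived copy instead of your origin. Caveats: the archive must actually have a capture of the asset, and you're shifting bandwidth costs onto the Internet Archive, which they may not appreciate.

```python
def wayback_url(original_url: str, timestamp: str = "2") -> str:
    """Build a Wayback Machine URL for a cached copy of a resource.

    Using "2" as the timestamp is a commonly observed shortcut that makes
    the archive redirect to the most recent capture it holds; it is a
    convention, not a documented API guarantee.
    """
    return f"https://web.archive.org/web/{timestamp}/{original_url}"

print(wayback_url("https://example.com/assets/big-image.png"))
# https://web.archive.org/web/2/https://example.com/assets/big-image.png
```

You would then rewrite your HTML to reference those URLs for large assets, keeping only the pages themselves on your origin.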
-
yeah. still not worth dealing with fucking cloudflare. fuck cloudflare.
I'm out of the loop, what's wrong with Cloudflare?
-
Or find a more efficient way to manage data, since their current approach is basically DDoSing the internet for training data and also for responding to user interactions.
This is not about training data, though.
Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.
I think the solution is quite clear, though: either use the user's identity to waltz through the blocks, or even use the user's browser to do it. Once a captcha appears, let the user solve it.
Technically, though, making all of this happen flawlessly is quite a big task.
-
This post did not contain any content.
Uh, are they admitting they are trying to circumvent technological protections set up to restrict access to a system?
Isn’t that a literal computer crime?
-
I've developed my own agent for assisting me with researching a topic I'm passionate about, and I ran into the exact same barrier: Cloudflare intercepts my request and is clearly checking if I'm a human using a web browser. (For my network requests, I've defined my own user agent.)
So I use that as a signal that the website doesn't want automated tools scraping their data. That's fine with me: my agent just tells me that there might be interesting content on the site and gives me a deep link. I can extract the data and carry on my research on my own.
I completely understand where Perplexity is coming from, but at scale, implementations like Perplexity's are awful for the web. (Edited for clarity)
I hate to break it to you but not only does Cloudflare do this sort of thing, but so does Akamai, AWS, and virtually every other CDN provider out there. And far from being awful, it’s actually protecting the web.
We use Akamai where I work, and they inform us in real time when a request comes from a bot, and they further classify it as one of a dozen or so bots (search engine crawlers, analytics bots, advertising bots, social networks, AI bots, etc). It also informs us if it’s somebody impersonating a well known bot like Google, etc. So we can easily allow search engines to crawl our site while blocking AI bots, bots impersonating Google, and so on.
-
This is not about training data, though.
Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.
I think the solution is quite clear, though: either use the user's identity to waltz through the blocks, or even use the user's browser to do it. Once a captcha appears, let the user solve it.
Technically, though, making all of this happen flawlessly is quite a big task.
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.
They are one of the sources!
The AI scraping that happens when a user enters a prompt is DDoSing sites, in addition to the scraping for training data that is DDoSing sites. These shitty companies are repeatedly slamming the same sites over and over again in the least efficient way, because they don't use the scraped training data when they process a user prompt that does a web search.
Scraping once extensively and scraping a bit less but far more frequently have similar impacts.
-
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.
They are one of the sources!
The AI scraping that happens when a user enters a prompt is DDoSing sites, in addition to the scraping for training data that is DDoSing sites. These shitty companies are repeatedly slamming the same sites over and over again in the least efficient way, because they don't use the scraped training data when they process a user prompt that does a web search.
Scraping once extensively and scraping a bit less but far more frequently have similar impacts.
When a user enters a prompt, the backend may retrieve a handful of pages to serve that prompt. It won't retrieve all the pages of a site. That's hardly different from a user running a search and opening the five topmost links in tabs. If that is not a DoS attack, then an agent doing the same isn't a DDoS attack.
Constructing the training material in the first place is a different matter, but if you're asking about fresh events or new APIs, the training data just doesn't cut it. The training, and subsequently the material retrieval, was done a long time ago.
-
This post did not contain any content.
It's insane that anyone would side with Cloudflare here. To this day I can't visit many websites, like nexusmods, just because I run Firefox on Linux. The Cloudflare turnstile just refreshes infinitely, and has for months now.
Cloudflare is the biggest cancer on the web, fucking burn it.
-
They can't get their AI to check a box that says "I am not a robot"? I'd think that'd be a first-year comp sci student task. And robots.txt files were basically always voluntary compliance anyway.
Cloudflare actually fully fingerprints your browser and even sells that data: your IP, TLS fingerprint, operating system, full browser environment, installed extensions, GPU capabilities, etc. It's all tracked before the box even shows up; in fact, the box is there to give the runtime more time to fingerprint you.
-
Perplexity argues that a platform’s inability to differentiate between helpful AI assistants and harmful bots causes misclassification of legitimate web traffic.
So, I assume Perplexity uses appropriately identifiable User-Agent headers, to allow hosters to decide whether to serve them one way or another?
It's not up to the hoster to decide whom to serve content to. The web is intended to be user-agent agnostic.
-
It's insane that anyone would side with Cloudflare here. To this day I can't visit many websites, like nexusmods, just because I run Firefox on Linux. The Cloudflare turnstile just refreshes infinitely, and has for months now.
Cloudflare is the biggest cancer on the web, fucking burn it.
omg ur a hacker
Did you mean Edge on Windows? 'Cause if so, welcome in!
-
It's insane that anyone would side with Cloudflare here. To this day I can't visit many websites, like nexusmods, just because I run Firefox on Linux. The Cloudflare turnstile just refreshes infinitely, and has for months now.
Cloudflare is the biggest cancer on the web, fucking burn it.
I'm on Linux with Firefox and have never had that issue (particularly with nexusmods, which I use regularly). Something else is probably wrong with your setup.
-
Perplexity argues that a platform’s inability to differentiate between helpful AI assistants and harmful bots causes misclassification of legitimate web traffic.
So, I assume Perplexity uses appropriately identifiable User-Agent headers, to allow hosters to decide whether to serve them one way or another?
And I'm assuming that if the robots.txt states their user agent isn't allowed to crawl, it obeys, right?
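Honoring robots.txt is the mechanical part; Python's standard library even ships a parser. A minimal sketch, using a hypothetical `ExampleBot` token (not Perplexity's real one) and an inline robots.txt instead of fetching one over the network:

```python
from urllib.robotparser import RobotFileParser

UA = "ExampleBot"  # hypothetical token; a real bot would use its published one

rp = RobotFileParser()
# A real crawler would do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse an inline example instead of fetching.
rp.parse([
    "User-agent: ExampleBot",
    "Disallow: /private/",
])

print(rp.can_fetch(UA, "https://example.com/private/page"))  # False
print(rp.can_fetch(UA, "https://example.com/public/page"))   # True
```

Whether a given bot actually makes this check before fetching is, of course, the whole dispute.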
-
I'm out of the loop, what's wrong with Cloudflare?
Centralization, mostly, but also their hands-off approach to most fascist content.
-
DoS attacks are already a crime, so of course the need for some kind of solution is clear. But any proposal that gatekeeps the internet and restricts the freedoms with which users can interact with it is no solution at all. To me, the openness of the web shouldn't be something that people merely consider, or are amenable to. It should be the foundation that every reasonable proposal treats as a first principle.
How "open" a website is, is up to the owner, and that's all. Unless we're talking about de-privatizing the internet as a whole, here.
-
I'm on Linux with Firefox and have never had that issue (particularly with nexusmods, which I use regularly). Something else is probably wrong with your setup.
In my case, it's usually the VPN.
-
They can't get their AI to check a box that says "I am not a robot"? I'd think that'd be a first-year comp sci student task. And robots.txt files were basically always voluntary compliance anyway.
reCAPTCHA v2 does way more than check whether the box was checked.
How does Google reCAPTCHA v2 work behind the scenes? (Stack Overflow)
-
Is there some simple, deployable PHP honeytrap for AI crawlers?
You could probably route their requests to your site back at themselves, so they DDoS themselves, and on top of it, cost them more because their endpoint needs to process everything through their LLM.
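Not PHP, but the core honeytrap idea is small enough to sketch in a few lines of Python: link a URL that your robots.txt disallows, and have it stream endless deterministic filler to whatever follows the link anyway. A real deployment would stream slowly and rate-limit; this only shows the generator.

```python
import itertools
import random

# Vocabulary for the filler text; arbitrary choice for the sketch.
WORDS = ["data", "model", "token", "crawl", "index", "cache", "vector"]

def endless_filler(seed: int = 0):
    """Yield an infinite stream of pseudo-sentences for a tarpit page.

    Deterministic per seed, so it costs almost nothing to generate,
    yet never terminates for a crawler that keeps reading.
    """
    rng = random.Random(seed)
    for _ in itertools.count():
        yield " ".join(rng.choice(WORDS) for _ in range(8)) + ".\n"

gen = endless_filler()
sample = [next(gen) for _ in range(3)]  # a trapped crawler would never stop
print("".join(sample), end="")
```

Hooking this up to a web framework (or porting it to PHP with an infinite `echo` loop and `flush()`) is straightforward; the point is that the content is cheap for you and worthless-but-endless for the scraper.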
-
First we complain that AI steals and trains on our data. Then we complain when it doesn't train. Cool.
I think it boils down to "consent" and "remuneration".
I run a website that I do not consent to being accessed by LLMs. Should LLMs use my content anyway, I should be compensated for that use.
So, these LLM startups ignore both consent and the idea of remuneration.
Most of these concepts have already been figured out in law, if we consider websites akin to real estate: the typical trespass laws, compensatory usage, and hell, even eminent domain if needed (i.e., a city government could "take over" the boosted-post feature to make sure alerts get pushed as widely and quickly as possible).