Millions of websites to get 'game-changing' AI bot blocker
-
Until the AI companies find a way around it. Love the idea so hopefully it causes at least 3 days of struggle for the AI crawlers.
Having said that... can someone else put this in place, so we don't end up with Cloudflare hosting everything and the whole web being one intern away from a global outage? Please? Pretty please?
Yeah this will have absolutely no impact on gathering training data.
I assumed it was to block AI agents crawling it during requests, which they’d be unlikely to bypass in the web UI.
But no company spending millions on training will hesitate to have an agent appear as a regular desktop user to scrape data.
-
Can you DRM a crawl?
You can if you're Cloudflare.
-
Yeah this will have absolutely no impact on gathering training data.
I assumed it was to block AI agents crawling it during requests, which they’d be unlikely to bypass in the web UI.
But no company spending millions on training will hesitate to have an agent appear as a regular desktop user to scrape data.
Does Cloudflare still look at the user agent? I thought they had more reliable data points.
-
Does Cloudflare still look at the user agent? I thought they had more reliable data points.
I meant an AI agent, not the browser's user agent. All data points can be spoofed, and if not, they’ll pay a human to scrape before they pay for content.
-
So... Proprietary Anubis?
-
I didn't read "bot blocker" wrong, that's for sure..
-
I meant an AI agent, not the browser's user agent. All data points can be spoofed, and if not, they’ll pay a human to scrape before they pay for content.
Okay, fair enough, I thought you meant just the user agent. The trouble with having a bot make it look like an actual user is viewing the data is that it's slow and inefficient. The trouble with paying humans to scrape the data is also that it's slow and inefficient. These companies want to ingest data ridiculously fast because there's so much of it. If all else fails, they'll resort to paying the content creators, but only if it's data they really do think gives their model a competitive edge in some metric and they can't pirate it. E.g., I can see them paying for scientific research they can't get from libgen, but not for some rando's blog post or a local news website.
-
are you comfortable with a single corporation having control over this sort of service? the current government is obviously not ideal but that shouldn’t stop us from regulating monopolies.
are you comfortable with a single corporation having control over this sort of service?
Honestly? A tiny bit more than a single country. I have at least some minuscule control over the corporation through voting and the local regulations that international corporations must follow, whereas I have absolutely no formal influence on the US govt.
-
Can you DRM a crawl?
Oh yes, DRM the whole internet and wire it to personal ID. Wet dream.
-
I wish there was an alternative (possibly European) to Cloudflare, because it's so scary to put all eggs in one basket.
-
I really wish the answer was a legally enforced robots.txt file that let any organization or individual user easily spell out the permissions for whatever web data they post. I often use an LLM as a search engine, and most of the time the citations are pretty decent and I use those to link out to the source content.
I run a small blog and I'd love to get indexed in an LLM, not blocked, as long as I was assured a reference link for any content used and had some legal recourse if I found my data was being misused.
I don't love the answer being another mega corporation posing as a white knight looking to skim some money off of the "loophole" that is AI copyright infringement.
-
How would you legally enforce robots.txt? It's not a legally sound system.
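For anyone wondering why enforcement is the sticking point: robots.txt today is purely advisory. A well-behaved crawler reads the file and decides for itself whether to fetch a URL. Here's a rough sketch of that check using Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot`, the rules, and the example.com URL are all made up for illustration, not taken from any real site or bot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one made-up AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
url = "https://example.com/blog/post"  # placeholder URL

# A crawler that honestly identifies itself is told not to fetch.
ai_allowed = parser.can_fetch("ExampleAIBot", url)

# The same crawler claiming to be a desktop browser is told it may fetch.
browser_allowed = parser.can_fetch("Mozilla/5.0", url)

print(ai_allowed, browser_allowed)
```

The answer depends entirely on the name the client chooses to report, and nothing in the file stops a crawler from skipping the check altogether, which is why any real guarantee would have to come from law or from server-side blocking.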