The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall

  • reCAPTCHA v2 does way more than check if the box was checked.

    You're not wrong, but it also lets more than 99.8% of bot traffic through on text challenges. It's like the TSA of website security: it's mostly there to keep the user busy while Cloudflare places itself as a man in the middle of your encrypted connection to a third party. The only difference between Cloudflare and a malicious attacker is Cloudflare's stated intention not to be evil. With that and 3 dollars I can buy myself a single hard shell taco from Taco Bell.

  • Site owners currently do and should have the freedom to decide who is and is not allowed to access the data, and to decide what purpose it gets used for. Idgaf if you think scraping is malicious or not, it is and should be illegal to violate clear and obvious barriers against scraping, at the owners' cost and for the scrapers' unsanctioned profit off the site owners' work.

    to decide what purpose it gets used for

    Yeah, fuck everything about that. If I'm a site visitor I should be able to do what I want with the data you send me. If I bypass your ads, or use your words to write a newspaper article that you don't like, tough shit. Publishing information is choosing not to control what happens to the information after it leaves your control.

    Don't like it? Make me sign an NDA. And even then, violating an NDA isn't a crime, much less a felony punishable by years of prison time.

    Interpreting the CFAA to cover scraping is absurd and draconian.

  • That all sounds very vague to me, and I don’t expect it to be captured properly by law any time soon.

    It has already been captured properly in law in most places. We can use the US as an example: Both intellectual property and real property have laws already that cover these very items.

    What does it mean for you and how is it different from being accessed by a user?

    Well, does a user burn up gigawatts of power, to access my site every time? That's a huge difference.

    Imagine you host a weather forecast. If that information is public, what kind of compensation do you expect from anyone or anything who accesses that data?

    Depends on the terms of service I set for that service.

    Is it okay for a person to access your site?

    Sure!

    Is it okay for a script written by that person to fetch data every day automatically?

    Sure! As long as it doesn't cause problems for me, the creator and hoster of said content.

    Would it be okay for a user to dump a page of your site with a headless browser?

    See above. Both power usage and causing problems for me.

    Would it be okay to let an LLM take a look at it to extract info required by a user?

    No. I said, I do not want my content and services to be used by and for LLMs.

    Have you heard about changedetection.io project?

    I have now. And should a user want to use that service, the service, which charges 8.99/month for it, needs to pay me a portion of that or risk being blocked.

    There's no need to use it, as I already provide RSS feeds for my content. Use the RSS feed, if you want updates.

    If some of these sound unfair to you, you might want to put a DRM on your data or something.

    Or, I can just block them, via a service like Cloudflare. Which I do.

    Would you expect a compensation from me after reading your comment?

    None. Unless you're wanting to access it via an LLM. Then I want compensation for the profit driven access to my content.

    Both intellectual property and real property have laws already that cover these very items.

    And it causes a lot of trouble to many people and pains me specifically. Information should not be gated or owned in a way that would make it illegal for anyone to access it under proper conditions. License expiration causing digital work to die out, DRM causing software to break, idiotic license owners not providing appropriate service, etc.

    Well, does a user burn up gigawatts of power, to access my site every time?

    Doing a GET request doesn't do that.

    As long as it doesn't cause problems for me, the creator and hoster of said content.

    What kind of problems would that be?

    Both power usage and causing problems for me.

    ?? How? And what?

    do not want my content and services to be used by and for LLMs.

    You have to agree that at one point "be used by LLM" would not be different from "be used by a user".

    which charges 8.99/month

    It's self-hosted and free.

    Use the RSS feed, if you want updates.

    How does that prohibit usage and processing of your info? That sounds like "I won't be providing any comments on Lemmy website, if you want my opinion you can mail me at a@b.com"

    I can just block them, via a service like Cloudflare. Which I do.

    That will never block all of them. Your info will be used without your consent and you will not feel troubled by it. So you might not feel troubled if more things do the same.

    None. Unless you're wanting to access it via an LLM. Then I want compensation for the profit driven access to my content.

    What if I use my locally hosted LLM? Anyway, the point is, selling text can't work well, and you're going to spend much more resources on collecting and summarizing data about how your text was used and how others benefited from it, in order to get compensation, than it's worth.

    Also, it might be the case that some information is actually worthless when compared to a service provided by things like LLM, even though they use that worthless information in the process.

    I'm all for killing off LLMs, btw. Concerns of site makers who think they are being damaged by things like Perplexity are nothing compared to what LLMs do to the world. Maybe laws should instead make it illegal to waste energy. Before energy becomes the main currency.

  • If you want anybody and everyone to be able to use everything you post for any purpose, right on, good for you, but don't try to force your morality on others who rely on their writing, programming, and artworks to make a living to survive.

  • I'm gonna continue to use ad blockers and yt-dlp, and if you think I'm a criminal for doing so, I'm gonna say you don't understand either technology or criminal law.

  • That's a crime, yeah, and if Alphabet wants to sue you for $1.34 in damages then they have that right, just as we should have the right to sue them if their AI crawlers make our site unusable and plagiarize our work to the tune of thousands of dollars, or even press charges for the criminal act of intentional disruption of services.

  • Information should not be gated or owned in a way that would make it illegal for anyone to access it under proper conditions.

    Then you don't believe content creators should have any control over their own works?

    The "proper conditions" are deemed by the content creator, not the consumers.

    Doing a GET request doesn’t do that.

    Not at all. It consumes at most, a watt.

    What kind of problems would that be?

    Increasing my hosting bill, to accommodate the senseless traffic being sent my way?

    Outages for my site, making my content unavailable for legitimate users?

    You have to agree that at one point “be used by LLM” would not be different from “be used by a user”.

    Not at all. LLMs are not users.

    It’s self-hosted and free.

    If you want, or they charge for the hosted version. If they want to use a paid for version, then they can divert some of that revenue to me, the creator, because without creators, they would have no product.

    How does that prohibit usage and processing of your info? That sounds like “I won’t be providing any comments on Lemmy website, if you want my opinion you can mail me at a@b.com”

    That's an apples-and-oranges comparison, and you know it.

    That will never block all of them. Your info will be used without your consent and you will not feel troubled by it. So you might not feel troubled if more things do the same.

    Perplexity seems to be troubled by it.

    What if I use my locally hosted LLM? Anyway, the point is, selling text can’t work well, and you’re going to spend much more resources on collecting and summarizing data about how your text was used and how others benefited from it, in order to get compensation, than it’s worth.

    If selling text can't work well, then why do LLM products insist on using my text, to sell it?

    Also, it might be the case that some information is actually worthless when compared to a service provided by things like LLM, even though they use that worthless information in the process.

    LLMs are a net negative, as far as costs go. They consume far more in resources than they provide in benefit. If my information was worthless without an LLM, it's worthless with an LLM, therefore, LLMs don't need to access it. Periodt.

    The bottom line? Content creators get the first say in how their content is used, and consumed. You are not entitled to their labor, for free, and without condition.

  • Don't feel like spending time on this anymore. To me you are no different from idiots who destroy information once they can't sell it anymore, who sue webarchive, who call a pirated copy a lost sale, who shut down game servers, etc. LLM might be worse than those but Perplexity is certainly a lesser player in the field.

  • LLM might be worse than those but Perplexity is certainly a lesser player in the field.

    It's a good thing I don't just block Perplexity, but all of the LLMs.

    And I won't comment on the rest of this, but let's consider another form of property: real estate.

    You own a plot of land. Should others be able to use it, however they feel, whenever they feel like? Or should you have a say in how it gets used?

    If you feel like you should have exclusive say in how real estate you own is used, and when, and by whom, why is intellectual property any different? There must be value in using it, so what's wrong with revenues generated by that use being shared (at least) with the creator?

    Last I checked, I'm not seeing rev shares from any of these LLMs that have certainly used my code and other content to train.

  • Thats a crime yeah and if Alphabet co wants to sue you for $1.34 damages then they have that right

    So yeah, I stand by my statement that anyone who thinks this is a crime, or should be a crime, has a poor understanding of either the technology or the law. In this case, even mentioning Alphabet suing for damages means that you don't know the difference between criminal law and civil law.

    press charges for the criminal act of intentional disruption of services

    That's not a crime, and again reveals gaps in your knowledge on this topic.

  • That is actually a crime: you will get prison time for a DDoS in the USA, UK, and EU. Presumably you will disappear if you do it in China.
