Skip to content

AI agents wrong ~70% of time: Carnegie Mellon study

Technology
285 108 835
  • Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.

    Dunno. Asking 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

  • What does "I give it data to put in a formulaic sentence." mean here

    Why not just share the details. I often find a lot of people saying it's doing crazy things and never like to share the details. It's very similar to discussing things with Trump supporters who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that trumpets do. It's wild how similar it is.

    And yes asking to do things like iterate over rows isn't how it works. It's getting better but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence but it's nowhere near what its strength is.

    I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE are strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

    The solution to the problem was to realise that an LLM cannot be trusted for accuracy even if the first few results are completely accurate, the bullshit well creep in. Don't trust the LLM. Check every fucking thing.

    In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

    The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.

  • Dunno. Asking 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

    You're better off asking one human to do the same task ten times. Humans get better and faster at things as they go along. Always slower than an LLM, but LLMs get more and more likely to veer off on some flight of fancy, further and further from reality, the more it says to you. The chances of it staying factual in the long term are really low.

    It's a born bullshitter. It knows a little about a lot, but it has no clue what's real and what's made up, or it doesn't care.

    If you want some text quickly, that sounds right, but you genuinely don't care whether it is right at all, go for it, use an LLM. It'll be great at that.

  • This post did not contain any content.

    Reading with CEO mindset. 3 out of 10 employees can be fired.

  • I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE are strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

    The solution to the problem was to realise that an LLM cannot be trusted for accuracy even if the first few results are completely accurate, the bullshit well creep in. Don't trust the LLM. Check every fucking thing.

    In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

    The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.

    This is crazy. I've literally been saying they are fallible. You're saying your professional fed and LLM some type of dataset. So I can't really say what it was you're trying to accomplish but I'm just arguing that trying to have it process data is not what they're trained to do. LLM are incredible tools and I'm tired of trying to act like they're not because people keep using them for things they're not built to do. It's not a fire and forget thing. It does need to be supervised and verified. It's not exactly an answer machine. But it's so good at parsing text and documents, summarizing, formatting and acting like a search engine that you can communicate with rather than trying to grok some arcane sentence. Its power is in language applications.

    It is so much fun to just play around with and figure out where it can help. I'm constantly doing things on my computer it's great for instructions. Especially if I get a problem that's kind of unique and needs a big of discussion to solve.

  • This is crazy. I've literally been saying they are fallible. You're saying your professional fed and LLM some type of dataset. So I can't really say what it was you're trying to accomplish but I'm just arguing that trying to have it process data is not what they're trained to do. LLM are incredible tools and I'm tired of trying to act like they're not because people keep using them for things they're not built to do. It's not a fire and forget thing. It does need to be supervised and verified. It's not exactly an answer machine. But it's so good at parsing text and documents, summarizing, formatting and acting like a search engine that you can communicate with rather than trying to grok some arcane sentence. Its power is in language applications.

    It is so much fun to just play around with and figure out where it can help. I'm constantly doing things on my computer it's great for instructions. Especially if I get a problem that's kind of unique and needs a big of discussion to solve.

    it’s so good at parsing text and documents, summarizing

    No. Not when it matters. It makes stuff up. The less you carefully check every single fucking thing it says, the more likely you are to believe some lies it subtly slipped in as it went along. If truth doesn't matter, go ahead and use LLMs.

    If you just want some ideas that you're going to sift through, independently verify and check for yourself with extreme skepticism as if Donald Trump were telling you how to achieve world peace, great, you're using LLMs effectively.

    But if you're trusting it, you're doing it very, very wrong and you're going to get humiliated because other people are going to catch you out in repeating an LLM's bullshit.

  • it’s so good at parsing text and documents, summarizing

    No. Not when it matters. It makes stuff up. The less you carefully check every single fucking thing it says, the more likely you are to believe some lies it subtly slipped in as it went along. If truth doesn't matter, go ahead and use LLMs.

    If you just want some ideas that you're going to sift through, independently verify and check for yourself with extreme skepticism as if Donald Trump were telling you how to achieve world peace, great, you're using LLMs effectively.

    But if you're trusting it, you're doing it very, very wrong and you're going to get humiliated because other people are going to catch you out in repeating an LLM's bullshit.

    If it's so bad as if you say, could you give an example of a prompt where it'll tell you incorrect information.

  • If it's so bad as if you say, could you give an example of a prompt where it'll tell you incorrect information.

    It's like you didn't listen to anything I ever said, or you discounted everything I said as fiction, but everything your dear LLM said is gospel truth in your eyes. It's utterly irrational. You have to be trolling me now.

  • It's like you didn't listen to anything I ever said, or you discounted everything I said as fiction, but everything your dear LLM said is gospel truth in your eyes. It's utterly irrational. You have to be trolling me now.

    Should be easy if it's that bad though

  • Should be easy if it's that bad though

    I already told you my experience of the crapness of LLMs and even explained why I can't share the prompt etc. You clearly weren't listening or are incapable of taking in information.

    There's also all the testing done by the people talked about in the article we're discussing which you're also irrationally dismissing.

    You have extreme confirmation bias.

    Everything you hear that disagrees with your absurd faith in the accuracy of the extreme blagging of LLMs gets dismissed for any excuse you can come up with.

  • I already told you my experience of the crapness of LLMs and even explained why I can't share the prompt etc. You clearly weren't listening or are incapable of taking in information.

    There's also all the testing done by the people talked about in the article we're discussing which you're also irrationally dismissing.

    You have extreme confirmation bias.

    Everything you hear that disagrees with your absurd faith in the accuracy of the extreme blagging of LLMs gets dismissed for any excuse you can come up with.

    You're projecting here. I'm asking you to give an example of any prompt. You're saying it's so bad that it needs to be babysat because it's errors. I'll only asking for your to give an example and you're saying that's confirmation bias and acting like I'm being religiously ignorant

  • You're projecting here. I'm asking you to give an example of any prompt. You're saying it's so bad that it needs to be babysat because it's errors. I'll only asking for your to give an example and you're saying that's confirmation bias and acting like I'm being religiously ignorant

    This is you

  • 86 Stimmen
    12 Beiträge
    5 Aufrufe
    R
    TIL. Never used either.
  • 61 Stimmen
    17 Beiträge
    68 Aufrufe
    anzo@programming.devA
    I’ll probably never trust anything they’ve touched until I’ve taken it apart and put it back together again. Me too. But the vast majority of users need guardrails, and have a different threat model. Even those that also care about privacy, if they just want a solution that comes by default, this adtech 'fake' or 'superficial' solution does provide something. And anything is more than nothing.
  • 2 Stimmen
    1 Beiträge
    12 Aufrufe
    Niemand hat geantwortet
  • Open Source CAD In The Browser

    Technology technology
    19
    1
    152 Stimmen
    19 Beiträge
    77 Aufrufe
    xavier666@lemm.eeX
    Electron: Heyyyyyyy
  • 1k Stimmen
    95 Beiträge
    17 Aufrufe
    G
    Obviously the law must be simple enough to follow so that for Jim’s furniture shop is not a problem nor a too high cost to respect it, but it must be clear that if you break it you can cease to exist as company. I think this may be the root of our disagreement, I do not believe that there is any law making body today that is capable of an elegantly simple law. I could be too naive, but I think it is possible. We also definitely have a difference on opinion when it comes to the severity of the infraction, in my mind, while privacy is important, it should not have the same level of punishments associated with it when compared to something on the level of poisoning water ways; I think that a privacy law should hurt but be able to be learned from while in the poison case it should result in the bankruptcy of a company. The severity is directly proportional to the number of people affected. If you violate the privacy of 200 million people is the same that you poison the water of 10 people. And while with the poisoning scenario it could be better to jail the responsible people (for a very, very long time) and let the company survive to clean the water, once your privacy is violated there is no way back, a company could not fix it. The issue we find ourselves with today is that the aggregate of all privacy breaches makes it harmful to the people, but with a sizeable enough fine, I find it hard to believe that there would be major or lasting damage. So how much money your privacy it's worth ? 6 For this reason I don’t think it is wise to write laws that will bankrupt a company off of one infraction which was not directly or indirectly harmful to the physical well being of the people: and I am using indirectly a little bit more strict than I would like to since as I said before, the aggregate of all the information is harmful. The point is that the goal is not to bankrupt companies but to have them behave right. The penalty associated to every law IS the tool that make you respect the law. And it must be so high that you don't want to break the law. I would have to look into the laws in question, but on a surface level I think that any company should be subjected to the same baseline privacy laws, so if there isn’t anything screwy within the law that apple, Google, and Facebook are ignoring, I think it should apply to them. Trust me on this one, direct experience payment processors have a lot more rules to follow to be able to work. I do not want jail time for the CEO by default but he need to know that he will pay personally if the company break the law, it is the only way to make him run the company being sure that it follow the laws. For some reason I don’t have my usual cynicism when it comes to this issue. I think that the magnitude of loses that vested interests have in these companies would make it so that companies would police themselves for fear of losing profits. That being said I wouldn’t be opposed to some form of personal accountability on corporate leadership, but I fear that they will just end up finding a way to create a scapegoat everytime. It is not cynicism. I simply think that a huge fine to a single person (the CEO for example) is useless since it too easy to avoid and if it really huge realistically it would be never paid anyway so nothing usefull since the net worth of this kind of people is only on the paper. So if you slap a 100 billion file to Musk he will never pay because he has not the money to pay even if technically he is worth way more than that. Jail time instead is something that even Musk can experience. In general I like laws that are as objective as possible, I think that a privacy law should be written so that it is very objectively overbearing, but that has a smaller fine associated with it. This way the law is very clear on right and wrong, while also giving the businesses time and incentive to change their practices without having to sink large amount of expenses into lawyers to review every minute detail, which is the logical conclusion of the one infraction bankrupt system that you seem to be supporting. Then you write a law that explicitally state what you can do and what is not allowed is forbidden by default.
  • GeForce GTX 970 8GB mod is back for a full review

    Technology technology
    1
    34 Stimmen
    1 Beiträge
    12 Aufrufe
    Niemand hat geantwortet
  • 0 Stimmen
    6 Beiträge
    32 Aufrufe
    H
    Then that's changed since the last time I toyed with the idea. Which, granted, was probably 20 years ago...
  • 0 Stimmen
    7 Beiträge
    10 Aufrufe
    C
    Oh this is a good callout, I'm definitely using wired and not wireless.