AI agents wrong ~70% of time: Carnegie Mellon study

  • Dunno. Ask 10 humans at random to do a task and one of them will probably do it better than AI. Just not as fast.

  • You're better off asking one human to do the same task ten times. Humans get better and faster at things as they go along. They're always slower than an LLM, but an LLM gets more and more likely to veer off on some flight of fancy, further and further from reality, the more it says to you. The chances of it staying factual in the long term are really low.

    It's a born bullshitter. It knows a little about a lot, but it has no clue what's real and what's made up, or it doesn't care.

    If you want some text quickly, that sounds right, but you genuinely don't care whether it is right at all, go for it, use an LLM. It'll be great at that.

  • Reading with CEO mindset. 3 out of 10 employees can be fired.

  • I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

    The solution to the problem was to realise that an LLM cannot be trusted for accuracy. Even if the first few results are completely accurate, the bullshit will creep in. Don't trust the LLM. Check every fucking thing.

    In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

    The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.
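The "quick script" approach mentioned above can be sketched roughly as follows. This is a minimal illustration only: the poster's real input and sentence could not be shared, so the sample rows, the template, and the function name `rows_to_sentences` are all hypothetical.

```python
def rows_to_sentences(text, template, field_count):
    """Turn each tab-separated line of `text` into one formatted sentence.

    `template` and `field_count` are placeholders for whatever the real
    (undisclosed) data looked like.
    """
    sentences = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != field_count:
            # Fail loudly instead of guessing, unlike an LLM.
            raise ValueError(
                f"line {line_number}: expected {field_count} fields, got {len(fields)}"
            )
        sentences.append(template.format(*fields))
    return sentences


# Hypothetical sample data, for illustration only.
SAMPLE = "Ada\tLondon\t1815\nAlan\tLondon\t1912\n"
TEMPLATE = "{0} was born in {1} in {2}."

if __name__ == "__main__":
    for sentence in rows_to_sentences(SAMPLE, TEMPLATE, 3):
        print(sentence)
```

A script like this is deterministic: the same row always produces the same sentence, and a malformed row raises an error rather than being quietly filled in.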

  • This is crazy. I've literally been saying they are fallible. You're saying that in your professional work you fed an LLM some type of dataset. I can't really say what it was you were trying to accomplish, but I'm just arguing that processing data is not what they're trained to do. LLMs are incredible tools, and I'm tired of acting like they're not because people keep using them for things they're not built to do. It's not a fire-and-forget thing. It does need to be supervised and verified. It's not exactly an answer machine. But it's so good at parsing text and documents, summarizing, formatting and acting like a search engine that you can communicate with rather than trying to grok some arcane sentence. Its power is in language applications.

    It is so much fun to just play around with and figure out where it can help. I'm constantly doing things on my computer, and it's great for instructions. Especially if I get a problem that's kind of unique and needs a bit of discussion to solve.

  • "it's so good at parsing text and documents, summarizing"

    No. Not when it matters. It makes stuff up. The less you carefully check every single fucking thing it says, the more likely you are to believe some lies it subtly slipped in as it went along. If truth doesn't matter, go ahead and use LLMs.

    If you just want some ideas that you're going to sift through, independently verify and check for yourself with extreme skepticism as if Donald Trump were telling you how to achieve world peace, great, you're using LLMs effectively.

    But if you're trusting it, you're doing it very, very wrong and you're going to get humiliated because other people are going to catch you out in repeating an LLM's bullshit.

  • If it's as bad as you say, could you give an example of a prompt where it'll tell you incorrect information?

  • It's like you didn't listen to anything I ever said, or you discounted everything I said as fiction, but everything your dear LLM said is gospel truth in your eyes. It's utterly irrational. You have to be trolling me now.

  • Should be easy if it's that bad though

  • I already told you my experience of the crapness of LLMs and even explained why I can't share the prompt etc. You clearly weren't listening or are incapable of taking in information.

    There's also all the testing done by the people talked about in the article we're discussing which you're also irrationally dismissing.

    You have extreme confirmation bias.

    Everything you hear that disagrees with your absurd faith in the accuracy of the extreme blagging of LLMs gets dismissed for any excuse you can come up with.

  • You're projecting here. I'm asking you to give an example of any prompt. You're saying it's so bad that it needs to be babysat because of its errors. I'm only asking you to give an example, and you're saying that's confirmation bias and acting like I'm being religiously ignorant.

  • This is you
