Skip to content

AI agents wrong ~70% of time: Carnegie Mellon study

Technology
285 108 690
  • What does "I give it data to put in a formulaic sentence." mean here

    Why not just share the details. I often find a lot of people saying it's doing crazy things and never like to share the details. It's very similar to discussing things with Trump supporters who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that trumpets do. It's wild how similar it is.

    And yes asking to do things like iterate over rows isn't how it works. It's getting better but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence but it's nowhere near what its strength is.

    I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE are strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

    The solution to the problem was to realise that an LLM cannot be trusted for accuracy even if the first few results are completely accurate, the bullshit well creep in. Don't trust the LLM. Check every fucking thing.

    In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

    The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.

  • Dunno. Asking 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

    You're better off asking one human to do the same task ten times. Humans get better and faster at things as they go along. Always slower than an LLM, but LLMs get more and more likely to veer off on some flight of fancy, further and further from reality, the more it says to you. The chances of it staying factual in the long term are really low.

    It's a born bullshitter. It knows a little about a lot, but it has no clue what's real and what's made up, or it doesn't care.

    If you want some text quickly, that sounds right, but you genuinely don't care whether it is right at all, go for it, use an LLM. It'll be great at that.

  • This post did not contain any content.

    Reading with CEO mindset. 3 out of 10 employees can be fired.

  • I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE are strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

    The solution to the problem was to realise that an LLM cannot be trusted for accuracy even if the first few results are completely accurate, the bullshit well creep in. Don't trust the LLM. Check every fucking thing.

    In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

    The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.

    This is crazy. I've literally been saying they are fallible. You're saying your professional fed and LLM some type of dataset. So I can't really say what it was you're trying to accomplish but I'm just arguing that trying to have it process data is not what they're trained to do. LLM are incredible tools and I'm tired of trying to act like they're not because people keep using them for things they're not built to do. It's not a fire and forget thing. It does need to be supervised and verified. It's not exactly an answer machine. But it's so good at parsing text and documents, summarizing, formatting and acting like a search engine that you can communicate with rather than trying to grok some arcane sentence. Its power is in language applications.

    It is so much fun to just play around with and figure out where it can help. I'm constantly doing things on my computer it's great for instructions. Especially if I get a problem that's kind of unique and needs a big of discussion to solve.

  • This is crazy. I've literally been saying they are fallible. You're saying your professional fed and LLM some type of dataset. So I can't really say what it was you're trying to accomplish but I'm just arguing that trying to have it process data is not what they're trained to do. LLM are incredible tools and I'm tired of trying to act like they're not because people keep using them for things they're not built to do. It's not a fire and forget thing. It does need to be supervised and verified. It's not exactly an answer machine. But it's so good at parsing text and documents, summarizing, formatting and acting like a search engine that you can communicate with rather than trying to grok some arcane sentence. Its power is in language applications.

    It is so much fun to just play around with and figure out where it can help. I'm constantly doing things on my computer it's great for instructions. Especially if I get a problem that's kind of unique and needs a big of discussion to solve.

    it’s so good at parsing text and documents, summarizing

    No. Not when it matters. It makes stuff up. The less you carefully check every single fucking thing it says, the more likely you are to believe some lies it subtly slipped in as it went along. If truth doesn't matter, go ahead and use LLMs.

    If you just want some ideas that you're going to sift through, independently verify and check for yourself with extreme skepticism as if Donald Trump were telling you how to achieve world peace, great, you're using LLMs effectively.

    But if you're trusting it, you're doing it very, very wrong and you're going to get humiliated because other people are going to catch you out in repeating an LLM's bullshit.

  • it’s so good at parsing text and documents, summarizing

    No. Not when it matters. It makes stuff up. The less you carefully check every single fucking thing it says, the more likely you are to believe some lies it subtly slipped in as it went along. If truth doesn't matter, go ahead and use LLMs.

    If you just want some ideas that you're going to sift through, independently verify and check for yourself with extreme skepticism as if Donald Trump were telling you how to achieve world peace, great, you're using LLMs effectively.

    But if you're trusting it, you're doing it very, very wrong and you're going to get humiliated because other people are going to catch you out in repeating an LLM's bullshit.

    If it's so bad as if you say, could you give an example of a prompt where it'll tell you incorrect information.

  • If it's so bad as if you say, could you give an example of a prompt where it'll tell you incorrect information.

    It's like you didn't listen to anything I ever said, or you discounted everything I said as fiction, but everything your dear LLM said is gospel truth in your eyes. It's utterly irrational. You have to be trolling me now.

  • It's like you didn't listen to anything I ever said, or you discounted everything I said as fiction, but everything your dear LLM said is gospel truth in your eyes. It's utterly irrational. You have to be trolling me now.

    Should be easy if it's that bad though

  • Should be easy if it's that bad though

    I already told you my experience of the crapness of LLMs and even explained why I can't share the prompt etc. You clearly weren't listening or are incapable of taking in information.

    There's also all the testing done by the people talked about in the article we're discussing which you're also irrationally dismissing.

    You have extreme confirmation bias.

    Everything you hear that disagrees with your absurd faith in the accuracy of the extreme blagging of LLMs gets dismissed for any excuse you can come up with.

  • I already told you my experience of the crapness of LLMs and even explained why I can't share the prompt etc. You clearly weren't listening or are incapable of taking in information.

    There's also all the testing done by the people talked about in the article we're discussing which you're also irrationally dismissing.

    You have extreme confirmation bias.

    Everything you hear that disagrees with your absurd faith in the accuracy of the extreme blagging of LLMs gets dismissed for any excuse you can come up with.

    You're projecting here. I'm asking you to give an example of any prompt. You're saying it's so bad that it needs to be babysat because it's errors. I'll only asking for your to give an example and you're saying that's confirmation bias and acting like I'm being religiously ignorant

  • You're projecting here. I'm asking you to give an example of any prompt. You're saying it's so bad that it needs to be babysat because it's errors. I'll only asking for your to give an example and you're saying that's confirmation bias and acting like I'm being religiously ignorant

    This is you

  • 196 Stimmen
    30 Beiträge
    11 Aufrufe
    D
    This guy gets it. And from my professional experience, Gen Z sucks at separating the two.
  • 337 Stimmen
    19 Beiträge
    83 Aufrufe
    R
    What I'm speaking about is that it should be impossible to do some things. If it's possible, they will be done, and there's nothing you can do about it. To solve the problem of twiddled social media (and moderation used to assert dominance) we need a decentralized system of 90s Web reimagined, and Fediverse doesn't deliver it - if Facebook and Reddit are feudal states, then Fediverse is a confederation of smaller feudal entities. A post, a person, a community, a reaction and a change (by moderator or by the user) should be global entities (with global identifiers, so that the object by id of #0000001a2b3c4d6e7f890 would be the same object today or 10 years later on every server storing it) replicated over a network of servers similarly to Usenet (and to an IRC network, but in an IRC network servers are trusted, so it's not a good example for a global system). Really bad posts (or those by persons with history of posting such) should be banned on server level by everyone. The rest should be moderated by moderator reactions\changes of certain type. Ideally, for pooling of resources and resilience, servers would be separated by types into storage nodes (I think the name says it, FTP servers can do the job, but no need to be limited by it), index nodes (scraping many storage nodes, giving out results in structured format fit for any user representation, say, as a sequence of posts in one community, or like a list of communities found by tag, or ... , and possibly being connected into one DHT for Kademlia-like search, since no single index node will have everything), and (like in torrents?) tracker nodes for these and for identities, I think torrent-like announce-retrieve service is enough - to return a list of storage nodes storing, say, a specified partition (subspace of identifiers of objects, to make looking for something at least possibly efficient), or return a list of index nodes, or return a bunch of certificates and keys for an identity (should be somehow cryptographically connected to the global identifier of a person). So when a storage node comes online, it announces itself to a bunch of such trackers, similarly with index nodes, similarly with a user. One can also have a NOSTR-like service for real-time notifications by users. This way you'd have a global untrusted pooled infrastructure, allowing to replace many platforms. With common data, identities, services. Objects in storage and index services can be, say, in a format including a set of tags and then the body. So a specific application needing to show only data related to it would just search on index services and display only objects with tags of, say, "holo_ns:talk.bullshit.starwars" and "holo_t:post", like a sequence of posts with ability to comment, or maybe it would search objects with tags "holo_name:My 1999-like Star Wars holopage" and "holo_t:page" and display the links like search results in Google, and then clicking on that you'd see something presented like a webpage, except links would lead to global identifiers (or tag expressions interpreted by the particular application, who knows). (An index service may return, say, an array of objects, each with identifier, tags, list of locations on storage nodes where it's found or even bittorrent magnet links, and a free description possibly ; then the user application can unify responses of a few such services to avoid repetitions, maybe sort them, represent them as needed, so on.) The user applications for that common infrastructure can be different at the same time. Some like Facebook, some like ICQ, some like a web browser, some like a newsreader. (Star Wars is not a random reference, my whole habit of imagining tech stuff is from trying to imagine a science fiction world of the future, so yeah, this may seem like passive dreaming and it is.)
  • 149 Stimmen
    15 Beiträge
    56 Aufrufe
    M
    Don't get them wrong, they don't do this for you, or even morals. It just affects other interests too much.
  • China's Electric Vehicle Factories Have Become Tourist Hotspots

    Technology technology
    2
    1
    33 Stimmen
    2 Beiträge
    17 Aufrufe
    W
    I'd go to one. I went to Qatar and tried to find out if they did LPG tours. They don't. well at least not easily.
  • Websites Are Tracking You Via Browser Fingerprinting

    Technology technology
    41
    1
    296 Stimmen
    41 Beiträge
    169 Aufrufe
    M
    Lets you question how digital stalking is still allowed?
  • Iran asks its people to delete WhatsApp from their devices

    Technology technology
    1
    1
    0 Stimmen
    1 Beiträge
    10 Aufrufe
    Niemand hat geantwortet
  • 38 Stimmen
    7 Beiträge
    39 Aufrufe
    D
    Not easy but not hard actually really simple if you had the right energy. Just ignore this so I don't scare you.
  • Britain’s Companies Are Being Hacked

    Technology technology
    9
    1
    21 Stimmen
    9 Beiträge
    39 Aufrufe
    D
    Is that "goodbye" in Russian? Why?