AI agents wrong ~70% of time: Carnegie Mellon study

  • The 256 thing was written by a person. AI doesn't have exclusive rights to being dumb, plenty of dumb people around.

    you're right, the dumb of AI is completely comparable to the dumb of human, there's no difference worth talking about, sorry i even spoke the fuck up

  • I asked Claude 3.5 Haiku to write me a quine in COBOL in the BS2000 dialect. Claude does know that creating a perfect quine in COBOL is challenging due to the need to represent the self-referential nature of the code. After a few suggestions Claude restated its first draft, without proper BS2000 incantations, without a PERFORM statement, and without any self-referential REDEFINES. It's a lot of work. I stopped caring and moved on.

    For those who wonder: https://sourceforge.net/p/gnucobol/discussion/lounge/thread/495d8008/ has an example.

    Colour me unimpressed. I dread the day when they force the use of 'AI' on us at work.
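
    For readers who have not met the term: a quine is a program whose output is exactly its own source. The sketch below is not COBOL and has nothing to do with the BS2000 dialect; it only illustrates the self-reference trick in Python, using a format-string template that is filled with its own representation. The names TEMPLATE, QUINE_SOURCE and run_and_capture are made up for this illustration.

```python
import contextlib
import io

# The template contains a %r placeholder that will receive the repr of the
# template itself; %% collapses to a single literal % after formatting.
TEMPLATE = 's = %r\nprint(s %% s)'

# Filling the template with itself yields the complete two-line quine.
QUINE_SOURCE = TEMPLATE % TEMPLATE


def run_and_capture(source: str) -> str:
    """Execute `source` in a fresh namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


if __name__ == "__main__":
    printed = run_and_capture(QUINE_SOURCE)
    # print() appends one trailing newline to the reproduced source.
    print(printed == QUINE_SOURCE + "\n")
```

    The same idea (data that describes the program, plus code that prints that data twice) is what makes the COBOL version laborious, since the language has no equivalent one-line formatting shortcut.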

  • Why are you giving it data

    Because there's a button for that.

    Its output is dependent on the input

    This thing that you said... It's false.

    There's a sleep button on my laptop. Doesn't mean I would use it.

    I'm just trying to say you're saying the feature that everyone kind of knows doesn't work. ChatGPT is not trained to do calculations well.

    I just like technology and I think and fully believe the left's hatred of it is not logical. I believe it stems from a lot of media be and headlines. Why there's this push from the media is a question I would like to know more about. But overall, I see a lot of the same makers of bullshit yellow journalism for this stuff on the left as I do for similar bullshit in right-wing spaces towards other things.

  • America: "Good enough to handle 911 calls!"

    Is there really a plan to use this for 911 services??

  • Wow. 30% accuracy was the high score!
    From the article:

    Testing agents at the office

    For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

    They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

    The CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

    ⚫ Gemini-2.5-Pro (30.3 percent)
    ⚫ Claude-3.7-Sonnet (26.3 percent)
    ⚫ Claude-3.5-Sonnet (24 percent)
    ⚫ Gemini-2.0-Flash (11.4 percent)
    ⚫ GPT-4o (8.6 percent)
    ⚫ o3-mini (4.0 percent)
    ⚫ Gemini-1.5-Pro (3.4 percent)
    ⚫ Amazon-Nova-Pro-v1 (1.7 percent)
    ⚫ Llama-3.1-405b (7.4 percent)
    ⚫ Llama-3.3-70b (6.9 percent)
    ⚫ Qwen-2.5-72b (5.7 percent)
    ⚫ Llama-3.1-70b (1.7 percent)
    ⚫ Qwen-2-72b (1.1 percent)

    "We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper.

    sounds like the fault of the researchers not to build better tests or understand the limits of the software to use it right

  • sounds like the fault of the researchers not to build better tests or understand the limits of the software to use it right

    Are you arguing they should have built a test that makes AI perform better? How are you offended on behalf of AI?

  • you're right, the dumb of AI is completely comparable to the dumb of human, there's no difference worth talking about, sorry i even spoke the fuck up

    No worries.

  • Why would they be right beyond word sequence frequencies?

  • There's a sleep button on my laptop. Doesn't mean I would use it.

    I'm just trying to say you're saying the feature that everyone kind of knows doesn't work. ChatGPT is not trained to do calculations well.

    I just like technology and I think and fully believe the left's hatred of it is not logical. I believe it stems from a lot of media be and headlines. Why there's this push from the media is a question I would like to know more about. But overall, I see a lot of the same makers of bullshit yellow journalism for this stuff on the left as I do for similar bullshit in right-wing spaces towards other things.

    Again with dismissing the evidence of my own eyes!

    I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

    Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.

  • That looks better. Even with a fair coin, 10 heads in a row is almost impossible.

    And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.

    Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.
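
    For anyone checking the arithmetic: 0.0000059049 is 0.3 raised to the tenth power, which presumably comes from the roughly 30 percent success rate in the article. A minimal sketch, assuming each attempt is independent and succeeds with a fixed probability (a simplification, not something the study claims); the function name all_succeed is made up here.

```python
def all_succeed(p_single: float, attempts: int) -> float:
    """Probability that `attempts` independent tries all succeed,
    assuming each try succeeds with the same probability `p_single`."""
    return p_single ** attempts


# Ten agent tasks in a row at an assumed 30% success rate per task.
agent_streak = all_succeed(0.3, 10)

# Ten heads in a row with a fair coin, for comparison.
coin_streak = all_succeed(0.5, 10)

print(f"agent, 10 in a row: {agent_streak:.10f}")  # 0.0000059049
print(f"coin, 10 in a row:  {coin_streak:.10f}")   # 0.0009765625
```

    So ten heads with a fair coin is about 1 in 1,024: unlikely rather than impossible, and still roughly 165 times more likely than ten straight successes at 30 percent.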

  • Again with dismissing the evidence of my own eyes!

    I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

    Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.

    What does "I give it data to put in a formulaic sentence" mean here?

    Why not just share the details? I often find a lot of people saying it's doing crazy things and never like to share the details. It's very similar to discussing things with Trump supporters who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that trumpets do. It's wild how similar it is.

    And yes asking to do things like iterate over rows isn't how it works. It's getting better but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence but it's nowhere near what its strength is.

  • Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.

    Dunno. Asking 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

  • What does "I give it data to put in a formulaic sentence" mean here?

    Why not just share the details? I often find a lot of people saying it's doing crazy things and never like to share the details. It's very similar to discussing things with Trump supporters who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that trumpets do. It's wild how similar it is.

    And yes asking to do things like iterate over rows isn't how it works. It's getting better but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence but it's nowhere near what its strength is.

    I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

    The solution to the problem was to realise that an LLM cannot be trusted for accuracy: even if the first few results are completely accurate, the bullshit will creep in. Don't trust the LLM. Check every fucking thing.

    In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

    The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.
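
    The "quick script" mentioned above was not shared, so the following is only a guess at its shape: split each input row on tab characters and drop the fields into a fixed sentence. The column meanings, the template wording and the sample rows are all invented for illustration.

```python
def row_to_sentence(line: str, template: str) -> str:
    """Split one tab-separated row and substitute its fields into `template`."""
    fields = line.rstrip("\r\n").split("\t")
    return template.format(*fields)


def rows_to_sentences(text: str, template: str) -> list[str]:
    """Convert every non-blank row of `text` into one sentence."""
    return [
        row_to_sentence(line, template)
        for line in text.splitlines()
        if line.strip()
    ]


if __name__ == "__main__":
    # Hypothetical data: name, count, city (not from the original post).
    sample = "Alice\t3\tLondon\nBob\t7\tLeeds\n"
    # Hypothetical sentence pattern with one positional slot per column.
    pattern = "{0} placed {1} orders from {2}."
    for sentence in rows_to_sentences(sample, pattern):
        print(sentence)
```

    Unlike the LLM, this does exactly the same thing on row 500 as on row 1, and a row with too few fields raises an IndexError instead of being quietly filled in.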

  • Dunno. Asking 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

    You're better off asking one human to do the same task ten times. Humans get better and faster at things as they go along. Always slower than an LLM, but LLMs get more and more likely to veer off on some flight of fancy, further and further from reality, the more they say to you. The chances of one staying factual in the long term are really low.

    It's a born bullshitter. It knows a little about a lot, but it has no clue what's real and what's made up, or it doesn't care.

    If you want some text quickly, that sounds right, but you genuinely don't care whether it is right at all, go for it, use an LLM. It'll be great at that.

  • Reading with CEO mindset. 3 out of 10 employees can be fired.
