
AI agents wrong ~70% of time: Carnegie Mellon study

  • Why are you giving it data? It's a chat and language tool. It's not data-based. You need something trained for that specific use. I think Wolfram Alpha has better tools for that.

    I wouldn't trust it to calculate how many patio stones I need for a project. But I trust it to tell me where a good source on a topic is, or whether a quote was really said by whoever, or to help when I need to remember something but only have vague pieces, like an old-timey historical witch-burning factoid about villagers who pulled people through a hole in the church wall, or who the skeptical princess was who sent her scientist to villages to try to calm superstitious panic.

    Other uses are digging around my computer to see which processes do what, or how concepts work in the thing I'm currently learning. So many excellent uses. But I fucking wouldn't trust it to do any kind of calculation.

    Why are you giving it data

    Because there's a button for that.

    Its output is dependent on the input

    This thing that you said... It's false.

  • Wow. 30% accuracy was the high score!
    From the article:

    Testing agents at the office

    For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

    They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

    The CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

    ⚫ Gemini-2.5-Pro (30.3 percent)
    ⚫ Claude-3.7-Sonnet (26.3 percent)
    ⚫ Claude-3.5-Sonnet (24 percent)
    ⚫ Gemini-2.0-Flash (11.4 percent)
    ⚫ GPT-4o (8.6 percent)
    ⚫ o3-mini (4.0 percent)
    ⚫ Gemini-1.5-Pro (3.4 percent)
    ⚫ Amazon-Nova-Pro-v1 (1.7 percent)
    ⚫ Llama-3.1-405b (7.4 percent)
    ⚫ Llama-3.3-70b (6.9 percent)
    ⚫ Qwen-2.5-72b (5.7 percent)
    ⚫ Llama-3.1-70b (1.7 percent)
    ⚫ Qwen-2-72b (1.1 percent)

    "We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper.

  • Ah, my bad, you're right, for being consistently correct, I should have done 0.3^10=0.0000059049

    so the chances of it being right ten times in a row are less than one thousandth of a percent.

    No wonder I couldn't get it to summarise my list of data right and it was always lying by the 7th row.

    That looks better. Even with a fair coin, 10 heads in a row is almost impossible.

    And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.
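
The arithmetic in this exchange can be checked with a short sketch. It assumes every attempt is independent and succeeds with the same probability, which is the commenters' simplification rather than anything the benchmark measures (the 30.3 percent figure is a share of tasks completed, not a per-step accuracy). The function name `all_correct_probability` is made up for illustration.

```python
def all_correct_probability(p_single: float, attempts: int) -> float:
    """Probability that every one of `attempts` independent tries succeeds.

    Assumes each try succeeds with the same probability `p_single`,
    independently of the others (a simplification).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    return p_single ** attempts


# 30% success per attempt, ten attempts in a row (the figure quoted in the thread).
llm_streak = all_correct_probability(0.3, 10)

# A fair coin landing heads ten times in a row, for comparison.
coin_streak = all_correct_probability(0.5, 10)

print(f"0.3^10 = {llm_streak:.10f}, i.e. {llm_streak * 100:.5f} percent")
print(f"0.5^10 = {coin_streak:.10f}, i.e. about 1 in {round(1 / coin_streak)}")
```

Under these assumptions the ten-in-a-row chance at 30 percent is indeed below one thousandth of a percent, while ten heads with a fair coin comes out at 1 in 1024: unlikely, but a good deal more likely than "almost impossible" suggests.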

  • You just can't talk to people, period, you are just a dick, you were also just proven to be stupider than a fucking LLM, have a nice day 😀

    Did the autocomplete tell you to answer this? Don't answer, actually, save some energy.

  • Now I'm curious, what's the average score for humans?

  • The 256 thing was written by a person. AI doesn't have exclusive rights to being dumb, plenty of dumb people around.

    you're right, the dumb of AI is completely comparable to the dumb of human, there's no difference worth talking about, sorry i even spoke the fuck up

  • I asked Claude 3.5 Haiku to write me a quine in COBOL in the BS2000 dialect. Claude does know that creating a perfect quine in COBOL is challenging due to the need to represent the self-referential nature of the code. After a few suggestions Claude restated its first draft, without proper BS2000 incantations, without a PERFORM statement, and without any self-referential REDEFINES. It's a lot of work. I stopped caring and moved on.

    For those who wonder: https://sourceforge.net/p/gnucobol/discussion/lounge/thread/495d8008/ has an example.

    Colour me unimpressed. I dread the day when they force the use of 'AI' on us at work.

  • Why are you giving it data

    Because there's a button for that.

    Its output is dependent on the input

    This thing that you said... It's false.

    There's a sleep button on my laptop. Doesn't mean I would use it.

    I'm just trying to say you're using a feature that everyone kind of knows doesn't work. ChatGPT is not trained to do calculations well.

    I just like technology, and I fully believe the left's hatred of it is not logical. I believe it stems from a lot of media BS and headlines. Why there's this push from the media is something I would like to know more about. But overall, I see a lot of the same makers of bullshit yellow journalism for this stuff on the left as I do for similar bullshit in right-wing spaces towards other things.

  • America: "Good enough to handle 911 calls!"

    Is there really a plan to use this for 911 services??

  • Wow. 30% accuracy was the high score!

    Sounds like the researchers' fault for not building better tests, or for not understanding the limits of the software well enough to use it right.

  • Sounds like the researchers' fault for not building better tests, or for not understanding the limits of the software well enough to use it right.

    Are you arguing they should have built a test that makes AI perform better? How are you offended on behalf of AI?

  • you're right, the dumb of AI is completely comparable to the dumb of human, there's no difference worth talking about, sorry i even spoke the fuck up

    No worries.

  • Why would they be right beyond word sequence frequencies?

  • There's a sleep button on my laptop. Doesn't mean I would use it.

    I'm just trying to say you're using a feature that everyone kind of knows doesn't work. ChatGPT is not trained to do calculations well.

    I just like technology, and I fully believe the left's hatred of it is not logical. I believe it stems from a lot of media BS and headlines. Why there's this push from the media is something I would like to know more about. But overall, I see a lot of the same makers of bullshit yellow journalism for this stuff on the left as I do for similar bullshit in right-wing spaces towards other things.

    Again with dismissing the evidence of my own eyes!

    I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

    Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.

  • That looks better. Even with a fair coin, 10 heads in a row is almost impossible.

    And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.

    Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.
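
To put a number on "astronomically higher", here is a rough sketch of how the ten-in-a-row probability changes with per-task accuracy, and what per-task accuracy a 90 percent streak would require. The human accuracy values (0.9 and 0.99) are assumptions picked for illustration, not measurements, and the attempts are again treated as independent.

```python
def streak_probability(p_single: float, attempts: int = 10) -> float:
    """Chance of getting `attempts` independent tasks right in a row."""
    return p_single ** attempts


def required_single_accuracy(target_streak: float, attempts: int = 10) -> float:
    """Per-task accuracy needed so that `attempts` in a row succeed
    with probability `target_streak` (inverse of streak_probability)."""
    return target_streak ** (1.0 / attempts)


# 0.3 is the figure from the thread; 0.9 and 0.99 are illustrative guesses
# for a careful human, not measured values.
for p in (0.3, 0.9, 0.99):
    print(f"per-task accuracy {p:.2f} -> ten in a row: {streak_probability(p):.6f}")

needed = required_single_accuracy(0.9)
print(f"accuracy needed for a 90% chance of ten in a row: {needed:.4f}")
```

The point of the inverse calculation is that a long error-free run needs per-task accuracy very close to 1, so even modest per-task error rates compound quickly.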

  • Again with dismissing the evidence of my own eyes!

    I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

    Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.

    What does "I give it data to put in a formulaic sentence" mean here?

    Why not just share the details? I often find a lot of people saying it's doing crazy things, and they never like to share the details. It's very similar to discussing things with Trump supporters who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that trumpets do. It's wild how similar it is.

    And yes asking to do things like iterate over rows isn't how it works. It's getting better but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence but it's nowhere near what its strength is.

  • Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.

    Dunno. Ask 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

  • What does "I give it data to put in a formulaic sentence" mean here?

    Why not just share the details? I often find a lot of people saying it's doing crazy things, and they never like to share the details. It's very similar to discussing things with Trump supporters who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that trumpets do. It's wild how similar it is.

    And yes asking to do things like iterate over rows isn't how it works. It's getting better but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence but it's nowhere near what its strength is.

    I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

    The solution to the problem was to realise that an LLM cannot be trusted for accuracy: even if the first few results are completely accurate, the bullshit will creep in. Don't trust the LLM. Check every fucking thing.

    In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

    The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.
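
For readers wondering what such a "quick script" might look like, here is a minimal sketch of the approach described above: split each input line on tab characters and drop the fields into a fixed sentence. The commenter's actual data and sentence are not public, so the column names in `FIELDS`, the wording of `TEMPLATE`, and the sample rows are all hypothetical.

```python
# Hypothetical sentence template and column layout; the real ones were never shared.
TEMPLATE = "{name} ordered {quantity} units of {product}."
FIELDS = ("name", "quantity", "product")


def rows_to_sentences(tsv_text: str) -> list:
    """Turn tab-separated rows into one formulaic sentence per row."""
    sentences = []
    for line_number, line in enumerate(tsv_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        values = line.split("\t")
        if len(values) != len(FIELDS):
            # Fail loudly instead of guessing, unlike a model that drifts.
            raise ValueError(
                f"line {line_number}: expected {len(FIELDS)} fields, got {len(values)}"
            )
        sentences.append(TEMPLATE.format(**dict(zip(FIELDS, values))))
    return sentences


# Made-up sample input, purely for illustration.
sample = "Ada\t3\twidgets\nGrace\t12\tgaskets\n"
for sentence in rows_to_sentences(sample):
    print(sentence)
```

A deterministic script like this treats row 700 exactly as it treats row 7, and a malformed row raises an error rather than producing a plausible-looking sentence.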

  • Dunno. Ask 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

    You're better off asking one human to do the same task ten times. Humans get better and faster at things as they go along. Always slower than an LLM, but an LLM gets more and more likely to veer off on some flight of fancy, further and further from reality, the more it says to you. The chances of it staying factual in the long term are really low.

    It's a born bullshitter. It knows a little about a lot, but it has no clue what's real and what's made up, or it doesn't care.

    If you want some text quickly, that sounds right, but you genuinely don't care whether it is right at all, go for it, use an LLM. It'll be great at that.

  • Reading with CEO mindset. 3 out of 10 employees can be fired.
