AI agents wrong ~70% of time: Carnegie Mellon study

  • This post did not contain any content.

    And it won’t be until humans can agree on what’s a fact and true vs. not. There is always someone or some group spreading mis/disinformation.

  • If that’s the quality of answer you’re getting, then it’s a user error

    No, I know the data I gave it and I know how hard I tried to get it to use it truthfully.

    You have an irrational and wildly inaccurate belief in the infallibility of LLMs.

    You're also denying the evidence of my own experience. What on earth made you think I would believe you over what I saw with my own eyes?

    Why are you giving it data? It's a chat and language tool. It's not data based. You need something trained to work for that specific use. I think Wolfram Alpha has better tools for that.

    I wouldn't trust it to calculate how many patio stones I need to build a project. But I trust it to tell me where to find a good source on a topic, or whether a quote was really said by whoever, or to help me remember something when I only have vague pieces, like an old-timey historical witch-burning factoid about villagers who pulled people through a hole in the church wall, or who the princess was who was a skeptic and sent her scientist to villages to try to calm superstitious panic.

    Other uses are things like digging around my computer and seeing what processes do what, or how concepts work in the thing I'm currently learning. So many excellent uses. But I fucking wouldn't trust it to do any kind of calculation.

  • You probably wanted to show off how smart you are, but instead you showed that you can't even talk to people without help of your favourite slop bucket.
    It didn't answer my curiosity about what came first, but it solidified my conviction that your brain is cooked all the way, probably beyond repair. I would say you need to seek professional help, but at this point you would interpret it as needing to talk to the autocomplete, and it will cook you even more.
    It started funny, but I feel very sorry for you now, and it sucked all the humour out.

    You just can't talk to people, period, you are just a dick, you were also just proven to be stupider than a fucking LLM, have a nice day 😀

  • I actually have a fairly positive experience with AI (Copilot using Claude specifically). Is it wrong a lot if you give it a huge task? Yes, so I don't do that, and I use it as a very targeted solution if I'm feeling very lazy that day. Is it fast? Also no. I could actually be faster than AI in some cases.
    But is it good if you have been working for 6h and you just don't have enough mental capacity for the rest of the day? Yes. You can just prompt it specifically enough to get the desired result and only accept correct responses. Is it always good? Not really, but good enough. Do I also suck after 3pm? Yes.
    My main issue is actually the fact that it saves first and then asks you to pick whether you want to use it. Not a problem usually, but if it crashes the generated code stays, so that part sucks.

    You should give Claude Code a shot if you have a Claude subscription. I'd say this is where AI actually does a decent job: picking up human slack, under supervision, not replacing humans at anything. AI tools won't suddenly be productive enough to employ, but I as a professional can use them to accelerate my own workflow. It's actually where the risk of them taking jobs is real: for example, instead of 10 support people you can have 2 who just supervise the responses of an AI.

    But of course, the Devil's in the detail. The only reason this is cost effective is because of VC money subsidizing and hiding the real cost of running these models.

  • Why are you giving it data? It's a chat and language tool. It's not data based. You need something trained to work for that specific use. I think Wolfram Alpha has better tools for that.

    I wouldn't trust it to calculate how many patio stones I need to build a project. But I trust it to tell me where to find a good source on a topic, or whether a quote was really said by whoever, or to help me remember something when I only have vague pieces, like an old-timey historical witch-burning factoid about villagers who pulled people through a hole in the church wall, or who the princess was who was a skeptic and sent her scientist to villages to try to calm superstitious panic.

    Other uses are things like digging around my computer and seeing what processes do what, or how concepts work in the thing I'm currently learning. So many excellent uses. But I fucking wouldn't trust it to do any kind of calculation.

    Why are you giving it data

    Because there's a button for that.

    Its output is dependent on the input

    This thing that you said... It's false.

  • This post did not contain any content.

    Wow. 30% accuracy was the high score!
    From the article:

    Testing agents at the office

    For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

    They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

    The CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

    ⚫ Gemini-2.5-Pro (30.3 percent)
    ⚫ Claude-3.7-Sonnet (26.3 percent)
    ⚫ Claude-3.5-Sonnet (24 percent)
    ⚫ Gemini-2.0-Flash (11.4 percent)
    ⚫ GPT-4o (8.6 percent)
    ⚫ o3-mini (4.0 percent)
    ⚫ Gemini-1.5-Pro (3.4 percent)
    ⚫ Amazon-Nova-Pro-v1 (1.7 percent)
    ⚫ Llama-3.1-405b (7.4 percent)
    ⚫ Llama-3.3-70b (6.9 percent)
    ⚫ Qwen-2.5-72b (5.7 percent)
    ⚫ Llama-3.1-70b (1.7 percent)
    ⚫ Qwen-2-72b (1.1 percent)

    "We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper.

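As a quick sanity check on the headline figure, the success rates quoted above can be flipped into failure rates. This is only a sketch: the numbers are copied from the list in the comment, while the names `success_rates` and `failure_rate` are made up for illustration and do not come from the paper.

```python
# Task success rates (percent), copied from the list quoted above.
success_rates = {
    "Gemini-2.5-Pro": 30.3,
    "Claude-3.7-Sonnet": 26.3,
    "Claude-3.5-Sonnet": 24.0,
    "Gemini-2.0-Flash": 11.4,
    "GPT-4o": 8.6,
    "o3-mini": 4.0,
    "Gemini-1.5-Pro": 3.4,
    "Amazon-Nova-Pro-v1": 1.7,
    "Llama-3.1-405b": 7.4,
    "Llama-3.3-70b": 6.9,
    "Qwen-2.5-72b": 5.7,
    "Llama-3.1-70b": 1.7,
    "Qwen-2-72b": 1.1,
}


def failure_rate(success_percent: float) -> float:
    """Percent of tasks not completed: 100 minus the success rate."""
    return round(100.0 - success_percent, 1)


# Highest-scoring model and a ranking from best to worst.
best_model = max(success_rates, key=success_rates.get)
ranked = sorted(success_rates.items(), key=lambda kv: kv[1], reverse=True)

for name, pct in ranked:
    print(f"{name:20s} success {pct:5.1f}%   failure {failure_rate(pct):5.1f}%")
```

The best model's failure rate works out to 69.7 percent, which is where the "~70%" in the title comes from. Every other model in the list fails more often than that.
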
  • Ah, my bad, you're right, for being consistently correct, I should have done 0.3^10=0.0000059049

    so the chances of it being right ten times in a row are less than one thousandth of a percent.

    No wonder I couldn't get it to summarise my list of data right and it was always lying by the 7th row.

    That looks better. Even with a fair coin, 10 heads in a row is very unlikely (0.5^10, roughly 0.1%).

    And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.
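
The arithmetic in this exchange can be reproduced in a few lines. A minimal sketch, assuming every attempt is independent and has the same success rate (which real tasks probably don't satisfy); the function name `all_correct_probability` is made up for illustration.

```python
def all_correct_probability(p_success: float, attempts: int) -> float:
    """Chance that every one of `attempts` independent tries succeeds."""
    return p_success ** attempts


# Per-task success rate rounded from the 30.3 percent reported for the best model.
agent = all_correct_probability(0.3, 10)

# Ten heads in a row with a fair coin, for comparison.
coin = all_correct_probability(0.5, 10)

print(f"agent, 10 in a row: {agent:.10f} ({agent * 100:.6f}%)")
print(f"coin, 10 in a row:  {coin:.10f} ({coin * 100:.4f}%)")
```

Under these assumptions the agent figure matches the 0.0000059049 quoted above, and the coin comes out at 1 in 1,024.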

  • You just can't talk to people, period, you are just a dick, you were also just proven to be stupider than a fucking LLM, have a nice day 😀

    Did the autocomplete tell you to answer this? Don't answer, actually, save some energy.

  • This post did not contain any content.

    Now I'm curious, what's the average score for humans?

  • The 256 thing was written by a person. AI doesn't have exclusive rights to being dumb, plenty of dumb people around.

    you're right, the dumb of AI is completely comparable to the dumb of human, there's no difference worth talking about, sorry i even spoke the fuck up

  • This post did not contain any content.

    I asked Claude 3.5 Haiku to write me a quine in COBOL in the BS2000 dialect. Claude does know that creating a perfect quine in COBOL is challenging due to the need to represent the self-referential nature of the code. After a few suggestions Claude restated its first draft, without proper BS2000 incantations, without a perform statement, and without any self-referential redefines. It's a lot of work. I stopped caring and moved on.

    For those who wonder: https://sourceforge.net/p/gnucobol/discussion/lounge/thread/495d8008/ has an example.

    Colour me unimpressed. I dread the day when they force the use of 'AI' on us at work.

  • Why are you giving it data

    Because there's a button for that.

    Its output is dependent on the input

    This thing that you said... It's false.

    There's a sleep button on my laptop. Doesn't mean I would use it.

    I'm just trying to say you're using a feature that everyone kind of knows doesn't work. ChatGPT is not trained to do calculations well.

    I just like technology, and I fully believe the left's hatred of it is not logical. I believe it stems from a lot of media BS and headlines. Why there's this push from the media is a question I would like to know more about. But overall, I see a lot of the same makers of bullshit yellow journalism for this stuff on the left as I do for similar bullshit in right-wing spaces towards other things.

  • America: "Good enough to handle 911 calls!"

    Is there really a plan to use this for 911 services??

  • Wow. 30% accuracy was the high score!
    From the article:

    Testing agents at the office

    For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

    They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

    The CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

    ⚫ Gemini-2.5-Pro (30.3 percent)
    ⚫ Claude-3.7-Sonnet (26.3 percent)
    ⚫ Claude-3.5-Sonnet (24 percent)
    ⚫ Gemini-2.0-Flash (11.4 percent)
    ⚫ GPT-4o (8.6 percent)
    ⚫ o3-mini (4.0 percent)
    ⚫ Gemini-1.5-Pro (3.4 percent)
    ⚫ Amazon-Nova-Pro-v1 (1.7 percent)
    ⚫ Llama-3.1-405b (7.4 percent)
    ⚫ Llama-3.3-70b (6.9 percent)
    ⚫ Qwen-2.5-72b (5.7 percent)
    ⚫ Llama-3.1-70b (1.7 percent)
    ⚫ Qwen-2-72b (1.1 percent)

    "We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper.

    sounds like the fault of the researchers not to build better tests or understand the limits of the software to use it right

  • sounds like the fault of the researchers not to build better tests or understand the limits of the software to use it right

    Are you arguing they should have built a test that makes AI perform better? How are you offended on behalf of AI?

  • you're right, the dumb of AI is completely comparable to the dumb of human, there's no difference worth talking about, sorry i even spoke the fuck up

    No worries.

  • This post did not contain any content.

    Why would they be right beyond word sequence frequencies?

  • There's a sleep button on my laptop. Doesn't mean I would use it.

    I'm just trying to say you're using a feature that everyone kind of knows doesn't work. ChatGPT is not trained to do calculations well.

    I just like technology, and I fully believe the left's hatred of it is not logical. I believe it stems from a lot of media BS and headlines. Why there's this push from the media is a question I would like to know more about. But overall, I see a lot of the same makers of bullshit yellow journalism for this stuff on the left as I do for similar bullshit in right-wing spaces towards other things.

    Again with dismissing the evidence of my own eyes!

    I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

    Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.

  • That looks better. Even with a fair coin, 10 heads in a row is very unlikely (0.5^10, roughly 0.1%).

    And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.

    Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.

  • Again with dismissing the evidence of my own eyes!

    I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

    Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.

    What does "I give it data to put in a formulaic sentence" mean here?

    Why not just share the details? I often find that a lot of people say it's doing crazy things and never like to share the details. It's very similar to discussing things with Trump supporters, who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that Trumpers do. It's wild how similar it is.

    And yes, asking it to do things like iterate over rows isn't how it works. It's getting better, but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence, but that's nowhere near what its strength is.
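
For the kind of job argued about in this thread (turning each row of data into the same formulaic sentence), a plain template is one deterministic alternative that doesn't depend on how many tokens a model can keep track of. A minimal sketch only: the actual data was never shared, so the column names, rows, and sentence template below are all hypothetical.

```python
import csv
import io

# Hypothetical data; the real rows and columns were not shared in the thread.
RAW = """name,item,quantity
Alice,patio stones,40
Bob,fence posts,12
"""

# Hypothetical sentence pattern; each {field} must match a CSV column name.
TEMPLATE = "{name} ordered {quantity} {item}."


def rows_to_sentences(raw_csv: str, template: str) -> list[str]:
    """Fill the same template once per CSV row, in row order."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [template.format(**row) for row in reader]


sentences = rows_to_sentences(RAW, TEMPLATE)
for sentence in sentences:
    print(sentence)
```

Because every row goes through the same template, row 7 is handled exactly like row 1. The trade-off is that it only works when the sentence really is formulaic.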
