AI agents wrong ~70% of time: Carnegie Mellon study

  • This is the same kind of short-sighted dismissal I see a lot in the religion vs science argument. When they hinge their pro-religion stance on the things science can’t explain, they’re defending an ever diminishing territory as science grows to explain more things. It’s a stupid strategy with an expiration date on your position.

    All of the anti-AI positions that hinge on the low quality or reliability of the output are defending an increasingly diminished stance as the AIs are further refined. And I simply don’t believe that the majority of the people making this argument actually care about the quality of the output. Even when it gets to the point of producing better output than humans across the board, these folks are still going to oppose it regardless. Why not just openly oppose it in general, instead of pinning your position to an argument that grows increasingly irrelevant by the day?

    DeepSeek exposed the same issue with the anti-AI people dedicated to the environmental argument. We were shown proof that there’s significant progress in the development of efficient models, and it still didn’t change any of their minds. Because most of them don’t actually care about the environmental impacts. It’s just an anti-AI talking point that resonated with them.

    The more baseless these anti-AI stances get, the more it seems to me that it’s a lot of people afraid of change and afraid of the fundamental economic shifts this will require, but they’re embarrassed or unable to articulate that stance. And it doesn’t help that the luddites haven’t been able to predict a single development. Just constantly flailing to craft a new argument to criticize the current models and tech. People are learning not to take these folks seriously.

    Maybe the marketers should be a bit more picky about what they slap "AI" on, and maybe decision makers should be a little less eager to follow whatever Better Autocomplete spits out. But maybe that's just me, and we really should be pretending that all these algorithms really have made humans obsolete and that generating convincing language is better than correspondence with reality.

  • Maybe the marketers should be a bit more picky about what they slap "AI" on, and maybe decision makers should be a little less eager to follow whatever Better Autocomplete spits out. But maybe that's just me, and we really should be pretending that all these algorithms really have made humans obsolete and that generating convincing language is better than correspondence with reality.

    I’m not sure the anti-AI marketing stance is any more solid of a position. Though it’s probably easier to defend, since it’s so vague and not based on anything measurable.

  • I’m not sure the anti-AI marketing stance is any more solid of a position. Though it’s probably easier to defend, since it’s so vague and not based on anything measurable.

    Calling AI measurable is somewhat unfounded. Between not having a coherent, agreed-upon definition of what does and does not constitute an AI (we are, after all, discussing LLMs as though they were AGI), and the difficulty that exists in discussing the qualifications of human intelligence, saying that a given metric covers how well a thing is an AI isn't really founded on anything but preference. We could, for example, say that mathematical ability is indicative of intelligence, but claiming FLOPS is a proxy for intelligence falls rather flat. We can measure things about the various algorithms, but that's an awful long ways off from talking about AI itself (unless we've bought into the marketing hype).

  • Calling AI measurable is somewhat unfounded. Between not having a coherent, agreed-upon definition of what does and does not constitute an AI (we are, after all, discussing LLMs as though they were AGI), and the difficulty that exists in discussing the qualifications of human intelligence, saying that a given metric covers how well a thing is an AI isn't really founded on anything but preference. We could, for example, say that mathematical ability is indicative of intelligence, but claiming FLOPS is a proxy for intelligence falls rather flat. We can measure things about the various algorithms, but that's an awful long ways off from talking about AI itself (unless we've bought into the marketing hype).

    So you’re saying the article’s measurements about AI agents being wrong 70% of the time are made up? Or is AI performance only measurable when the results help anti-AI narratives?

  • This post did not contain any content.

    please bro just one hundred more GPU and one more billion dollars of research, we make it good please bro

  • It’s usually vastly easier to verify an answer than posit one, if you have the patience to do so.

    I usually write 3x the code to test the code itself. Verification is often harder than implementation.

    It really depends on the context. Sometimes there are domains which require solving problems in NP, but where it turns out that most of these problems are actually not hard to solve by hand with a bit of tinkering. SAT solvers might completely fail, but humans can do it. Often it turns out that this means there's a better algorithm that can exploit commonalities in the data. But a brute force approach might just be to give it to an LLM and then verify its answer. Verifying NP problems is easy.

    (This is speculation.)
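To make the "verifying is easy" half of this concrete, here is a minimal sketch of checking a proposed answer to a SAT instance. The function name `verify_cnf` and the DIMACS-style integer encoding are assumptions chosen for illustration; nothing in the thread specifies them. The check is a single pass over the clauses, regardless of how hard the answer was to find.

```python
from typing import Dict, List

# Assumed encoding (DIMACS-style): literal 3 means x3, literal -3 means NOT x3.
Clause = List[int]


def verify_cnf(clauses: List[Clause], assignment: Dict[int, bool]) -> bool:
    """Return True if the assignment satisfies every clause.

    Variables missing from the assignment are treated as False.
    Runs in time linear in the total number of literals.
    """
    for clause in clauses:
        # A clause is satisfied when at least one literal matches the assignment.
        satisfied = any(
            assignment.get(abs(lit), False) == (lit > 0) for lit in clause
        )
        if not satisfied:
            return False
    return True


if __name__ == "__main__":
    # (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
    formula = [[1, -2], [2, 3], [-1, -3]]
    candidate = {1: True, 2: True, 3: False}  # e.g. an answer proposed by an LLM
    print(verify_cnf(formula, candidate))
```

The same shape applies to any proposed answer with a cheap certificate check: accept it only if the checker passes, and otherwise ask again or fall back to a solver.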

  • being able to do 30% of tasks successfully is already useful.

    If you have a good testing program, it can be.

    If you use AI to write the test cases...? I wouldn't fly on that airplane.

    obviously

  • Run something with a 70% failure rate 10x and you get to a cumulative 98% pass rate.
    LLMs don't get tired and they can be run in parallel.

    The problem is they are not i.i.d., so this doesn't really work. It works a bit, which is in my opinion why chain-of-thought is effective (it gives the LLM a chance to posit a couple answers first). However, we're already looking at "agents," so they're probably already doing chain-of-thought.
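The arithmetic in this exchange can be sketched as follows. Under full independence, ten attempts at a 30% success rate give 1 - 0.7^10, which is about 0.972. The correlated case below is a toy model and purely an assumption for illustration: a fraction of tasks (`hard_fraction`, a made-up parameter) fails on every attempt, and the remaining tasks are retried independently at a rate chosen so the overall per-attempt success rate stays the same.

```python
def pass_rate_independent(p_success: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_success) ** attempts


def pass_rate_correlated(p_success: float, attempts: int, hard_fraction: float) -> float:
    """Toy non-i.i.d. model (an assumption, not taken from the study).

    `hard_fraction` of tasks always fail no matter how often they are retried.
    The other tasks succeed independently per attempt, at a rate scaled up so
    that the marginal per-attempt success rate is still `p_success`.
    """
    if not 0.0 <= hard_fraction < 1.0:
        raise ValueError("hard_fraction must be in [0, 1)")
    p_on_solvable = p_success / (1.0 - hard_fraction)
    if p_on_solvable > 1.0:
        raise ValueError("hard_fraction is too large for this p_success")
    return (1.0 - hard_fraction) * pass_rate_independent(p_on_solvable, attempts)


if __name__ == "__main__":
    print(f"independent: {pass_rate_independent(0.3, 10):.3f}")
    print(f"correlated:  {pass_rate_correlated(0.3, 10, hard_fraction=0.5):.3f}")
```

In this toy model, retries can never lift the pass rate above `1 - hard_fraction`, which is the point of the i.i.d. objection: repetition helps only on tasks the model can sometimes solve.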

  • I have actually been doing this lately: iteratively prompting AI to write software and fix its errors until something useful comes out. It's a lot like machine translation. I speak fluent C++, but I don't speak Rust, but I can hammer away on the AI (with English language prompts) until it produces passable Rust for something I could write for myself in C++ in half the time and effort.

    I also don't speak Finnish, but Google Translate can take what I say in English and put it into at least somewhat comprehensible Finnish without egregious translation errors most of the time.

    Is this useful? When C++ is getting banned for "security concerns" and Rust is the required language, it's at least a little helpful.

    I'm impressed you can make strides with Rust with AI. I am in a similar boat, except I've found LLMs are terrible with Rust.

  • No, it matters. You're pushing the lie they want pushed.

    Hitler liked to paint, doesn't make painting wrong. The fact that big tech is pushing AI isn't evidence against the utility of AI.

    That common parlance is to call machine learning "AI" these days doesn't matter to me in the slightest. Do you have a definition of "intelligence"? Do you object when pathfinding is called AI? Or STRIPS? Or bots in a video game? Dare I say it, the main difference between those AIs and LLMs is their generality -- so why not just call it GAI at this point tbh. This is a question of semantics so it really doesn't matter to the deeper question. Doesn't matter if you call it AI or not, LLMs work the same way either way.

  • So you’re saying the article’s measurements about AI agents being wrong 70% of the time are made up? Or is AI performance only measurable when the results help anti-AI narratives?

    I would definitely bet it's made up and poorly designed.

    I wish that weren't the case because having actual data would be nice, but these are almost always funded with some sort of intentional slant, for example nic vape safety where they clearly don't use the product sanely and then make wild claims about how there's lead in the vapes!

    Homie you're fucking running the shit completely dry for longer than any human could possibly actually hit the vape, no shit it's producing carcinogens.

    Go burn a bunch of paper and directly inhale the smoke and tell me paper is dangerous.

  • I would definitely bet it's made up and poorly designed.

    I wish that weren't the case because having actual data would be nice, but these are almost always funded with some sort of intentional slant, for example nic vape safety where they clearly don't use the product sanely and then make wild claims about how there's lead in the vapes!

    Homie you're fucking running the shit completely dry for longer than any human could possibly actually hit the vape, no shit it's producing carcinogens.

    Go burn a bunch of paper and directly inhale the smoke and tell me paper is dangerous.

    Agreed. 70% is astoundingly high for today’s models. Something stinks.

  • We have created the overconfident intern in digital form.

    Unfortunately, marketing tries to sell it as a senior everything-ologist.

  • DocumentDB is not for one drive documents (PDFs and such). It's for "documents" as in serialized objects (json or bson).

    That's even better, I can just jam something in before it and churn the documents through an embedding model, thanks!

  • This post did not contain any content.

    I use it for very specific tasks and give as much information as possible. I usually have to give it more feedback to get to the desired goal. For instance, I will ask it how to resolve an error message. I've even asked it for some short Python code. I almost always get good feedback when doing that. Asking it about basic facts, like science questions, works too.

    One thing I have had problems with is if the error is sort of an oddball it will give me suggestions that don't work with my OS/app version even though I gave it that info. Then I give it feedback and eventually it will loop back to its original suggestions, so it couldn't come up with an answer.

    I've also found differences between ChatGPT and MS Copilot, with ChatGPT usually giving better results.

  • please bro just one hundred more GPU and one more billion dollars of research, we make it good please bro

    And let it suck up 10% or so of all of the power in the region.

  • The first half dozen times I tried AI for code, across the past year or so, it failed pretty much as you describe.

    Finally, I hit on some things it can do. For me: keeping the instructions more general, not specifying certain libraries for instance, was the key to getting something that actually does something. Also, if it doesn't show you the whole program, get it to show you the whole thing, and make it fix its own mistakes so you can build on working code with later requests.

    I've had good results being very specific, like "Generate some Python 3 code for me that converts X to Y, recursively through all subdirectories, and converts the files in place."

  • It's absolutely dangerous but it doesn't have to work even a little to do damage; hell, it already has. Your thing just makes it sound much more capable than it is. And it is not.

    Also, it's not AI.

    Edit: and in a comment replying to this one, one of your fellow fanboys proved

    everyone knows how they work

    Wrong

    the industrial revolution could be seen as dangerous, yet it brought the highest standard of living increase in centuries

  • So you’re saying the article’s measurements about AI agents being wrong 70% of the time are made up? Or is AI performance only measurable when the results help anti-AI narratives?

    I mean, sure, in that the expectation is that the article is talking about AI in general. The cited paper is discussing LLMs and their ability to complete tasks. So, we have to agree that LLMs are what we mean by AI, and that their ability to complete tasks is a valid metric for AI. If we accept the marketing hype, then of course LLMs are exactly what we've been talking about with AI, and we've accepted LLMs' features and limitations as what AI is. If LLMs are prone to filling in with whatever closest fits the model without regard to accuracy, then by accepting LLMs as what we mean by AI, we accept that AI fits to its model without regard to accuracy.

  • I'm impressed you can make strides with Rust with AI. I am in a similar boat, except I've found LLMs are terrible with Rust.

    I was 0/6 on various trials of AI for Rust over the past 6 months, then I caught a success. Turns out, I was asking it to use a difficult library. I can't make the thing I want work in that library either (library docs say it's possible, but...). When I posed a more open-ended request without specifying the library to use, it succeeded, after a fashion. It will give you code with cargo build errors. I copy-paste the error back to it like "address: <pasted error message>", and a bit more than half of the time it is able to respond with a working fix.
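The copy-paste loop described above can be sketched as a small driver. This is a rough illustration under stated assumptions, not anyone's actual tooling: `ask_model` is a hypothetical callback that sends the prompt to whatever assistant is in use and writes the suggested fix into the project, and `max_rounds` is an arbitrary cap. Only `cargo build` and Python's `subprocess.run` are real interfaces here.

```python
import subprocess
from typing import Callable, Tuple

BuildResult = Tuple[bool, str]  # (build succeeded, compiler error text)


def cargo_build(project_dir: str) -> BuildResult:
    """Run `cargo build` in project_dir; cargo prints diagnostics to stderr."""
    result = subprocess.run(
        ["cargo", "build"], cwd=project_dir, capture_output=True, text=True
    )
    return result.returncode == 0, result.stderr


def fix_until_it_builds(
    build: Callable[[], BuildResult],
    ask_model: Callable[[str], None],  # hypothetical: prompts the LLM and applies its fix
    max_rounds: int = 5,
) -> bool:
    """Rebuild and feed errors back to the model, up to max_rounds prompts."""
    for round_number in range(max_rounds + 1):
        ok, errors = build()
        if ok:
            return True
        if round_number < max_rounds:
            # Same prompt shape as in the comment: "address: <pasted error message>"
            ask_model(f"address: {errors}")
    return False
```

A call would look like `fix_until_it_builds(lambda: cargo_build("my_crate"), ask_model)`, where `"my_crate"` is a placeholder path. Passing the build step in as a callable keeps the loop testable without a Rust toolchain.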
