Skip to content

AI agents wrong ~70% of time: Carnegie Mellon study

Technology
277 108 90
  • Maybe it is because I started out in QA, but I have to strongly disagree. You should assume the code doesn't work until proven otherwise, AI or not. Then when it doesn't work I find it is easier to debug you own code than someone else's and that includes AI.

    I've been R&D forever, so at my level the question isn't "does the code work?" we pretty much assume that will take care of itself, eventually. Our critical question is: "is the code trying to do something valuable, or not?" We make all kinds of stuff do what the requirements call for it to do, but so often those requirements are asking for worthless or even counterproductive things...

  • I've been R&D forever, so at my level the question isn't "does the code work?" we pretty much assume that will take care of itself, eventually. Our critical question is: "is the code trying to do something valuable, or not?" We make all kinds of stuff do what the requirements call for it to do, but so often those requirements are asking for worthless or even counterproductive things...

    Literally the opposite experience when I helped material scientists with their R&D. Breaking in production would mean people who get paid 2x more than me are suddenly unable to do their job. But then again, our requirements made sense because we would literally look at a manual process to automate with the engineers. What you describe sounds like hell to me. There are greener pastures.

  • Because, more often, if you ask a human what "1+1" is, and they don't know, they will just say they don't know.

    AI will confidently insist its 3, and make up math algorythms to prove it.

    And every company is pushing AI out on everyone like its always 10000% correct.

    Its also shown its not intelligent. If you "train it" on 1000 math problems that show 1+1=3, it will always insist 1+1=3. It does not actually know how to add numbers, despite being a computer.

    Haha. Sure. Humans never make up bullshit to confidently sell a fake answer.

    Fucking ridiculous.

  • Literally the opposite experience when I helped material scientists with their R&D. Breaking in production would mean people who get paid 2x more than me are suddenly unable to do their job. But then again, our requirements made sense because we would literally look at a manual process to automate with the engineers. What you describe sounds like hell to me. There are greener pastures.

    Yeah, sometimes the requirements write themselves and in those cases successful execution is "on the critical path."

    Unfortunately, our requirements are filtered from our paying customers through an ever rotating cast of Marketing and Sales characters who, nominally, are our direct customers so we make product for them - but they rarely have any clear or consistent vision of what they want, but they know they want new stuff - that's for sure.

  • Yeah, sometimes the requirements write themselves and in those cases successful execution is "on the critical path."

    Unfortunately, our requirements are filtered from our paying customers through an ever rotating cast of Marketing and Sales characters who, nominally, are our direct customers so we make product for them - but they rarely have any clear or consistent vision of what they want, but they know they want new stuff - that's for sure.

    When requirements are "Whatever" then by all means use the "Whatever" machine: https://eev.ee/blog/2025/07/03/the-rise-of-whatever/

    And then look for a better gig because such an environment is going to be toxic to your skill set. The more exacting the shop, the better they pay.

  • I'd just like to point out that, from the perspective of somebody watching AI develop for the past 10 years, completing 30% of automated tasks successfully is pretty good! Ten years ago they could not do this at all. Overlooking all the other issues with AI, I think we are all irritated with the AI hype people for saying things like they can be right 100% of the time -- Amazon's new CEO actually said they would be able to achieve 100% accuracy this year, lmao. But being able to do 30% of tasks successfully is already useful.

    I think this comment made me finally understand the AI hate circlejerk on lemmy. If you have no clue how LLMs work and you have no idea where "AI" is coming from, it just looks like another crappy product that was thrown on the market half-ready. I guess you can only appreciate the absolutely incredible development of LLMs (and AI in general) that happened during the last ~5 years if you can actually see it in the first place.

  • I have been using AI to write (little, near trivial) programs. It's blindingly obvious that it could be feeding this code to a compiler and catching its mistakes before giving them to me, but it doesn't... yet.

    Agents do that loop pretty well now, and Claude now uses your IDE's LSP to help it code and catch errors in flow. I think Windsurf or Cursor also do that also.

    The tooling has improved a ton in the last 3 months.

  • When requirements are "Whatever" then by all means use the "Whatever" machine: https://eev.ee/blog/2025/07/03/the-rise-of-whatever/

    And then look for a better gig because such an environment is going to be toxic to your skill set. The more exacting the shop, the better they pay.

    The more exacting the shop, the better they pay.

    That hasn't been my experience, but it sounds like good advice anyway. My experience has been that the more profitable the parent company, the better the job security and the better the pay too. Once "in," tune in to the culture and align with the people at your level and above who seem like they'll be sticking around long term. If the company isn't financially secure, all bets are off and you should be seeking, and taking, a better offer when you can find one.

    I knocked around startups for 10/22 years (depending on how you characterize that one 12 year gig that ended with everybody laid off...) The pay was good enough, but job security just wasn't on the menu. Finally, one got bought by a big fish and I've been in the belly of the beast for 11 years now.

  • I think it's lemmy users. I see a lot more LLM skepticism here than in the news feeds.

    In my experience, LLMs are like the laziest, shittiest know-nothing bozo forced to complete a task with zero attention to detail and zero care about whether it's crap, just doing enough to sound convincing.

    😆 I can't believe how absolutely silly a lot of you sound with this.

    LLM is a tool. It's output is dependent on the input. If that's the quality of answer you're getting, then it's a user error. I guarantee you that LLM answers for many problems are definitely adequate.

    It's like if a carpenter said the cabinets turned out shit because his hammer only produces crap.

    Also another person commented that seen the pattern you also see means we're psychotic.

    All I'm trying to suggest is Lemmy is getting seriously manipulated by the media attitude towards LLMs and these comments I feel really highlight that.

  • LLMs are like a multitool, they can do lots of easy things mostly fine as long as it is not complicated and doesn't need to be exactly right. But they are being promoted as a whole toolkit as if they are able to be used to do the same work as effectively as a hammer, power drill, table saw, vise, and wrench.

    It is truly terrible marketing. It's been obvious to me for years the value is in giving it to people and enabling them to do more with less, not outright replacing humans, especially not expert humans.

    I use AI/LLMs pretty much every day now. I write MCP servers and automate things with it and it's mind blowing how productive it makes me.

    Just today I used these tools in a highly supervised way to complete a task that would have been a full day of tedius work, all done in an hour. That is fucking fantastic, it's means I get to spend that time on more important things.

    It's like giving an accountant excel. Excel isn't replacing them, but it's taking care of specific tasks so they can focus on better things.

    On the reliability and accuracy front there is still a lot to be desired, sure. But for supervised chats where it's calling my tools it's pretty damn good.

  • than reading an actual intro on an unfamiliar topic

    The LLM helps me know what to look for in order to find that unfamiliar topic.

    For example, I was tasked to support a file format that's common in a very niche field and never used elsewhere, and unfortunately shares an extension with a very common file format, so searching for useful data was nearly impossible. So I asked the LLM for details about the format and applications of it, provided what I knew, and it spat out a bunch of keywords that I then used to look up more accurate information about that file format. I only trusted the LLM output to the extent of finding related, industry-specific terms to search up better information.

    Likewise, when looking for libraries for a coding project, none really stood out, so I asked the LLM to compare the popular libraries for solving a given problem. The LLM spat out a bunch of details that were easy to verify (and some were inaccurate), which helped me narrow what I looked for in that library, and the end result was that my search was done in like 30 min (about 5 min dealing w/ LLM, and 25 min checking the projects and reading a couple blog posts comparing some of the libraries the LLM referred to).

    I think this use case is a fantastic use of LLMs, since they're really good at generating text related to a query.

    It’s going to say something plausible, and you tautologically are not in a position to verify it.

    I absolutely am though. If I am merely having trouble recalling a specific fact, asking the LLM to generate it is pretty reasonable. There are a ton of cases where I'll know the right answer when I see it, like it's on the tip of my tongue but I'm having trouble materializing it. The LLM might spit out two wrong answers along w/ the right one, but it's easy to recognize which is the right one.

    I'm not going to ask it facts that I know I don't know (e.g. some historical figure's birth or death date), that's just asking for trouble. But I'll ask it facts that I know that I know, I'm just having trouble recalling.

    The right use of LLMs, IMO, is to generate text related to a topic to help facilitate research. It's not great at doing the research though, but it is good at helping to formulate better search terms or generate some text to start from for whatever task.

    general search on the web?

    I agree, it's not great for general search. It's great for turning a nebulous question into better search terms.

    It's a bit frustrating that finding these tools useful is so often met with it can't be useful for that, when it definitely is.

    More than any other tool in history LLMs have a huge dose of luck involved and a learning curve on how to ask the right things the right way. And those method change and differ between models too.

  • than reading an actual intro on an unfamiliar topic

    The LLM helps me know what to look for in order to find that unfamiliar topic.

    For example, I was tasked to support a file format that's common in a very niche field and never used elsewhere, and unfortunately shares an extension with a very common file format, so searching for useful data was nearly impossible. So I asked the LLM for details about the format and applications of it, provided what I knew, and it spat out a bunch of keywords that I then used to look up more accurate information about that file format. I only trusted the LLM output to the extent of finding related, industry-specific terms to search up better information.

    Likewise, when looking for libraries for a coding project, none really stood out, so I asked the LLM to compare the popular libraries for solving a given problem. The LLM spat out a bunch of details that were easy to verify (and some were inaccurate), which helped me narrow what I looked for in that library, and the end result was that my search was done in like 30 min (about 5 min dealing w/ LLM, and 25 min checking the projects and reading a couple blog posts comparing some of the libraries the LLM referred to).

    I think this use case is a fantastic use of LLMs, since they're really good at generating text related to a query.

    It’s going to say something plausible, and you tautologically are not in a position to verify it.

    I absolutely am though. If I am merely having trouble recalling a specific fact, asking the LLM to generate it is pretty reasonable. There are a ton of cases where I'll know the right answer when I see it, like it's on the tip of my tongue but I'm having trouble materializing it. The LLM might spit out two wrong answers along w/ the right one, but it's easy to recognize which is the right one.

    I'm not going to ask it facts that I know I don't know (e.g. some historical figure's birth or death date), that's just asking for trouble. But I'll ask it facts that I know that I know, I'm just having trouble recalling.

    The right use of LLMs, IMO, is to generate text related to a topic to help facilitate research. It's not great at doing the research though, but it is good at helping to formulate better search terms or generate some text to start from for whatever task.

    general search on the web?

    I agree, it's not great for general search. It's great for turning a nebulous question into better search terms.

    One word of caution with AI searxh is that it's weirdly vulnerable to SEO.

    If you search for "best X for Y" and a company has an article on their blog about how their product solves a problem the AI can definitely summarize that into a "users don't like that foolib because of ...". At least that's been my experience looking for software vendors.

  • I tried to dictate some documents recently without paying the big bucks for specialized software, and was surprised just how bad Google and Microsoft's speech recognition still is. Then I tried getting Word to transcribe some audio talks I had recorded, and that resulted in unreadable stuff with punctuation in all the wrong places. You could just about make out what it meant to say, so I tried asking various LLMs to tidy it up. That resulted in readable stuff that was largely made up and wrong, which also left out large chunks of the source material. In the end I just had to transcribe it all by hand.

    It surprised me that these AI-ish products are still unable to transcribe speech coherently or tidy up a messy document without changing the meaning.

    I don't know basic solutions that are super good, but whisper sbd the whisper derivatives I hear are decent for dictation these days.

    I have no idea how to run then though.

  • It's a bit frustrating that finding these tools useful is so often met with it can't be useful for that, when it definitely is.

    More than any other tool in history LLMs have a huge dose of luck involved and a learning curve on how to ask the right things the right way. And those method change and differ between models too.

    And that's the same w/ traditional search engines, the difference is that we're used to search engines and LLMs are new. Learn how to use the tool and decide for yourself when it's useful.

  • One word of caution with AI searxh is that it's weirdly vulnerable to SEO.

    If you search for "best X for Y" and a company has an article on their blog about how their product solves a problem the AI can definitely summarize that into a "users don't like that foolib because of ...". At least that's been my experience looking for software vendors.

    Oh sure, caution is always warranted w/ LLMs. But when it works, it can save a ton of time.

  • Were you prone to this weird leaps of logic before your brain was fried by talking to LLMs, or did you start being a fan of talking to LLMs because your ability to logic was...well...that?

    You see, I wanted to be petty and do another dismissive reply, but instead I fed our convo to copilot and asked it to explain, here you go, as you can see I have previously used it for coding tasks, so I didn't feed it any extra info, so there you go, even copilot can understand the huge "leap" I made in logic. goddamn the sweet taste of irony.

    Copilot reply:

    Certainly! Here’s an explanation Person B could consider:

    The implied logic in Person A’s argument is that if you distrust code written by Copilot (or any AI tool) simply because it wasn’t written by you, then by the same reasoning, you should also distrust code written by junior developers, since that code also isn’t written by you and may have mistakes or lack experience.

    However, in real-world software development, teams regularly review, test, and maintain code written by others—including juniors, seniors, and even AI tools. The quality of code depends on review processes, testing, and collaboration, not just on who wrote it. Dismissing Copilot-generated code outright is similar to dismissing the contributions of junior developers, which isn’t practical or productive in a collaborative environment.

  • 😆 I can't believe how absolutely silly a lot of you sound with this.

    LLM is a tool. It's output is dependent on the input. If that's the quality of answer you're getting, then it's a user error. I guarantee you that LLM answers for many problems are definitely adequate.

    It's like if a carpenter said the cabinets turned out shit because his hammer only produces crap.

    Also another person commented that seen the pattern you also see means we're psychotic.

    All I'm trying to suggest is Lemmy is getting seriously manipulated by the media attitude towards LLMs and these comments I feel really highlight that.

    If that’s the quality of answer you’re getting, then it’s a user error

    No, I know the data I gave it and I know how hard I tried to get it to use it truthfully.

    You have an irrational and wildly inaccurate belief in the infallibility of LLMs.

    You're also denying the evidence of my own experience. What on earth made you think I would believe you over what I saw with my own eyes?

  • and? we can understand 256 where AI can't, that's the point.

    The 256 thing was written by a person. AI doesn't have exclusive rights to being dumb, plenty of dumb people around.

  • So the chances of it being right ten times in a row are 2%.

  • For me as a software developer the accuracy is more in the 95%+ range.

    On one hand the built in copilot chat widget in Intellij basically replaces a lot my google queries.

    On the other hand it is rather fucking good at executing some rewrites that is a fucking chore to do manually, but can easily be done by copilot.

    Imagine you have a script that initializes your DB with some test data. You have an Insert into statement with lots of columns and rows so

    Inser into (column1,....,column n)
    Values row1,
    Row 2
    Row n

    Addig a new column with test data for each row is a PITA, but copilot handles it without issue.

    Similarly when writing unit tests you do a lot of edge case testing which is a bunch of almost same looking tests with maybe one variable changing, at most you write one of those tests, then copilot will auto generate the rest after you name the next unit test, pretty good at guessing what you want to do in that test, at least with my naming scheme.

    So yeah, it's way overrated for many-many things, but for programming it's a pretty awesome productivity tool.

    For your database test data, I usually write a helper that defaults those columns to base values, so I can pass in lists of dictionaries, then the test cases are easier to modify and read.

    It's also nice because you're only including the fields you use in your unit test, the rest are default valid you don't need to care about.