AI agents wrong ~70% of time: Carnegie Mellon study

Technology
  • LLMs are like a multitool: they can do lots of easy things mostly fine, as long as the task isn't complicated and doesn't need to be exactly right. But they are being promoted as a whole toolkit, as if they could do the same work as effectively as a hammer, power drill, table saw, vise, and wrench.

    It is truly terrible marketing. It's been obvious to me for years the value is in giving it to people and enabling them to do more with less, not outright replacing humans, especially not expert humans.

    I use AI/LLMs pretty much every day now. I write MCP servers and automate things with it and it's mind blowing how productive it makes me.

    Just today I used these tools in a highly supervised way to complete a task that would have been a full day of tedious work, all done in an hour. That is fucking fantastic; it means I get to spend that time on more important things.

    It's like giving an accountant excel. Excel isn't replacing them, but it's taking care of specific tasks so they can focus on better things.

    On the reliability and accuracy front there is still a lot to be desired, sure. But for supervised chats where it's calling my tools it's pretty damn good.

  • than reading an actual intro on an unfamiliar topic

    The LLM helps me know what to look for in order to find that unfamiliar topic.

    For example, I was tasked to support a file format that's common in a very niche field and never used elsewhere, and unfortunately shares an extension with a very common file format, so searching for useful data was nearly impossible. So I asked the LLM for details about the format and applications of it, provided what I knew, and it spat out a bunch of keywords that I then used to look up more accurate information about that file format. I only trusted the LLM output to the extent of finding related, industry-specific terms to search up better information.

    Likewise, when looking for libraries for a coding project, none really stood out, so I asked the LLM to compare the popular libraries for solving a given problem. The LLM spat out a bunch of details that were easy to verify (and some were inaccurate), which helped me narrow what I looked for in that library, and the end result was that my search was done in like 30 min (about 5 min dealing w/ LLM, and 25 min checking the projects and reading a couple blog posts comparing some of the libraries the LLM referred to).

    I think this use case is a fantastic use of LLMs, since they're really good at generating text related to a query.

    It’s going to say something plausible, and you tautologically are not in a position to verify it.

    I absolutely am though. If I am merely having trouble recalling a specific fact, asking the LLM to generate it is pretty reasonable. There are a ton of cases where I'll know the right answer when I see it, like it's on the tip of my tongue but I'm having trouble materializing it. The LLM might spit out two wrong answers along w/ the right one, but it's easy to recognize which is the right one.

    I'm not going to ask it facts that I know I don't know (e.g. some historical figure's birth or death date), that's just asking for trouble. But I'll ask it facts that I know that I know, I'm just having trouble recalling.

    The right use of LLMs, IMO, is to generate text related to a topic to help facilitate research. It's not great at doing the research though, but it is good at helping to formulate better search terms or generate some text to start from for whatever task.

    general search on the web?

    I agree, it's not great for general search. It's great for turning a nebulous question into better search terms.

    It's a bit frustrating that finding these tools useful is so often met with "it can't be useful for that," when it definitely is.

    More than any other tool in history, LLMs involve a huge dose of luck and a learning curve on how to ask the right things the right way. And those methods change and differ between models too.

  • I tried to dictate some documents recently without paying the big bucks for specialized software, and was surprised just how bad Google and Microsoft's speech recognition still is. Then I tried getting Word to transcribe some audio talks I had recorded, and that resulted in unreadable stuff with punctuation in all the wrong places. You could just about make out what it meant to say, so I tried asking various LLMs to tidy it up. That resulted in readable stuff that was largely made up and wrong, which also left out large chunks of the source material. In the end I just had to transcribe it all by hand.

    It surprised me that these AI-ish products are still unable to transcribe speech coherently or tidy up a messy document without changing the meaning.

    I don't know of any basic solutions that are super good, but I hear Whisper and the Whisper derivatives are decent for dictation these days.

    I have no idea how to run them though.
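
    For reference, a minimal sketch of running Whisper locally from Python. It assumes the open-source `openai-whisper` package is installed (`pip install openai-whisper`) and that ffmpeg is available on the PATH; the function name and the default model size are illustrative choices, not recommendations.

```python
def transcribe_file(audio_path: str, model_name: str = "base") -> str:
    """Transcribe one audio file with a locally run Whisper model."""
    # Imported lazily so this module loads even without the package installed.
    import whisper  # assumption: provided by `pip install openai-whisper`

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(audio_path)   # returns a dict with a "text" key
    return result["text"].strip()
```

    Calling `transcribe_file("talk.mp3")` (a hypothetical file name) would return the transcript as one string. Larger models are slower but tend to be more accurate, so the output still needs proofreading.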

  • It's a bit frustrating that finding these tools useful is so often met with "it can't be useful for that," when it definitely is.

    More than any other tool in history, LLMs involve a huge dose of luck and a learning curve on how to ask the right things the right way. And those methods change and differ between models too.

    And that's the same w/ traditional search engines, the difference is that we're used to search engines and LLMs are new. Learn how to use the tool and decide for yourself when it's useful.

  • One word of caution with AI search is that it's weirdly vulnerable to SEO.

    If you search for "best X for Y" and a company has an article on their blog about how their product solves a problem the AI can definitely summarize that into a "users don't like that foolib because of ...". At least that's been my experience looking for software vendors.

    Oh sure, caution is always warranted w/ LLMs. But when it works, it can save a ton of time.

  • Were you prone to these weird leaps of logic before your brain was fried by talking to LLMs, or did you start being a fan of talking to LLMs because your ability to logic was...well...that?

    You see, I wanted to be petty and do another dismissive reply, but instead I fed our convo to copilot and asked it to explain, here you go, as you can see I have previously used it for coding tasks, so I didn't feed it any extra info, so there you go, even copilot can understand the huge "leap" I made in logic. goddamn the sweet taste of irony.

    Copilot reply:

    Certainly! Here’s an explanation Person B could consider:

    The implied logic in Person A’s argument is that if you distrust code written by Copilot (or any AI tool) simply because it wasn’t written by you, then by the same reasoning, you should also distrust code written by junior developers, since that code also isn’t written by you and may have mistakes or lack experience.

    However, in real-world software development, teams regularly review, test, and maintain code written by others—including juniors, seniors, and even AI tools. The quality of code depends on review processes, testing, and collaboration, not just on who wrote it. Dismissing Copilot-generated code outright is similar to dismissing the contributions of junior developers, which isn’t practical or productive in a collaborative environment.

  • 😆 I can't believe how absolutely silly a lot of you sound with this.

    An LLM is a tool. Its output is dependent on the input. If that's the quality of answer you're getting, then it's a user error. I guarantee you that LLM answers for many problems are definitely adequate.

    It's like if a carpenter said the cabinets turned out shit because his hammer only produces crap.

    Also, another person commented that seeing the pattern you also see means we're psychotic.

    All I'm trying to suggest is Lemmy is getting seriously manipulated by the media attitude towards LLMs and these comments I feel really highlight that.

    If that’s the quality of answer you’re getting, then it’s a user error

    No, I know the data I gave it and I know how hard I tried to get it to use it truthfully.

    You have an irrational and wildly inaccurate belief in the infallibility of LLMs.

    You're also denying the evidence of my own experience. What on earth made you think I would believe you over what I saw with my own eyes?

  • and? we can understand 256 where AI can't, that's the point.

    The 256 thing was written by a person. AI doesn't have exclusive rights to being dumb, plenty of dumb people around.

  • For me as a software developer the accuracy is more in the 95%+ range.

    On one hand, the built-in Copilot chat widget in IntelliJ basically replaces a lot of my Google queries.

    On the other hand, it is rather fucking good at executing rewrites that are a fucking chore to do manually but that Copilot handles easily.

    Imagine you have a script that initializes your DB with some test data. You have an INSERT INTO statement with lots of columns and rows, like so:

    INSERT INTO table_name (column1, ..., columnN)
    VALUES (row 1),
           (row 2),
           (row n);

    Adding a new column with test data for each row is a PITA, but Copilot handles it without issue.

    Similarly, when writing unit tests you do a lot of edge-case testing, which means a bunch of nearly identical tests with maybe one variable changing. At most you write one of those tests; after you name the next unit test, Copilot auto-generates the rest. It's pretty good at guessing what you want to do in that test, at least with my naming scheme.

    So yeah, it's way overrated for many-many things, but for programming it's a pretty awesome productivity tool.

    For your database test data, I usually write a helper that defaults those columns to base values, so I can pass in lists of dictionaries, then the test cases are easier to modify and read.

    It's also nice because you only include the fields your unit test actually uses; the rest are valid defaults you don't need to care about.
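
    A minimal sketch of that kind of helper, using Python's built-in `sqlite3` with an in-memory database. The `users` table, its columns, the base values, and the `insert_rows` name are all hypothetical, made up for illustration:

```python
import sqlite3

# Hypothetical base values: every column gets a valid default.
USER_DEFAULTS = {"name": "Test User", "email": "test@example.com", "active": 1}


def insert_rows(conn, table, defaults, rows):
    """Insert rows, filling any column a row omits from `defaults`."""
    columns = list(defaults)
    unknown = {key for row in rows for key in row} - set(columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    placeholders = ", ".join("?" for _ in columns)
    # Table and column names come from test code here, never from user input.
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    values = [[{**defaults, **row}[col] for col in columns] for row in rows]
    conn.executemany(sql, values)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT, active INTEGER)")

# Each test row lists only the fields that matter to the test.
insert_rows(conn, "users", USER_DEFAULTS, [{"name": "Alice"}, {"active": 0}, {}])
stored = conn.execute(
    "SELECT name, email, active FROM users ORDER BY rowid"
).fetchall()
```

    Adding a new column then means adding one entry to the defaults dictionary instead of editing every row.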

  • Oh sure, caution is always warranted w/ LLMs. But when it works, it can save a ton of time.

    Definitely, I'm just trying to share a foot gun I've accidentally triggered myself!

  • So the chances of it being right ten times in a row are 2%.

    No, the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.
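
    To spell out the arithmetic, here is a small sketch. It assumes the headline's ~70% failure rate applies to every attempt and that attempts are independent, which is a simplification. Under those assumptions, ten wrong answers in a row comes out to roughly 2.8%, so the 2% and 98% above are rough roundings, and ten right answers in a row is far below 2%.

```python
p_wrong = 0.70  # assumption: the ~70% failure rate from the headline
attempts = 10

p_all_wrong = p_wrong ** attempts          # wrong ten times in a row
p_all_right = (1 - p_wrong) ** attempts    # right ten times in a row
p_at_least_one_right = 1 - p_all_wrong     # right at least once in ten tries

print(f"all wrong:          {p_all_wrong:.1%}")
print(f"all right:          {p_all_right:.5%}")
print(f"at least one right: {p_at_least_one_right:.1%}")
```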

  • Jan Refiner is up there for me.

    I just arrived at act 2, and he wasn't one of the four I've unlocked...

  • I think this comment made me finally understand the AI hate circlejerk on lemmy. If you have no clue how LLMs work and you have no idea where "AI" is coming from, it just looks like another crappy product that was thrown on the market half-ready. I guess you can only appreciate the absolutely incredible development of LLMs (and AI in general) that happened during the last ~5 years if you can actually see it in the first place.

    The notion that AI is half-ready is a really poignant observation actually. It's ready for select applications only, but it's really being advertised like it's idiot-proof and ready for general use.

  • You see, I wanted to be petty and do another dismissive reply, but instead I fed our convo to copilot and asked it to explain, here you go, as you can see I have previously used it for coding tasks, so I didn't feed it any extra info, so there you go, even copilot can understand the huge "leap" I made in logic. goddamn the sweet taste of irony.

    Copilot reply:

    Certainly! Here’s an explanation Person B could consider:

    The implied logic in Person A’s argument is that if you distrust code written by Copilot (or any AI tool) simply because it wasn’t written by you, then by the same reasoning, you should also distrust code written by junior developers, since that code also isn’t written by you and may have mistakes or lack experience.

    However, in real-world software development, teams regularly review, test, and maintain code written by others—including juniors, seniors, and even AI tools. The quality of code depends on review processes, testing, and collaboration, not just on who wrote it. Dismissing Copilot-generated code outright is similar to dismissing the contributions of junior developers, which isn’t practical or productive in a collaborative environment.

    You probably wanted to show off how smart you are, but instead you showed that you can't even talk to people without help of your favourite slop bucket.
    It didn't answer my curiosity about what came first, but it solidified my conviction that your brain is cooked all the way, probably beyond repair. I would say you need to seek professional help, but at this point you would interpret it as needing to talk to the autocomplete, and it will cook you even more.
    It started funny, but I feel very sorry for you now, and it sucked all the humour out.

  • How do I set up event driven document ingestion from OneDrive located on an Azure tenant to Amazon DocumentDB? Ingestion must be near-realtime, durable, and have some form of DLQ.

    @Shayeta
    You might have a look at for the ingress part
    @criss_cross

  • This post did not contain any content.

    I notice that the research didn't include DeepSeek. It would have been nice to see how it compares.

  • No, the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.

    don’t you dare understand the explicitly obvious reasons this technology can be useful and the essential differences between P and NP problems. why won’t you be angry 😠

  • I was 0/6 on various trials of AI for Rust over the past 6 months, then I caught a success. Turns out, I was asking it to use a difficult library; I can't make the thing I want work in that library either (library docs say it's possible, but...). When I posed a more open-ended request without specifying the library to use, it succeeded, after a fashion. It will give you code with cargo build errors; I copy-paste the error back to it like "address: <pasted error message>", and a bit more than half of the time it is able to respond with a working fix.

    i find that rust’s architecture and design decisions give the LLM quite good guardrails and kind of keep it from doing anything too wonky. the issue arises in cases like these where the rust ecosystem is quite young and documentation/instruction can be poor, even for a human developer.

    i think rust actually is quite well suited to agentic development workflows, it just needs to mature more.
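
    The copy-paste loop described above can be partly scripted. This is only a sketch: `build_feedback` is a hypothetical helper name, the "address:" prefix mirrors the prompt quoted in the post, and a stand-in command replaces `cargo build` so the example runs without a Rust toolchain.

```python
import subprocess
import sys


def build_feedback(command, max_chars=4000):
    """Run a build command; return None on success, else a prompt with the errors."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return None
    # cargo writes compiler diagnostics to stderr; keep only the tail if it is long.
    errors = result.stderr.strip()[-max_chars:]
    return f"address: {errors}"


# Stand-ins for a failing and a passing `cargo build`.
failing = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('error[E0308]: mismatched types'); sys.exit(101)",
]
passing = [sys.executable, "-c", "pass"]

prompt = build_feedback(failing)
ok = build_feedback(passing)
```

    In a real project the call would be `build_feedback(["cargo", "build"])`, and the returned string is what gets pasted back into the chat.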
