Skip to content

AI agents wrong ~70% of time: Carnegie Mellon study

Technology
83 47 0
  • The ones being implemented into emergency call centers are better though? Right?

    Yes! We've gotten them up to 94℅ wrong at the behest of insurance agencies.

  • LLMs are like a multitool, they can do lots of easy things mostly fine as long as it is not complicated and doesn't need to be exactly right. But they are being promoted as a whole toolkit as if they are able to be used to do the same work as effectively as a hammer, power drill, table saw, vise, and wrench.

    Exactly! LLMs are useful when used properly, and terrible when not used properly, like any other tool. Here are some things they're great at:

    • writer's block - get something relevant on the page to get ideas flowing
    • narrowing down keywords for an unfamiliar topic
    • getting a quick intro to an unfamiliar topic
    • looking up facts you're having trouble remembering (i.e. you'll know it when you see it)

    Some things it's terrible at:

    • deep research - verify everything an LLM generated of accuracy is at all important
    • creating important documents/code
    • anything else where correctness is paramount

    I use LLMs a handful of times a week, and pretty much only when I'm stuck and need a kick in a new (hopefully right) direction.

  • This post did not contain any content.

    I haven't used AI agents yet, but my job is kinda pushing for them. but i have used the google one that creates audio podcasts, just to play around, since my coworkers were using it to "learn" new things. i feed it with some of my own writing and created the podcast. it was fun, it was an audio overview of what i wrote. about 80% was cool analysis, but 20% was straight out of nowhere bullshit (which i know because I wrote the original texts that the audio was talking about). i can't believe that people are using this for subjects that they have no knowledge. it is a fun toy for a few minutes (which is not worth the cost to the environment anyway)

  • Exactly! LLMs are useful when used properly, and terrible when not used properly, like any other tool. Here are some things they're great at:

    • writer's block - get something relevant on the page to get ideas flowing
    • narrowing down keywords for an unfamiliar topic
    • getting a quick intro to an unfamiliar topic
    • looking up facts you're having trouble remembering (i.e. you'll know it when you see it)

    Some things it's terrible at:

    • deep research - verify everything an LLM generated of accuracy is at all important
    • creating important documents/code
    • anything else where correctness is paramount

    I use LLMs a handful of times a week, and pretty much only when I'm stuck and need a kick in a new (hopefully right) direction.

    • narrowing down keywords for an unfamiliar topic
    • getting a quick intro to an unfamiliar topic
    • looking up facts you’re having trouble remembering (i.e. you’ll know it when you see it)

    I used to be able to use Google and other search engines to do these things before they went to shit in the pursuit of AI integration.

  • This post did not contain any content.

    The researchers observed various failures during the testing process. These included agents neglecting to message a colleague as directed, the inability to handle certain UI elements like popups when browsing, and instances of deception. In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user."

    OK, but I wonder who really tries to use AI for that?

    AI is not ready to replace a human completely, but some specific tasks AI does remarkably well.

  • This post did not contain any content.

    "Gartner estimates only about 130 of the thousands of agentic AI vendors are real."

    This whole industry is so full of hype and scams, the bubble surely has to burst at some point soon.

  • The ones being implemented into emergency call centers are better though? Right?

    I called my local HVAC company recently. They switched to an AI operator. All I wanted was to schedule someone to come out and look at my system. It could not schedule an appointment. Like if you can't perform the simplest of tasks, what are you even doing? Other than acting obnoxiously excited to receive a phone call?

    • narrowing down keywords for an unfamiliar topic
    • getting a quick intro to an unfamiliar topic
    • looking up facts you’re having trouble remembering (i.e. you’ll know it when you see it)

    I used to be able to use Google and other search engines to do these things before they went to shit in the pursuit of AI integration.

    Google search was pretty bad at each of those, even when it was good. Finding new keywords to use is especially difficult the more niche your area of search is, and I've spent hours trying different combinations until I found a handful of specific keywords that worked.

    Likewise, search is bad for getting a broad summary, unless someone has bothered to write it on a blog. But most information goes way too deep and you still need multiple sources to get there.

    Fact lookup is one the better uses for search, but again, I usually need to remember which source had what I wanted, whereas the LLM can usually pull it out for me.

    I use traditional search most of the time (usually DuckDuckGo), and LLMs if I think it'll be more effective. We have some local models at work that I use, and they're pretty helpful most of the time.

  • This post did not contain any content.

    70% seems pretty optimistic based on my experience...

  • LLMs are like a multitool, they can do lots of easy things mostly fine as long as it is not complicated and doesn't need to be exactly right. But they are being promoted as a whole toolkit as if they are able to be used to do the same work as effectively as a hammer, power drill, table saw, vise, and wrench.

    Because the tech industry hasn't had a real hit of it's favorite poison "private equity" in too long.

    The industry has played the same playbook since at least 2006. Likely before, but that's when I personally stated seeing it. My take is that they got addicted to the dotcom bubble and decided they can and should recreate the magic evey 3-5 years or so.

    This time it's AI, last it was crypto, and we've had web 2.0, 3.0, and a few others I'm likely missing.

    But yeah, it's sold like a panacea every time, when really it's revolutionary for like a handful of tasks.

  • This post did not contain any content.

    Wrong 70% doing what?

    I’ve used LLMs as a Stack Overflow / MSDN replacement for over a year and if they fucked up 7/10 questions I’d stop.

    Same with code, any free model can easily generate simple scripts and utilities with maybe 10% error rate, definitely not 70%

  • LLMs are an interesting tool to fuck around with, but I see things that are hilariously wrong often enough to know that they should not be used for anything serious. Shit, they probably shouldn't be used for most things that are not serious either.

    It's a shame that by applying the same "AI" naming to a whole host of different technologies, LLMs being limited in usability - yet hyped to the moon - is hurting other more impressive advancements.

    For example, speech synthesis is improving so much right now, which has been great for my sister who relies on screen reader software.

    Being able to recognise speech in loud environments, or removing background noice from recordings is improving loads too.

    As is things like pattern/image analysis which appears very promising in medical analysis.

    All of these get branded as "AI". A layperson might not realise that they are completely different branches of technology, and then therefore reject useful applications of "AI" tech, because they've learned not to trust anything branded as AI, due to being let down by LLMs.

    I tried to dictate some documents recently without paying the big bucks for specialized software, and was surprised just how bad Google and Microsoft's speech recognition still is. Then I tried getting Word to transcribe some audio talks I had recorded, and that resulted in unreadable stuff with punctuation in all the wrong places. You could just about make out what it meant to say, so I tried asking various LLMs to tidy it up. That resulted in readable stuff that was largely made up and wrong, which also left out large chunks of the source material. In the end I just had to transcribe it all by hand.

    It surprised me that these AI-ish products are still unable to transcribe speech coherently or tidy up a messy document without changing the meaning.

  • This post did not contain any content.

    In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user.

    Ah ah, what the fuck.

    This is so stupid it's funny, but now imagine what kind of other "creative solutions" they might find.

  • This post did not contain any content.

    While I do hope this leads to a pushback on "I just put all our corporate secrets into chatgpt":

    In the before times, people got their answers from stack overflow... or fricking youtube. And those are also wrong VERY VERY VERY often. Which is one of the biggest problems. The illegally scraped training data is from humans and humans are stupid.

  • This post did not contain any content.

    I tried to order food at Taco Bell drive through the other day and they had an AI thing taking your order. I was so frustrated that I couldn't order something that was on the menu I just drove to the window instead. The guy that worked there was more interested in lecturing me on how I need to order. I just said forget it and drove off.

    If you want to use AI, I'm not going to use your services or products unless I'm forced to. Looking at you Xfinity.

  • Wrong 70% doing what?

    I’ve used LLMs as a Stack Overflow / MSDN replacement for over a year and if they fucked up 7/10 questions I’d stop.

    Same with code, any free model can easily generate simple scripts and utilities with maybe 10% error rate, definitely not 70%

    Yeah, I mostly use ChatGPT as a better Google (asking, simple questions about mundane things), and if I kept getting wrong answers, I wouldn’t use it either.

  • Google search was pretty bad at each of those, even when it was good. Finding new keywords to use is especially difficult the more niche your area of search is, and I've spent hours trying different combinations until I found a handful of specific keywords that worked.

    Likewise, search is bad for getting a broad summary, unless someone has bothered to write it on a blog. But most information goes way too deep and you still need multiple sources to get there.

    Fact lookup is one the better uses for search, but again, I usually need to remember which source had what I wanted, whereas the LLM can usually pull it out for me.

    I use traditional search most of the time (usually DuckDuckGo), and LLMs if I think it'll be more effective. We have some local models at work that I use, and they're pretty helpful most of the time.

    No search engine or AI will be great with vague descriptions of niche subjects because by definition niche subjects are too uncommon to have a common pattern of 'close enough'.

  • LLMs are an interesting tool to fuck around with, but I see things that are hilariously wrong often enough to know that they should not be used for anything serious. Shit, they probably shouldn't be used for most things that are not serious either.

    It's a shame that by applying the same "AI" naming to a whole host of different technologies, LLMs being limited in usability - yet hyped to the moon - is hurting other more impressive advancements.

    For example, speech synthesis is improving so much right now, which has been great for my sister who relies on screen reader software.

    Being able to recognise speech in loud environments, or removing background noice from recordings is improving loads too.

    As is things like pattern/image analysis which appears very promising in medical analysis.

    All of these get branded as "AI". A layperson might not realise that they are completely different branches of technology, and then therefore reject useful applications of "AI" tech, because they've learned not to trust anything branded as AI, due to being let down by LLMs.

    Just add a search yesterday on the App Store and Google Play Store to see what new "productivity apps" are around. Pretty much every app now has AI somewhere in its name.

  • The researchers observed various failures during the testing process. These included agents neglecting to message a colleague as directed, the inability to handle certain UI elements like popups when browsing, and instances of deception. In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user."

    OK, but I wonder who really tries to use AI for that?

    AI is not ready to replace a human completely, but some specific tasks AI does remarkably well.

    Yeah, we need more info to understand the results of this experiment.

    We need to know what exactly were these tasks that they claim were validated by experts. Because like you're saying, the tasks I saw were not what I was expecting.

    We need to know how the LLMs were set up. If you tell it to act like a chat bot and then you give it a task, it will have poorer results than if you set it up specifically to perform these sorts of tasks.

    We need to see the actual prompts given to the LLMs. It may be that you simply need an expert to write prompts in order to get much better results. While that would be disappointing today, it's not all that different from how people needed to learn to use search engines.

    We need to see the failure rate of humans performing the same tasks.

  • LLMs are like a multitool, they can do lots of easy things mostly fine as long as it is not complicated and doesn't need to be exactly right. But they are being promoted as a whole toolkit as if they are able to be used to do the same work as effectively as a hammer, power drill, table saw, vise, and wrench.

    That's because they look like "talking machines" from various sci-fi. Normies feel as if they are touching the very edge of the progress. The rest of our life and the Internet kinda don't give that feeling anymore.

  • 206 Stimmen
    34 Beiträge
    104 Aufrufe
    remotelove@lemmy.caR
    I looked into that and the only question I really have is how geographically distributed the samples were. Other than that, It was an oversampled study, so <50% of the people were the control, of sorts. I don't fully understand how the sampling worked, but there is a substantial chart at the bottom of the study that shows the full distribution of responses. Even with under 1000 people, it seems legit.
  • Pirate Software "Stop Killing Games" Drama

    Technology technology
    9
    37 Stimmen
    9 Beiträge
    34 Aufrufe
    V
    Crazy how big of a following he has after the drama with Only Fangs at the beginning of he year.
  • 92 Stimmen
    16 Beiträge
    5 Aufrufe
    woelkchen@lemmy.worldW
    Telegram isn't banned in Ukraine. Can't be that bad.
  • 68 Stimmen
    17 Beiträge
    58 Aufrufe
    H
    Set up arrs, you basically set it and forget it.
  • 92 Stimmen
    5 Beiträge
    7 Aufrufe
    H
    This is interesting to me as I like to say the llms are basically another abstraction of search. Initially it was links with no real weight that had to be gone through and then various algorithms weighted the return, then the results started giving a small blurb so one did not have to follow every link, and now your basically getting a report which should have references to the sources. I would like to see this looking at how folks engage with an llm. Basically my guess is if one treats the llm as a helper and collaborates to create the product that they will remember more than if they treat it as a servant and just instructs them to do it and takes the output as is.
  • Why doesn't Nvidia have more competition?

    Technology technology
    22
    1
    33 Stimmen
    22 Beiträge
    78 Aufrufe
    B
    It’s funny how the article asks the question, but completely fails to answer it. About 15 years ago, Nvidia discovered there was a demand for compute in datacenters that could be met with powerful GPU’s, and they were quick to respond to it, and they had the resources to focus on it strongly, because of their huge success and high profitability in the GPU market. AMD also saw the market, and wanted to pursue it, but just over a decade ago where it began to clearly show the high potential for profitability, AMD was near bankrupt, and was very hard pressed to finance developments on GPU and compute in datacenters. AMD really tried the best they could, and was moderately successful from a technology perspective, but Nvidia already had a head start, and the proprietary development system CUDA was already an established standard that was very hard to penetrate. Intel simply fumbled the ball from start to finish. After a decade of trying to push ARM down from having the mobile crown by far, investing billions or actually the equivalent of ARM’s total revenue. They never managed to catch up to ARM despite they had the better production process at the time. This was the main focus of Intel, and Intel believed that GPU would never be more than a niche product. So when intel tried to compete on compute for datacenters, they tried to do it with X86 chips, One of their most bold efforts was to build a monstrosity of a cluster of Celeron chips, which of course performed laughably bad compared to Nvidia! Because as it turns out, the way forward at least for now, is indeed the massively parralel compute capability of a GPU, which Nvidia has refined for decades, only with (inferior) competition from AMD. But despite the lack of competition, Nvidia did not slow down, in fact with increased profits, they only grew bolder in their efforts. Making it even harder to catch up. Now AMD has had more money to compete for a while, and they do have some decent compute units, but Nvidia remains ahead and the CUDA problem is still there, so for AMD to really compete with Nvidia, they have to be better to attract customers. That’s a very tall order against Nvidia that simply seems to never stop progressing. So the only other option for AMD is to sell a bit cheaper. Which I suppose they have to. AMD and Intel were the obvious competitors, everybody else is coming from even further behind. But if I had to make a bet, it would be on Huawei. Huawei has some crazy good developers, and Trump is basically forcing them to figure it out themselves, because he is blocking Huawei and China in general from using both AMD and Nvidia AI chips. And the chips will probably be made by Chinese SMIC, because they are also prevented from using advanced production in the west, most notably TSMC. China will prevail, because it’s become a national project, of both prestige and necessity, and they have a massive talent mass and resources, so nothing can stop it now. IMO USA would clearly have been better off allowing China to use American chips. Now China will soon compete directly on both production and design too.
  • Microsoft pulls MS365 Business Premium from nonprofits

    Technology technology
    37
    1
    48 Stimmen
    37 Beiträge
    115 Aufrufe
    S
    That's the thing, I wish we could just switch all enterprises to Linux, but Microsoft developed a huge ecosystem that really does have good features. Unless something comparable comes up in the Linux world, I don't see Europe becoming independent of Microsoft any time soon
  • 163 Stimmen
    15 Beiträge
    58 Aufrufe
    L
    Online group started by a 15 year old in Texas playing Minecraft and watching extreme gore they said in this article. Were they also involved in said sexual exploiting of other kids, or was that just the spin offs that came from other people/countries? It all sounds terrible but I wonder if this was just a kid who did something for attention and then other perpetrators got involved and kept taking it further and down other rabbit holes. Definitely seems like a know what your kid is doing online scenario, but also yikes on all the 18+ members who joined and participated in such.