AI agents wrong ~70% of time: Carnegie Mellon study

  • I think it's lemmy users. I see a lot more LLM skepticism here than in the news feeds.

    In my experience, LLMs are like the laziest, shittiest know-nothing bozo forced to complete a task with zero attention to detail and zero care about whether it's crap, just doing enough to sound convincing.

    😆 I can't believe how absolutely silly a lot of you sound with this.

    An LLM is a tool. Its output is dependent on the input. If that's the quality of answer you're getting, then it's a user error. I guarantee you that LLM answers for many problems are definitely adequate.

    It's like if a carpenter said the cabinets turned out shit because his hammer only produces crap.

    Also, another person commented that seeing the pattern you also see means we're psychotic.

    All I'm trying to suggest is Lemmy is getting seriously manipulated by the media attitude towards LLMs and these comments I feel really highlight that.

  • LLMs are like a multitool, they can do lots of easy things mostly fine as long as it is not complicated and doesn't need to be exactly right. But they are being promoted as a whole toolkit as if they are able to be used to do the same work as effectively as a hammer, power drill, table saw, vise, and wrench.

    It is truly terrible marketing. It's been obvious to me for years the value is in giving it to people and enabling them to do more with less, not outright replacing humans, especially not expert humans.

    I use AI/LLMs pretty much every day now. I write MCP servers and automate things with it and it's mind blowing how productive it makes me.

    Just today I used these tools in a highly supervised way to complete a task that would have been a full day of tedious work, all done in an hour. That is fucking fantastic, it means I get to spend that time on more important things.

    It's like giving an accountant excel. Excel isn't replacing them, but it's taking care of specific tasks so they can focus on better things.

    On the reliability and accuracy front there is still a lot to be desired, sure. But for supervised chats where it's calling my tools it's pretty damn good.

  • than reading an actual intro on an unfamiliar topic

    The LLM helps me know what to look for in order to find that unfamiliar topic.

    For example, I was tasked to support a file format that's common in a very niche field and never used elsewhere, and unfortunately shares an extension with a very common file format, so searching for useful data was nearly impossible. So I asked the LLM for details about the format and applications of it, provided what I knew, and it spat out a bunch of keywords that I then used to look up more accurate information about that file format. I only trusted the LLM output to the extent of finding related, industry-specific terms to search up better information.

    Likewise, when looking for libraries for a coding project, none really stood out, so I asked the LLM to compare the popular libraries for solving a given problem. The LLM spat out a bunch of details that were easy to verify (and some were inaccurate), which helped me narrow what I looked for in that library, and the end result was that my search was done in like 30 min (about 5 min dealing w/ LLM, and 25 min checking the projects and reading a couple blog posts comparing some of the libraries the LLM referred to).

    I think this use case is a fantastic use of LLMs, since they're really good at generating text related to a query.

    It’s going to say something plausible, and you tautologically are not in a position to verify it.

    I absolutely am though. If I am merely having trouble recalling a specific fact, asking the LLM to generate it is pretty reasonable. There are a ton of cases where I'll know the right answer when I see it, like it's on the tip of my tongue but I'm having trouble materializing it. The LLM might spit out two wrong answers along w/ the right one, but it's easy to recognize which is the right one.

    I'm not going to ask it facts that I know I don't know (e.g. some historical figure's birth or death date), that's just asking for trouble. But I'll ask it facts that I know that I know, I'm just having trouble recalling.

    The right use of LLMs, IMO, is to generate text related to a topic to help facilitate research. It's not great at doing the research though, but it is good at helping to formulate better search terms or generate some text to start from for whatever task.

    general search on the web?

    I agree, it's not great for general search. It's great for turning a nebulous question into better search terms.

    It's a bit frustrating that finding these tools useful is so often met with it can't be useful for that, when it definitely is.

    More than any other tool in history, LLMs have a huge dose of luck involved and a learning curve on how to ask the right things the right way. And those methods change and differ between models too.

    One word of caution with AI search is that it's weirdly vulnerable to SEO.

    If you search for "best X for Y" and a company has an article on their blog about how their product solves a problem the AI can definitely summarize that into a "users don't like that foolib because of ...". At least that's been my experience looking for software vendors.

  • I tried to dictate some documents recently without paying the big bucks for specialized software, and was surprised just how bad Google and Microsoft's speech recognition still is. Then I tried getting Word to transcribe some audio talks I had recorded, and that resulted in unreadable stuff with punctuation in all the wrong places. You could just about make out what it meant to say, so I tried asking various LLMs to tidy it up. That resulted in readable stuff that was largely made up and wrong, which also left out large chunks of the source material. In the end I just had to transcribe it all by hand.

    It surprised me that these AI-ish products are still unable to transcribe speech coherently or tidy up a messy document without changing the meaning.

    I don't know of basic solutions that are super good, but I hear Whisper and the Whisper derivatives are decent for dictation these days.

    I have no idea how to run them though.

  • It's a bit frustrating that finding these tools useful is so often met with it can't be useful for that, when it definitely is.

    More than any other tool in history, LLMs have a huge dose of luck involved and a learning curve on how to ask the right things the right way. And those methods change and differ between models too.

    And that's the same w/ traditional search engines, the difference is that we're used to search engines and LLMs are new. Learn how to use the tool and decide for yourself when it's useful.

  • One word of caution with AI search is that it's weirdly vulnerable to SEO.

    If you search for "best X for Y" and a company has an article on their blog about how their product solves a problem the AI can definitely summarize that into a "users don't like that foolib because of ...". At least that's been my experience looking for software vendors.

    Oh sure, caution is always warranted w/ LLMs. But when it works, it can save a ton of time.

  • Were you prone to these weird leaps of logic before your brain was fried by talking to LLMs, or did you start being a fan of talking to LLMs because your ability to logic was...well...that?

    You see, I wanted to be petty and do another dismissive reply, but instead I fed our convo to copilot and asked it to explain, here you go, as you can see I have previously used it for coding tasks, so I didn't feed it any extra info, so there you go, even copilot can understand the huge "leap" I made in logic. goddamn the sweet taste of irony.

    Copilot reply:

    Certainly! Here’s an explanation Person B could consider:

    The implied logic in Person A’s argument is that if you distrust code written by Copilot (or any AI tool) simply because it wasn’t written by you, then by the same reasoning, you should also distrust code written by junior developers, since that code also isn’t written by you and may have mistakes or lack experience.

    However, in real-world software development, teams regularly review, test, and maintain code written by others—including juniors, seniors, and even AI tools. The quality of code depends on review processes, testing, and collaboration, not just on who wrote it. Dismissing Copilot-generated code outright is similar to dismissing the contributions of junior developers, which isn’t practical or productive in a collaborative environment.

  • If that’s the quality of answer you’re getting, then it’s a user error

    No, I know the data I gave it and I know how hard I tried to get it to use it truthfully.

    You have an irrational and wildly inaccurate belief in the infallibility of LLMs.

    You're also denying the evidence of my own experience. What on earth made you think I would believe you over what I saw with my own eyes?

  • and? we can understand 256 where AI can't, that's the point.

    The 256 thing was written by a person. AI doesn't have exclusive rights to being dumb, plenty of dumb people around.

  • So the chances of it being right ten times in a row are 2%.

  • For me as a software developer the accuracy is more in the 95%+ range.

    On one hand, the built-in Copilot chat widget in IntelliJ basically replaces a lot of my Google queries.

    On the other hand, it is rather fucking good at executing some rewrites that are a fucking chore to do manually, but can easily be done by Copilot.

    Imagine you have a script that initializes your DB with some test data. You have an INSERT INTO statement with lots of columns and rows, so

    INSERT INTO some_table (column1, ..., column_n)
    VALUES (row 1),
           (row 2),
           (row n);

    Adding a new column with test data for each row is a PITA, but Copilot handles it without issue.

    Similarly, when writing unit tests you do a lot of edge-case testing, which is a bunch of almost identical tests with maybe one variable changing. At most you write one of those tests, then Copilot will auto-generate the rest after you name the next unit test. It's pretty good at guessing what you want to do in that test, at least with my naming scheme.

    So yeah, it's way overrated for many-many things, but for programming it's a pretty awesome productivity tool.

    For your database test data, I usually write a helper that defaults those columns to base values, so I can pass in lists of dictionaries, then the test cases are easier to modify and read.

    It's also nice because you're only including the fields you use in your unit test; the rest are valid defaults you don't need to care about.
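A minimal sketch of the kind of helper described above, using Python's built-in sqlite3. The `users` table, its columns, and the names `DEFAULT_USER`, `make_test_db`, and `insert_users` are all hypothetical, picked only for illustration; the commenter did not say which language or database they use.

```python
import sqlite3

# Hypothetical base row: every column gets a valid default value.
DEFAULT_USER = {
    "name": "Test User",
    "email": "test@example.com",
    "age": 30,
    "active": 1,
}


def make_test_db():
    """Create an in-memory database with the hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (name TEXT, email TEXT, age INTEGER, active INTEGER)"
    )
    return conn


def insert_users(conn, rows):
    """Insert one row per dict, filling unspecified columns from DEFAULT_USER."""
    columns = list(DEFAULT_USER)
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO users ({', '.join(columns)}) VALUES ({placeholders})"
    for overrides in rows:
        unknown = set(overrides) - set(DEFAULT_USER)
        if unknown:
            raise KeyError(f"unknown columns: {sorted(unknown)}")
        row = {**DEFAULT_USER, **overrides}  # overrides win over defaults
        conn.execute(sql, [row[c] for c in columns])


# Each test case lists only the fields it actually cares about.
conn = make_test_db()
insert_users(conn, [{"age": 17}, {"name": "Inactive", "active": 0}])
rows = conn.execute("SELECT name, age, active FROM users ORDER BY rowid").fetchall()
print(rows)
```

With this shape, adding a new column means adding one key to the defaults (plus the schema change) instead of editing every row of a long INSERT statement.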

  • Oh sure, caution is always warranted w/ LLMs. But when it works, it can save a ton of time.

    Definitely, I'm just trying to share a foot gun I've accidentally triggered myself!

  • So the chances of it being right ten times in a row are 2%.

    No the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.
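To check the arithmetic in this exchange, here is a small sketch. It assumes each attempt is independent and wrong 70% of the time (the headline figure); both are simplifying assumptions, and the function name `streak_probabilities` is made up for this example.

```python
def streak_probabilities(p_wrong, attempts):
    """Return streak probabilities for independent attempts with a fixed error rate."""
    p_right = 1 - p_wrong
    all_right = p_right ** attempts   # right every single time
    all_wrong = p_wrong ** attempts   # wrong every single time
    return {
        "all_right": all_right,
        "all_wrong": all_wrong,
        "at_least_one_right": 1 - all_wrong,
    }


result = streak_probabilities(p_wrong=0.70, attempts=10)
for name, value in result.items():
    print(f"{name}: {value:.6%}")
```

Under those assumptions, being wrong ten times in a row comes out at about 2.8%, so being right at least once is about 97.2%, close to the 2% and 98% quoted above. Being right ten times in a row is far rarer, roughly 0.0006%.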

  • Jan Refiner is up there for me.

    I just arrived at act 2, and he wasn't one of the four I've unlocked...

  • I think this comment made me finally understand the AI hate circlejerk on lemmy. If you have no clue how LLMs work and you have no idea where "AI" is coming from, it just looks like another crappy product that was thrown on the market half-ready. I guess you can only appreciate the absolutely incredible development of LLMs (and AI in general) that happened during the last ~5 years if you can actually see it in the first place.

    The notion that AI is half-ready is a really poignant observation actually. It's ready for select applications only, but it's really being advertised like it's idiot-proof and ready for general use.

  • You probably wanted to show off how smart you are, but instead you showed that you can't even talk to people without help of your favourite slop bucket.
    It didn't answer my curiosity about what came first, but it solidified my conviction that your brain is cooked all the way, probably beyond repair. I would say you need to seek professional help, but at this point you would interpret it as needing to talk to the autocomplete, and it will cook you even more.
    It started funny, but I feel very sorry for you now, and it sucked all the humour out.

  • How do I set up event driven document ingestion from OneDrive located on an Azure tenant to Amazon DocumentDB? Ingestion must be near-realtime, durable, and have some form of DLQ.

    @Shayeta
    You might have a look at for the ingress part
    @criss_cross

  • This post did not contain any content.

    I notice that the research didn't include DeepSeek. It would have been nice to see how it compares.

  • No the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.

    don’t you dare understand the explicitly obvious reasons this technology can be useful and the essential differences between P and NP problems. why won’t you be angry 😠
