
AI agents wrong ~70% of time: Carnegie Mellon study

  • This post did not contain any content.

    We have created the overconfident intern in digital form.

  • When LLMs get it right, it's because they're summarizing a Stack Overflow or GitHub snippet they were trained on. But you lose all the benefits of other humans commenting on the context, the pitfalls, and the alternatives.

    You’re not wrong, but often I’m just trying to do something I’ve done a thousand times before and I already know the pitfalls. Also, I’m sure I’ve copied code from Stack Overflow before.

  • This post did not contain any content.

    Hey I went there

  • people like you misrepresenting LLMs as mere statistical word generators without intelligence.

    You've bought into the hype. I won't try to argue with you, because you aren't cognizant of reality.

    You're projecting. Every accusation is a confession.

  • Have you tried insulting the AI in the system prompt (as well as other tunes to the system prompt)?

    I'm not joking, it really works

    For example:

    Instead of "You are an intelligent coding assistant..."

    "You are an absolute fucking idiot who can barely code..."

    “You are an absolute fucking idiot who can barely code…”

    Honestly, that's what you have to do. It's the only way I can get through using Claude.ai. I treat it like it's an absolute moron, I insult it, I "yell" at it, I threaten it, and guess what? The solutions have gotten better. Not great, but a hell of a lot better than they used to be. It really works. It forces it to really think through the problem, research solutions, cite sources, etc. I have even told it I'll cancel my subscription if it gets it wrong.

    No more "do this and this and then this, but do this first and then do this." After calling it a "fucking moron" and what have you, it will provide an answer and just say "done."
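
    For what it's worth, the insulting system prompt is a one-line change on the API call. Here is a minimal sketch using the Anthropic Python SDK, since the comment is about Claude; the model name and both prompt strings are placeholders, not a recommendation:

    ```python
    # Sketch of swapping the system prompt, per the comment above.
    # Assumes ANTHROPIC_API_KEY is set; the model name is a placeholder alias.
    import anthropic

    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        # Instead of "You are an intelligent coding assistant..."
        system="You are an absolute fucking idiot who can barely code...",
        messages=[{"role": "user", "content": "Fix the bug in this function: ..."}],
    )
    print(response.content[0].text)
    ```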

  • “You are an absolute fucking idiot who can barely code…”

    Honestly, that's what you have to do. It's the only way I can get through using Claude.ai. I treat it like it's an absolute moron, I insult it, I "yell" at it, I threaten it, and guess what? The solutions have gotten better. Not great, but a hell of a lot better than they used to be. It really works. It forces it to really think through the problem, research solutions, cite sources, etc. I have even told it I'll cancel my subscription if it gets it wrong.

    No more "do this and this and then this, but do this first and then do this." After calling it a "fucking moron" and what have you, it will provide an answer and just say "done."

    This guy is the moral lesson at the start of the apocalypse movie

  • This post did not contain any content.

    This is the same kind of short-sighted dismissal I see a lot in the religion vs science argument. When they hinge their pro-religion stance on the things science can’t explain, they’re defending an ever diminishing territory as science grows to explain more things. It’s a stupid strategy with an expiration date on your position.

    All of the anti-AI positions that hinge on the low quality or reliability of the output are defending an increasingly diminished stance as the AIs are further refined. And I simply don’t believe that the majority of the people making this argument actually care about the quality of the output. Even when it gets to the point of producing better output than humans across the board, these folks are still going to oppose it regardless. Why not just openly oppose it in general, instead of pinning your position to an argument that grows increasingly irrelevant by the day?

    DeepSeek exposed the same issue with the anti-AI people dedicated to the environmental argument. We were shown proof that there’s significant progress in the development of efficient models, and it still didn’t change any of their minds. Because most of them don’t actually care about the environmental impacts. It’s just an anti-AI talking point that resonated with them.

    The more baseless these anti-AI stances get, the more it seems to me that it’s a lot of people afraid of change and afraid of the fundamental economic shifts this will require, but they’re embarrassed or unable to articulate that stance. And it doesn’t help that the luddites haven’t been able to predict a single development. Just constantly flailing to craft a new argument to criticize the current models and tech. People are learning not to take these folks seriously.

  • Have you tried insulting the AI in the system prompt (as well as other tunes to the system prompt)?

    I'm not joking, it really works

    For example:

    Instead of "You are an intelligent coding assistant..."

    "You are an absolute fucking idiot who can barely code..."

    I frequently find myself prompting it: "now show me the whole program with all the errors corrected." Sometimes I have to ask that two or three times, in different ways, before it coughs up the next iteration ready to copy-paste-test. Most times when it gives errors I'll just write "address: " and paste the error message in; frequently the text of the AI response will apologize, less frequently it will actually fix the error.
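
    That "address: <error>" move is mechanical enough to automate. A rough sketch of the loop follows; `ask_llm` is a hypothetical stub standing in for whatever chat API you use, not a real library call:

    ```python
    import subprocess

    def ask_llm(prompt: str) -> str:
        """Hypothetical helper: wrap your chat API of choice and return
        the model's code-only reply."""
        raise NotImplementedError

    def fix_until_it_runs(source: str, max_rounds: int = 5) -> str:
        """Feed runtime errors back to the model until the program runs."""
        for _ in range(max_rounds):
            with open("candidate.py", "w") as f:
                f.write(source)
            result = subprocess.run(
                ["python", "candidate.py"], capture_output=True, text=True
            )
            if result.returncode == 0:
                return source  # runs cleanly; still worth reviewing by hand
            # Same move as the comment above: "address: " plus the raw error.
            source = ask_llm(f"address: {result.stderr}\n\n{source}")
        raise RuntimeError("model never produced a runnable program")
    ```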

  • This guy is the moral lesson at the start of the apocalypse movie

    He's developing a toxic relationship with his AI agent. I don't think it's the best way to get what you want (demonstrating how to be abusive to the AI), but maybe it's the only method he is capable of getting results with.

  • This is the same kind of short-sighted dismissal I see a lot in the religion vs science argument. When they hinge their pro-religion stance on the things science can’t explain, they’re defending an ever diminishing territory as science grows to explain more things. It’s a stupid strategy with an expiration date on your position.

    All of the anti-AI positions that hinge on the low quality or reliability of the output are defending an increasingly diminished stance as the AIs are further refined. And I simply don’t believe that the majority of the people making this argument actually care about the quality of the output. Even when it gets to the point of producing better output than humans across the board, these folks are still going to oppose it regardless. Why not just openly oppose it in general, instead of pinning your position to an argument that grows increasingly irrelevant by the day?

    DeepSeek exposed the same issue with the anti-AI people dedicated to the environmental argument. We were shown proof that there’s significant progress in the development of efficient models, and it still didn’t change any of their minds. Because most of them don’t actually care about the environmental impacts. It’s just an anti-AI talking point that resonated with them.

    The more baseless these anti-AI stances get, the more it seems to me that it’s a lot of people afraid of change and afraid of the fundamental economic shifts this will require, but they’re embarrassed or unable to articulate that stance. And it doesn’t help that the luddites haven’t been able to predict a single development. Just constantly flailing to craft a new argument to criticize the current models and tech. People are learning not to take these folks seriously.

    Maybe the marketers should be a bit more picky about what they slap "AI" on, and maybe decision-makers should be a little less eager to follow whatever Better Autocomplete spits out. But maybe that's just me, and we really should be pretending that all these algorithms have made humans obsolete and that generating convincing language is better than correspondence with reality.

  • Maybe the marketers should be a bit more picky about what they slap "AI" on, and maybe decision-makers should be a little less eager to follow whatever Better Autocomplete spits out. But maybe that's just me, and we really should be pretending that all these algorithms have made humans obsolete and that generating convincing language is better than correspondence with reality.

    I’m not sure the anti-AI marketing stance is any more solid of a position. Though it’s probably easier to defend, since it’s so vague and not based on anything measurable.

  • I’m not sure the anti-AI marketing stance is any more solid of a position. Though it’s probably easier to defend, since it’s so vague and not based on anything measurable.

    Calling AI measurable is somewhat unfounded. Between not having a coherent, agreed-upon definition of what does and does not constitute an AI (we are, after all, discussing LLMs as though they were AGI) and the difficulty of pinning down what qualifies as human intelligence, saying that a given metric captures how well something qualifies as AI isn't really founded on anything but preference. We could, for example, say that mathematical ability is indicative of intelligence, but claiming FLOPS is a proxy for intelligence falls rather flat. We can measure things about the various algorithms, but that's an awfully long way off from talking about AI itself (unless we've bought into the marketing hype).

  • Calling AI measurable is somewhat unfounded. Between not having a coherent, agreed-upon definition of what does and does not constitute an AI (we are, after all, discussing LLMs as though they were AGI) and the difficulty of pinning down what qualifies as human intelligence, saying that a given metric captures how well something qualifies as AI isn't really founded on anything but preference. We could, for example, say that mathematical ability is indicative of intelligence, but claiming FLOPS is a proxy for intelligence falls rather flat. We can measure things about the various algorithms, but that's an awfully long way off from talking about AI itself (unless we've bought into the marketing hype).

    So you’re saying the article’s measurements about AI agents being wrong 70% of the time are made up? Or is AI performance only measurable when the results help anti-AI narratives?

  • This post did not contain any content.

    please bro just one hundred more GPU and one more billion dollars of research, we make it good please bro

  • It’s usually vastly easier to verify an answer than posit one, if you have the patience to do so.

    I usually write 3x the code to test the code itself. Verification is often harder than implementation.

    It really depends on the context. Sometimes there are domains that require solving problems in NP, but where it turns out that most of these problems are actually not hard to solve by hand with a bit of tinkering. SAT solvers might completely fail, but humans can do it. Often that means there's a better algorithm that can exploit commonalities in the data. But a brute-force approach might just be to give the problem to an LLM and then verify its answer: verifying a candidate solution to an NP problem is easy.

    (This is speculation.)
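
    To make the "verifying is easy" point concrete, here's a minimal sketch (my own illustration, using the DIMACS sign convention for literals): checking a proposed SAT assignment takes linear time, even though finding one is NP-hard.

    ```python
    # A CNF formula is a list of clauses; each clause is a list of signed
    # ints (3 means x3, -3 means NOT x3), as in the DIMACS format.

    def verify_sat(clauses: list[list[int]], assignment: dict[int, bool]) -> bool:
        """Return True iff every clause has at least one satisfied literal."""
        return all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in clauses
        )

    # (x1 OR NOT x2) AND (x2 OR x3)
    formula = [[1, -2], [2, 3]]
    print(verify_sat(formula, {1: True, 2: False, 3: True}))    # True
    print(verify_sat(formula, {1: False, 2: False, 3: False}))  # False
    ```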

  • being able to do 30% of tasks successfully is already useful.

    If you have a good testing program, it can be.

    If you use AI to write the test cases...? I wouldn't fly on that airplane.

    obviously

  • Run something with a 70% failure rate 10x and you get to a cumulative 98% pass rate.
    LLMs don't get tired and they can be run in parallel.

    The problem is they are not i.i.d., so this doesn't really work. It works a bit, which is in my opinion why chain-of-thought is effective (it gives the LLM a chance to posit a couple answers first). However, we're already looking at "agents," so they're probably already doing chain-of-thought.
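
    For reference, the quoted 98% figure comes from assuming the attempts are independent, and even then the arithmetic lands closer to 97%. A quick sanity check (my own arithmetic, not from the article):

    ```python
    # P(at least one success in n tries) = 1 - p_fail**n, assuming i.i.d. tries.
    p_fail = 0.7
    for n in (1, 5, 10):
        print(n, round(1 - p_fail ** n, 3))
    # 1 -> 0.3, 5 -> 0.832, 10 -> 0.972  (~97%, and only if failures are independent)
    ```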

  • I have actually been doing this lately: iteratively prompting the AI to write software and fix its errors until something useful comes out. It's a lot like machine translation. I speak fluent C++ but I don't speak Rust, yet I can hammer away at the AI (with English-language prompts) until it produces passable Rust for something I could write myself in C++ in half the time and effort.

    I also don't speak Finnish, but Google Translate can take what I say in English and put it into at least somewhat comprehensible Finnish without egregious translation errors most of the time.

    Is this useful? When C++ is getting banned for "security concerns" and Rust is the required language, it's at least a little helpful.

    I'm impressed you can make strides with Rust with AI. I am in a similar boat, except I've found LLMs are terrible with Rust.

  • No, it matters. You're pushing the lie they want pushed.

    Hitler liked to paint; that doesn't make painting wrong. The fact that big tech is pushing AI isn't evidence against the utility of AI.

    That it's common parlance to call machine learning "AI" these days doesn't matter to me in the slightest. Do you have a definition of "intelligence"? Do you object when pathfinding is called AI? Or STRIPS? Or bots in a video game? Dare I say it, the main difference between those AIs and LLMs is their generality -- so why not just call it GAI at this point, tbh. This is a question of semantics, so it really doesn't matter to the deeper question. It doesn't matter if you call it AI or not; LLMs work the same way either way.

  • So you’re saying the article’s measurements about AI agents being wrong 70% of the time are made up? Or is AI performance only measurable when the results help anti-AI narratives?

    I would definitely bet it's made up and poorly designed.

    I wish that weren't the case, because having actual data would be nice, but these studies are almost always funded with some sort of intentional slant. For example, nicotine vape safety, where they clearly don't use the product sanely and then make wild claims about how there's lead in the vapes!

    Homie, you're fucking running the shit completely dry for longer than any human could possibly actually hit the vape; no shit it's producing carcinogens.

    Go burn a bunch of paper and directly inhale the smoke and tell me paper is dangerous.
