Skip to content

AI agents wrong ~70% of time: Carnegie Mellon study

Technology
272 107 79
  • That’s literally how “AI agents” are being marketed. “Tell it to do a thing and it will do it for you.”

    So? That doesn't mean they are supposed to be used like that.

    Show me any marketing that isn't full of lies.

  • The first half dozen times I tried AI for code, across the past year or so, it failed pretty much as you describe.

    Finally, I hit on some things it can do. For me: keeping the instructions more general, not specifying certain libraries for instance, was the key to getting something that actually does something. Also, if it doesn't show you the whole program, get it to show you the whole thing, and make it fix its own mistakes so you can build on working code with later requests.

    Have you tried insulting the AI in the system prompt (as well as other tunes to the system prompt)?

    I'm not joking, it really works

    For example:

    Instead of "You are an intelligent coding assistant..."

    "You are an absolute fucking idiot who can barely code..."

  • Emotion > Facts. Most people have been trained to blindly accept things and cheer on what fits with their agenda. Like technbro's exaggerating LLMs, or people like you misrepresenting LLMs as mere statistical word generators without intelligence. That's like saying a computer is just wires and switches, or missing the forest for the trees. Both is equally false.

    Yet if it fits with the emotional needs or with dogma, then other will agree. It's a convenient and comforting "A vs B" worldview we've been trained to accept. And so the satisfying notion and misinformation keeps spreading.

    LLMs tell us more about human intelligence and the human slop we've been generating. It tells us that most people are not that much more than statistical word generators.

    people like you misrepresenting LLMs as mere statistical word generators without intelligence.

    You've bought-in to the hype. I won't try to argue with you because you aren't cognizent of reality.

  • This post did not contain any content.

    We have created the overconfident intern in digital form.

  • When LLMs get it right it's because they're summarizing a stack overflow or GitHub snippet it was trained on. But you loose all the benefits of other humans commenting on the context, pitfalls and other alternatives.

    You’re not wrong, but often I’m just trying to do something I’ve done a thousand times before and I already know the pitfalls. Also, I’m sure I’ve copied code from stackoverflow before.

  • This post did not contain any content.

    Hey I went there

  • people like you misrepresenting LLMs as mere statistical word generators without intelligence.

    You've bought-in to the hype. I won't try to argue with you because you aren't cognizent of reality.

    You're projecting. Every accusation is a confession.

  • Have you tried insulting the AI in the system prompt (as well as other tunes to the system prompt)?

    I'm not joking, it really works

    For example:

    Instead of "You are an intelligent coding assistant..."

    "You are an absolute fucking idiot who can barely code..."

    “You are an absolute fucking idiot who can barely code…”

    Honestly, that's what you have to do. It's the only way I can get through using Claude.ai. I treat it like it's an absolute moron, I insult it, I "yell" at it, I threaten it and guess what? the solutions have gotten better. not great but a hell of a lot better than what they used to be. It really works. it forces it to really think through the problem, research solutions, cite sources, etc. I have even told it i'll cancel my subscription to it if it gets it wrong.

    no more "do this and this and then this but do this first and then do this" after calling it a "fucking moron" and what have you it will provide an answer and just say "done."

  • “You are an absolute fucking idiot who can barely code…”

    Honestly, that's what you have to do. It's the only way I can get through using Claude.ai. I treat it like it's an absolute moron, I insult it, I "yell" at it, I threaten it and guess what? the solutions have gotten better. not great but a hell of a lot better than what they used to be. It really works. it forces it to really think through the problem, research solutions, cite sources, etc. I have even told it i'll cancel my subscription to it if it gets it wrong.

    no more "do this and this and then this but do this first and then do this" after calling it a "fucking moron" and what have you it will provide an answer and just say "done."

    This guy is the moral lesson at the start of the apocalypse movie

  • This post did not contain any content.

    This is the same kind of short-sighted dismissal I see a lot in the religion vs science argument. When they hinge their pro-religion stance on the things science can’t explain, they’re defending an ever diminishing territory as science grows to explain more things. It’s a stupid strategy with an expiration date on your position.

    All of the anti-AI positions, that hinge on the low quality or reliability of the output, are defending an increasingly diminished stance as the AI’s are further refined. And I simply don’t believe that the majority of the people making this argument actually care about the quality of the output. Even when it gets to the point of producing better output than humans across the board, these folks are still going to oppose it regardless. Why not just openly oppose it in general, instead of pinning your position to an argument that grows increasingly irrelevant by the day?

    DeepSeek exposed the same issue with the anti-AI people dedicated to the environmental argument. We were shown proof that there’s significant progress in the development of efficient models, and it still didn’t change any of their minds. Because most of them don’t actually care about the environmental impacts. It’s just an anti-AI talking point that resonated with them.

    The more baseless these anti-AI stances get, the more it seems to me that it’s a lot of people afraid of change and afraid of the fundamental economic shifts this will require, but they’re embarrassed or unable to articulate that stance. And it doesn’t help that the luddites haven’t been able to predict a single development. Just constantly flailing to craft a new argument to criticize the current models and tech. People are learning not to take these folks seriously.

  • Have you tried insulting the AI in the system prompt (as well as other tunes to the system prompt)?

    I'm not joking, it really works

    For example:

    Instead of "You are an intelligent coding assistant..."

    "You are an absolute fucking idiot who can barely code..."

    I frequently find myself prompting it: "now show me the whole program with all the errors corrected." Sometimes I have to ask that two or three times, different ways, before it coughs up the next iteration ready to copy-paste-test. Most times when it gives errors I'll just write "address: " and copy-paste the error message in - frequently the text of the AI response will apologize, less frequently it will actually fix the error.

  • This guy is the moral lesson at the start of the apocalypse movie

    He's developing a toxic relationship with his AI agent. I don't think it's the best way to get what you want (demonstrating how to be abusive to the AI), but maybe it's the only method he is capable of getting results with.

  • This is the same kind of short-sighted dismissal I see a lot in the religion vs science argument. When they hinge their pro-religion stance on the things science can’t explain, they’re defending an ever diminishing territory as science grows to explain more things. It’s a stupid strategy with an expiration date on your position.

    All of the anti-AI positions, that hinge on the low quality or reliability of the output, are defending an increasingly diminished stance as the AI’s are further refined. And I simply don’t believe that the majority of the people making this argument actually care about the quality of the output. Even when it gets to the point of producing better output than humans across the board, these folks are still going to oppose it regardless. Why not just openly oppose it in general, instead of pinning your position to an argument that grows increasingly irrelevant by the day?

    DeepSeek exposed the same issue with the anti-AI people dedicated to the environmental argument. We were shown proof that there’s significant progress in the development of efficient models, and it still didn’t change any of their minds. Because most of them don’t actually care about the environmental impacts. It’s just an anti-AI talking point that resonated with them.

    The more baseless these anti-AI stances get, the more it seems to me that it’s a lot of people afraid of change and afraid of the fundamental economic shifts this will require, but they’re embarrassed or unable to articulate that stance. And it doesn’t help that the luddites haven’t been able to predict a single development. Just constantly flailing to craft a new argument to criticize the current models and tech. People are learning not to take these folks seriously.

    Maybe the marketers should be a bit more picky about what they slap "AI" on and maybe decision makers should be a little less eager to follow whatever Better Auto complete spits out, but maybe that's just me and we really should be pretending that all these algorithms really have made humans obsolete and generating convincing language is better than correspondence with reality.

  • Maybe the marketers should be a bit more picky about what they slap "AI" on and maybe decision makers should be a little less eager to follow whatever Better Auto complete spits out, but maybe that's just me and we really should be pretending that all these algorithms really have made humans obsolete and generating convincing language is better than correspondence with reality.

    I’m not sure the anti-AI marketing stance is any more solid of a position. Though it’s probably easier to defend, since it’s so vague and not based on anything measurable.

  • I’m not sure the anti-AI marketing stance is any more solid of a position. Though it’s probably easier to defend, since it’s so vague and not based on anything measurable.

    Calling AI measurable is somewhat unfounded. Between not having a coherent, agreed-upon definition of what does and does not constitute an AI (we are, after all, discussing LLMs as though they were AGI), and the difficulty that exists in discussing the qualifications of human intelligence, saying that a given metric covers how well a thing is an AI isn't really founded on anything but preference. We could, for example, say that mathematical ability is indicative of intelligence, but claiming FLOPS is a proxy for intelligence falls rather flat. We can measure things about the various algorithms, but that's an awful long ways off from talking about AI itself (unless we've bought into the marketing hype).

  • Calling AI measurable is somewhat unfounded. Between not having a coherent, agreed-upon definition of what does and does not constitute an AI (we are, after all, discussing LLMs as though they were AGI), and the difficulty that exists in discussing the qualifications of human intelligence, saying that a given metric covers how well a thing is an AI isn't really founded on anything but preference. We could, for example, say that mathematical ability is indicative of intelligence, but claiming FLOPS is a proxy for intelligence falls rather flat. We can measure things about the various algorithms, but that's an awful long ways off from talking about AI itself (unless we've bought into the marketing hype).

    So you’re saying the article’s measurements about AI agents being wrong 70% of the time is made up? Or is AI performance only measurable when the results help anti-AI narratives?

  • This post did not contain any content.

    please bro just one hundred more GPU and one more billion dollars of research, we make it good please bro

  • It’s usually vastly easier to verify an answer than posit one, if you have the patience to do so.

    I usually write 3x the code to test the code itself. Verification is often harder than implementation.

    It really depends on the context. Sometimes there are domains which require solving problems in NP, but where it turns out that most of these problems are actually not hard to solve by hand with a bit of tinkering. SAT solvers might completely fail, but humans can do it. Often it turns out that this means there's a better algorithm that can exploit commanalities in the data. But a brute force approach might just be to give it to an LLM and then verify its answer. Verifying NP problems is easy.

    (This is speculation.)

  • being able to do 30% of tasks successfully is already useful.

    If you have a good testing program, it can be.

    If you use AI to write the test cases...? I wouldn't fly on that airplane.

    obviously

  • Run something with a 70% failure rate 10x and you get to a cumulative 98% pass rate.
    LLMs don't get tired and they can be run in parallel.

    The problem is they are not i.i.d., so this doesn't really work. It works a bit, which is in my opinion why chain-of-thought is effective (it gives the LLM a chance to posit a couple answers first). However, we're already looking at "agents," so they're probably already doing chain-of-thought.

  • 175 Stimmen
    9 Beiträge
    16 Aufrufe
    E
    I'm sorry but that capitalisation is really off-putting. You're Not Writing A Headline You Know
  • 32 Stimmen
    6 Beiträge
    40 Aufrufe
    G
    Yes. I can't imagine that they will go after individuals. Businesses can't be so cavalier. But if creators don't pay the extra cost to make their models compliant with EU law, then they can't be used in the EU anyway. So it probably doesn't matter much. The Llama models with vision have the no-EU clause. It's because Meta wasn't allowed to train on European's data because of GDPR. The pure LLMs are fine. They might even be compliant, but we'll have to see what the courts think.
  • 2 Stimmen
    3 Beiträge
    17 Aufrufe
    vanth@reddthat.comV
    I only vacation in countries that have trained their LLMs to use line breaks.
  • Firefox is dead to me – and I'm not the only one who is fed up

    Technology technology
    55
    1
    45 Stimmen
    55 Beiträge
    154 Aufrufe
    F
    Never had issue with Firefox in my day to day use, sites load fine, uBlock stops all the annoyances and thankfully youtube works well for me.
  • 61 Stimmen
    17 Beiträge
    64 Aufrufe
    anzo@programming.devA
    I’ll probably never trust anything they’ve touched until I’ve taken it apart and put it back together again. Me too. But the vast majority of users need guardrails, and have a different threat model. Even those that also care about privacy, if they just want a solution that comes by default, this adtech 'fake' or 'superficial' solution does provide something. And anything is more than nothing.
  • France considers requiring Musk’s X to verify users’ age

    Technology technology
    20
    1
    142 Stimmen
    20 Beiträge
    76 Aufrufe
    C
    TBH, age verification services exist. If it becomes law, integrating them shouldn't be more difficult than integrating a OIDC login. So everyone should be able to do it. Depending on these services, you might not even need to give a name, or, because they are separate entities, don't give your name to the platform using them. Other parts of regulation are more difficult. Like these "upload filters" that need to figure out if something shared via a service is violating any copyright before it is made available.
  • 479 Stimmen
    81 Beiträge
    246 Aufrufe
    douglasg14b@lemmy.worldD
    Did I say that it did? No? Then why the rhetorical question for something that I never stated? Now that we're past that, I'm not sure if I think it's okay, but I at least recognize that it's normalized within society. And has been for like 70+ years now. The problem happens with how the data is used, and particularly abused. If you walk into my store, you expect that I am monitoring you. You expect that you are on camera and that your shopping patterns, like all foot traffic, are probably being analyzed and aggregated. What you buy is tracked, at least in aggregate, by default really, that's just volume tracking and prediction. Suffice to say that broad customer behavior analysis has been a thing for a couple generations now, at least. When you go to a website, why would you think that it is not keeping track of where you go and what you click on in the same manner? Now that I've stated that I do want to say that the real problems that we experience come in with how this data is misused out of what it's scope should be. And that we should have strong regulatory agencies forcing compliance of how this data is used and enforcing the right to privacy for people that want it removed.
  • 92 Stimmen
    42 Beiträge
    14 Aufrufe
    G
    You don’t understand. The tracking and spying is the entire point of the maneuver. The ‘children are accessing porn’ thing is just a Trojan horse to justify the spying. I understand what are you saying, I simply don't consider to check if a law is applied as a Trojan horse in itself. I would agree if the EU had said to these sites "give us all the the access log, a list of your subscriber, every data you gather and a list of every IP it ever connected to your site", and even this way does not imply that with only the IP you could know who the user is without even asking the telecom company for help. So, is it a Trojan horse ? Maybe, it heavily depend on how the EU want to do it. If they just ask "show me how you try to avoid that a minor access your material", which normally is the fist step, I don't see how it could be a Trojan horse. It could become, I agree on that. As you pointed out, it’s already illegal for them to access it, and parents are legally required to prevent their children from accessing it. No, parents are not legally required to prevent it. The seller (or provider) is legally required. It is a subtle but important difference. But you don’t lock down the entire population, or institute pre-crime surveillance policies, just because some parents are not going to follow the law. True. You simply impose laws that make mandatories for the provider to check if he can sell/serve something to someone. I mean asking that the cashier of mall check if I am an adult when I buy a bottle of wine is no different than asking to Pornhub to check if the viewer is an adult. I agree that in one case is really simple and in the other is really hard (and it is becoming harder by the day). You then charge the guilty parents after the offense. Ok, it would work, but then how do you caught the offendind parents if not checking what everyone do ? Is it not simpler to try to prevent it instead ?