AI agents wrong ~70% of time: Carnegie Mellon study

  • I'm in a workplace that has tried not to be overbearing about AI, but has encouraged us to use AI assistants for coding.

    I've tried to give mine some very simple tasks like writing a unit test just for the constructor of a class to verify current behavior, and it generates output that's both wrong and doesn't verify anything.

    I'm aware it sometimes gets better with more intricate, specific instructions, and that I can offer it further corrections, but at that point it's not even saving time. I would do this with a human in the hope that they would retain the knowledge, but I have no hope that AI will apply those lessons in new contexts. In a way, it's been a relief to realize that, just like the dot-com boom, 3D TVs, and home smart assistants, it is a bubble.

    The first half dozen times I tried AI for code, across the past year or so, it failed pretty much as you describe.

    Finally, I hit on some things it can do. For me: keeping the instructions more general, not specifying certain libraries for instance, was the key to getting something that actually does something. Also, if it doesn't show you the whole program, get it to show you the whole thing, and make it fix its own mistakes so you can build on working code with later requests.

  • No, it matters. You're pushing the lie they want pushed.

    And you're pushing a hate train with no aspect of nuance to show for it.

    Seems like you are even less than 30% useful. And that is mainly because you can be used as fertilizer.

  • and doesn't need to be exactly right

    What kind of tasks do you consider that don't need to be exactly right?

    Description generators for TTRPGs, as you will read through them afterwards anyway and correct when necessary.

    Generating lists of ideas. For creative writing, getting a bunch of ideas you can pick and choose from that fit the narrative you want.

    A search engine like Perplexity.ai, which summarizes each web page after searching and adds a link to the page next to the summary. If the summary seems promising, you go to the real page to verify the actual information.

    Simple code like HTML pages and boilerplate code that you will still review afterwards anyway.

  • When LLMs get it right, it's because they're summarizing a Stack Overflow or GitHub snippet they were trained on. But you lose all the benefits of other humans commenting on the context, pitfalls and other alternatives.

    You mean things you had to do anyway even if you didn't use LLMs?

  • That’s literally how “AI agents” are being marketed. “Tell it to do a thing and it will do it for you.”

    So? That doesn't mean they are supposed to be used like that.

    Show me any marketing that isn't full of lies.

  • The first half dozen times I tried AI for code, across the past year or so, it failed pretty much as you describe.

    Finally, I hit on some things it can do. For me: keeping the instructions more general, not specifying certain libraries for instance, was the key to getting something that actually does something. Also, if it doesn't show you the whole program, get it to show you the whole thing, and make it fix its own mistakes so you can build on working code with later requests.

    Have you tried insulting the AI in the system prompt (as well as other tunes to the system prompt)?

    I'm not joking, it really works

    For example:

    Instead of "You are an intelligent coding assistant..."

    "You are an absolute fucking idiot who can barely code..."

  • Emotion > Facts. Most people have been trained to blindly accept things and cheer on what fits with their agenda. Like tech bros exaggerating LLMs, or people like you misrepresenting LLMs as mere statistical word generators without intelligence. That's like saying a computer is just wires and switches, or missing the forest for the trees. Both are equally false.

    Yet if it fits with emotional needs or with dogma, then others will agree. It's a convenient and comforting "A vs B" worldview we've been trained to accept. And so the satisfying notion and the misinformation keep spreading.

    LLMs tell us more about human intelligence and the human slop we've been generating. They tell us that most people are not that much more than statistical word generators.

    people like you misrepresenting LLMs as mere statistical word generators without intelligence.

    You've bought into the hype. I won't try to argue with you because you aren't cognizant of reality.

  • We have created the overconfident intern in digital form.

  • When LLMs get it right, it's because they're summarizing a Stack Overflow or GitHub snippet they were trained on. But you lose all the benefits of other humans commenting on the context, pitfalls and other alternatives.

    You’re not wrong, but often I’m just trying to do something I’ve done a thousand times before and I already know the pitfalls. Also, I’m sure I’ve copied code from Stack Overflow before.

  • Hey I went there

  • people like you misrepresenting LLMs as mere statistical word generators without intelligence.

    You've bought into the hype. I won't try to argue with you because you aren't cognizant of reality.

    You're projecting. Every accusation is a confession.

  • Have you tried insulting the AI in the system prompt (as well as other tunes to the system prompt)?

    I'm not joking, it really works

    For example:

    Instead of "You are an intelligent coding assistant..."

    "You are an absolute fucking idiot who can barely code..."

    Honestly, that's what you have to do. It's the only way I can get through using Claude.ai. I treat it like it's an absolute moron, I insult it, I "yell" at it, I threaten it, and guess what? The solutions have gotten better. Not great, but a hell of a lot better than what they used to be. It really works. It forces it to really think through the problem, research solutions, cite sources, etc. I have even told it I'll cancel my subscription if it gets it wrong.

    No more "do this and this and then this but do this first and then do this." After I call it a "fucking moron" and what have you, it will provide an answer and just say "done."

  • “You are an absolute fucking idiot who can barely code…”

    Honestly, that's what you have to do. It's the only way I can get through using Claude.ai. I treat it like it's an absolute moron, I insult it, I "yell" at it, I threaten it, and guess what? The solutions have gotten better. Not great, but a hell of a lot better than what they used to be. It really works. It forces it to really think through the problem, research solutions, cite sources, etc. I have even told it I'll cancel my subscription if it gets it wrong.

    No more "do this and this and then this but do this first and then do this." After I call it a "fucking moron" and what have you, it will provide an answer and just say "done."

    This guy is the moral lesson at the start of the apocalypse movie

  • This is the same kind of short-sighted dismissal I see a lot in the religion vs. science argument. When they hinge their pro-religion stance on the things science can’t explain, they’re defending an ever-diminishing territory as science grows to explain more things. It’s a stupid strategy with an expiration date on your position.

    All of the anti-AI positions that hinge on the low quality or reliability of the output are defending an increasingly diminished stance as the AIs are further refined. And I simply don’t believe that the majority of the people making this argument actually care about the quality of the output. Even when it gets to the point of producing better output than humans across the board, these folks are still going to oppose it regardless. Why not just openly oppose it in general, instead of pinning your position to an argument that grows increasingly irrelevant by the day?

    DeepSeek exposed the same issue with the anti-AI people dedicated to the environmental argument. We were shown proof that there’s significant progress in the development of efficient models, and it still didn’t change any of their minds. Because most of them don’t actually care about the environmental impacts. It’s just an anti-AI talking point that resonated with them.

    The more baseless these anti-AI stances get, the more it seems to me that it’s a lot of people afraid of change and afraid of the fundamental economic shifts this will require, but they’re embarrassed or unable to articulate that stance. And it doesn’t help that the Luddites haven’t been able to predict a single development. They're just constantly flailing to craft a new argument to criticize the current models and tech. People are learning not to take these folks seriously.

  • Have you tried insulting the AI in the system prompt (as well as other tunes to the system prompt)?

    I'm not joking, it really works

    For example:

    Instead of "You are an intelligent coding assistant..."

    "You are an absolute fucking idiot who can barely code..."

    I frequently find myself prompting it: "now show me the whole program with all the errors corrected." Sometimes I have to ask that two or three times, in different ways, before it coughs up the next iteration ready to copy-paste-test. Most times when it gives errors I'll just write "address: " and paste in the error message. Frequently the text of the AI response will apologize; less frequently it will actually fix the error.

  • This guy is the moral lesson at the start of the apocalypse movie

    He's developing a toxic relationship with his AI agent. I don't think it's the best way to get what you want (demonstrating how to be abusive to the AI), but maybe it's the only method he is capable of getting results with.

  • This is the same kind of short-sighted dismissal I see a lot in the religion vs. science argument. When they hinge their pro-religion stance on the things science can’t explain, they’re defending an ever-diminishing territory as science grows to explain more things. It’s a stupid strategy with an expiration date on your position.

    All of the anti-AI positions that hinge on the low quality or reliability of the output are defending an increasingly diminished stance as the AIs are further refined. And I simply don’t believe that the majority of the people making this argument actually care about the quality of the output. Even when it gets to the point of producing better output than humans across the board, these folks are still going to oppose it regardless. Why not just openly oppose it in general, instead of pinning your position to an argument that grows increasingly irrelevant by the day?

    DeepSeek exposed the same issue with the anti-AI people dedicated to the environmental argument. We were shown proof that there’s significant progress in the development of efficient models, and it still didn’t change any of their minds. Because most of them don’t actually care about the environmental impacts. It’s just an anti-AI talking point that resonated with them.

    The more baseless these anti-AI stances get, the more it seems to me that it’s a lot of people afraid of change and afraid of the fundamental economic shifts this will require, but they’re embarrassed or unable to articulate that stance. And it doesn’t help that the Luddites haven’t been able to predict a single development. They're just constantly flailing to craft a new argument to criticize the current models and tech. People are learning not to take these folks seriously.

    Maybe the marketers should be a bit more picky about what they slap "AI" on, and maybe decision makers should be a little less eager to follow whatever Better Autocomplete spits out. But maybe that's just me, and we really should be pretending that all these algorithms have made humans obsolete and that generating convincing language is better than correspondence with reality.

  • Maybe the marketers should be a bit more picky about what they slap "AI" on, and maybe decision makers should be a little less eager to follow whatever Better Autocomplete spits out. But maybe that's just me, and we really should be pretending that all these algorithms have made humans obsolete and that generating convincing language is better than correspondence with reality.

    I’m not sure the anti-AI marketing stance is any more solid of a position. Though it’s probably easier to defend, since it’s so vague and not based on anything measurable.

  • I’m not sure the anti-AI marketing stance is any more solid of a position. Though it’s probably easier to defend, since it’s so vague and not based on anything measurable.

    Calling AI measurable is somewhat unfounded. Between not having a coherent, agreed-upon definition of what does and does not constitute an AI (we are, after all, discussing LLMs as though they were AGI), and the difficulty that exists in discussing the qualifications of human intelligence, saying that a given metric covers how well a thing is an AI isn't really founded on anything but preference. We could, for example, say that mathematical ability is indicative of intelligence, but claiming FLOPS is a proxy for intelligence falls rather flat. We can measure things about the various algorithms, but that's an awful long ways off from talking about AI itself (unless we've bought into the marketing hype).

  • Calling AI measurable is somewhat unfounded. Between not having a coherent, agreed-upon definition of what does and does not constitute an AI (we are, after all, discussing LLMs as though they were AGI), and the difficulty that exists in discussing the qualifications of human intelligence, saying that a given metric covers how well a thing is an AI isn't really founded on anything but preference. We could, for example, say that mathematical ability is indicative of intelligence, but claiming FLOPS is a proxy for intelligence falls rather flat. We can measure things about the various algorithms, but that's an awful long ways off from talking about AI itself (unless we've bought into the marketing hype).

    So you’re saying the article’s measurements of AI agents being wrong 70% of the time are made up? Or is AI performance only measurable when the results help anti-AI narratives?
