linux-nerds.org

AI agents wrong ~70% of time: Carnegie Mellon study

Technology

272 posts 107 commenters 78 views

R rozodru@lemmy.world

“You are an absolute fucking idiot who can barely code…”

Honestly, that's what you have to do. It's the only way I can get through using Claude.ai. I treat it like it's an absolute moron, I insult it, I "yell" at it, I threaten it, and guess what? The solutions have gotten better. Not great, but a hell of a lot better than what they used to be. It really works. It forces it to really think through the problem, research solutions, cite sources, etc. I have even told it I'll cancel my subscription if it gets it wrong.

No more "do this and this and then this but do this first and then do this." After calling it a "fucking moron" and what have you, it will provide an answer and just say "done."
dragontypewyvern@midwest.social

wrote

#151

This guy is the moral lesson at the start of the apocalypse movie
1 reply

13
E eli001@lemmy.world

This post did not contain any content.
surph_ninja@lemmy.world

wrote, last edited by surph_ninja@lemmy.world

#152

This is the same kind of short-sighted dismissal I see a lot in the religion vs science argument. When they hinge their pro-religion stance on the things science can’t explain, they’re defending an ever diminishing territory as science grows to explain more things. It’s a stupid strategy with an expiration date on your position.

All of the anti-AI positions that hinge on the low quality or reliability of the output are defending an increasingly diminished stance as the AIs are further refined. And I simply don’t believe that the majority of the people making this argument actually care about the quality of the output. Even when it gets to the point of producing better output than humans across the board, these folks are still going to oppose it regardless. Why not just openly oppose it in general, instead of pinning your position to an argument that grows increasingly irrelevant by the day?

DeepSeek exposed the same issue with the anti-AI people dedicated to the environmental argument. We were shown proof that there’s significant progress in the development of efficient models, and it still didn’t change any of their minds. Because most of them don’t actually care about the environmental impacts. It’s just an anti-AI talking point that resonated with them.

The more baseless these anti-AI stances get, the more it seems to me that it’s a lot of people afraid of change and afraid of the fundamental economic shifts this will require, but they’re embarrassed or unable to articulate that stance. And it doesn’t help that the luddites haven’t been able to predict a single development. Just constantly flailing to craft a new argument to criticize the current models and tech. People are learning not to take these folks seriously.
2 replies

4
V vivendi@programming.dev

Have you tried insulting the AI in the system prompt (as well as other tweaks to the system prompt)?

I'm not joking, it really works

For example:

Instead of "You are an intelligent coding assistant..."

"You are an absolute fucking idiot who can barely code..."
mangocats@feddit.it

wrote

#153

I frequently find myself prompting it: "now show me the whole program with all the errors corrected." Sometimes I have to ask that two or three times, different ways, before it coughs up the next iteration ready to copy-paste-test. Most times when it gives errors I'll just write "address: " and copy-paste the error message in - frequently the text of the AI response will apologize, less frequently it will actually fix the error.
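For what it's worth, that prompt-run-paste-the-error cycle can be sketched roughly like this. It is only an illustration: `ask_model` and `run_program` are hypothetical callables standing in for whatever chat interface and test setup is actually in use, and `fake_model` is a stub for the demo, not a real LLM.

```python
from typing import Callable, Optional


def iterate_until_passing(
    ask_model: Callable[[str], str],
    run_program: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Prompt, run, and feed errors back until the program runs cleanly.

    ask_model and run_program are hypothetical stand-ins: ask_model maps a
    prompt to source code, run_program returns an error message or None.
    """
    prompt = task
    for _ in range(max_rounds):
        source = ask_model(prompt)
        error = run_program(source)
        if error is None:
            return source  # next iteration is ready for a human to test
        # Mirror the "address: <error message>" style of follow-up prompt.
        prompt = (
            f"address: {error}\n"
            "now show me the whole program with all the errors corrected."
        )
    return None  # gave up: no passing version within max_rounds


def run_python(source: str) -> Optional[str]:
    """Execute source in a scratch namespace; return the error text, if any."""
    try:
        exec(source, {})
    except Exception as exc:  # demo only: report any failure back as text
        return f"{type(exc).__name__}: {exc}"
    return None


def fake_model(prompt: str) -> str:
    """Stub model: buggy on the first prompt, fixed once it sees an error."""
    if prompt.startswith("address:"):
        return "total = sum([1, 2, 3])"
    return "total = sum([1, 2, '3'])"  # raises TypeError on purpose


result = iterate_until_passing(fake_model, run_python, "sum a list of numbers")
```

The `max_rounds` cap matters: as noted above, the response often apologizes without actually fixing anything, so the loop needs a point at which it gives up and hands the problem back to a person.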
1 reply

4
mangocats@feddit.it

wrote

#154

He's developing a toxic relationship with his AI agent. I don't think it's the best way to get what you want (demonstrating how to be abusive to the AI), but maybe it's the only method he is capable of getting results with.
1 reply

4
chaonaut@lemmy.4d2.org

wrote

#155

Maybe the marketers should be a bit more picky about what they slap "AI" on, and maybe decision makers should be a little less eager to follow whatever Better Autocomplete spits out, but maybe that's just me and we really should be pretending that all these algorithms really have made humans obsolete and that generating convincing language is better than correspondence with reality.
1 reply

5
surph_ninja@lemmy.world

wrote

#156

I’m not sure the anti-AI marketing stance is any more solid of a position. Though it’s probably easier to defend, since it’s so vague and not based on anything measurable.
1 reply

3
chaonaut@lemmy.4d2.org

wrote

#157

Calling AI measurable is somewhat unfounded. Between not having a coherent, agreed-upon definition of what does and does not constitute an AI (we are, after all, discussing LLMs as though they were AGI), and the difficulty that exists in discussing the qualifications of human intelligence, saying that a given metric covers how well a thing is an AI isn't really founded on anything but preference. We could, for example, say that mathematical ability is indicative of intelligence, but claiming FLOPS is a proxy for intelligence falls rather flat. We can measure things about the various algorithms, but that's an awful long ways off from talking about AI itself (unless we've bought into the marketing hype).
1 reply

3
surph_ninja@lemmy.world

wrote, last edited by surph_ninja@lemmy.world

#158

So you’re saying the article’s measurements about AI agents being wrong 70% of the time are made up? Or is AI performance only measurable when the results help anti-AI narratives?
2 replies

0
fogetaboutit@programming.dev

wrote

#159

please bro just one hundred more GPU and one more billion dollars of research, we make it good please bro
2 replies

78
Z zbyte64@awful.systems

It’s usually vastly easier to verify an answer than posit one, if you have the patience to do so.

I usually write 3x the code to test the code itself. Verification is often harder than implementation.
jsomae@lemmy.ml

wrote, last edited by jsomae@lemmy.ml

#160

It really depends on the context. Sometimes there are domains which require solving problems in NP, but where it turns out that most of these problems are actually not hard to solve by hand with a bit of tinkering. SAT solvers might completely fail, but humans can do it. Often it turns out that this means there's a better algorithm that can exploit commonalities in the data. But a brute force approach might just be to give it to an LLM and then verify its answer. Verifying a proposed solution to an NP problem is easy.

(This is speculation.)
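To make the "verifying is easy" part concrete, here is a small sketch of checking a proposed answer to a SAT instance. The clause encoding (signed integers, DIMACS-style) and the name `satisfies` are choices made for this example; the candidate assignment could come from anywhere, including an LLM.

```python
from typing import Dict, List

Clause = List[int]  # e.g. [1, -2] means (x1 OR NOT x2)


def satisfies(clauses: List[Clause], assignment: Dict[int, bool]) -> bool:
    """Check a candidate assignment in time linear in the formula size.

    Variables missing from the assignment are treated as False.
    """
    for clause in clauses:
        # A clause holds if at least one of its literals evaluates to True.
        if not any(assignment.get(abs(lit), False) == (lit > 0) for lit in clause):
            return False
    return True


# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
formula = [[1, -2], [2, 3], [-1, -3]]
candidate = {1: True, 2: True, 3: False}  # untrusted guess that happens to work
bad_guess = {1: True, 2: False, 3: True}  # violates the third clause

print(satisfies(formula, candidate))  # True
print(satisfies(formula, bad_guess))  # False
```

The check is a single pass over the clauses, so a wrong guess costs almost nothing to reject. Finding a satisfying assignment in the first place is the part with no known efficient general method.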
1 reply

1
M mangocats@feddit.it

being able to do 30% of tasks successfully is already useful.

If you have a good testing program, it can be.

If you use AI to write the test cases...? I wouldn't fly on that airplane.
jsomae@lemmy.ml

wrote

#161

obviously
1 reply

2
K knock_knock_lemmy_in@lemmy.world

Run something with a 70% failure rate 10x and you get to a cumulative 97% pass rate (1 - 0.7^10 ≈ 0.972).
LLMs don't get tired and they can be run in parallel.
jsomae@lemmy.ml

wrote

#162

The problem is they are not i.i.d., so this doesn't really work. It works a bit, which is in my opinion why chain-of-thought is effective (it gives the LLM a chance to posit a couple answers first). However, we're already looking at "agents," so they're probably already doing chain-of-thought.
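A quick sketch of why the i.i.d. assumption matters. The first function is the textbook "at least one success in n independent tries" formula. The second is a made-up mixture model (not from the study) in which a fixed share of tasks is never solved no matter how often you retry; the 40% / 75% split is purely illustrative, chosen so the single-try success rate still averages 30%.

```python
def pass_rate_independent(p_success: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_success) ** attempts


def pass_rate_mixture(share_solvable: float, p_if_solvable: float, attempts: int) -> float:
    """Hypothetical correlated case: unsolvable tasks fail on every retry."""
    return share_solvable * pass_rate_independent(p_if_solvable, attempts)


independent = pass_rate_independent(0.30, 10)   # 1 - 0.7**10
correlated = pass_rate_mixture(0.40, 0.75, 10)  # 0.40 * 0.75 = 0.30 per single try

print(f"independent retries: {independent:.3f}")  # 0.972
print(f"correlated retries:  {correlated:.3f}")   # 0.400
```

Same 30% single-attempt success rate in both cases, but under this toy model retries plateau at the share of tasks the model can solve at all.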
1 reply

1
M mangocats@feddit.it

I have actually been doing this lately: iteratively prompting AI to write software and fix its errors until something useful comes out. It's a lot like machine translation. I speak fluent C++ but not Rust, yet I can hammer away on the AI (with English-language prompts) until it produces passable Rust for something I could write myself in C++ in half the time and effort.

I also don't speak Finnish, but Google Translate can take what I say in English and put it into at least somewhat comprehensible Finnish without egregious translation errors most of the time.

Is this useful? When C++ is getting banned for "security concerns" and Rust is the required language, it's at least a little helpful.
jsomae@lemmy.ml

wrote

#163

I'm impressed you can make strides with Rust with AI. I am in a similar boat, except I've found LLMs are terrible with Rust.
1 reply

0
O outhouseperilous@lemmy.dbzer0.com

No, it matters. You're pushing the lie they want pushed.
jsomae@lemmy.ml

wrote

#164

Hitler liked to paint, doesn't make painting wrong. The fact that big tech is pushing AI isn't evidence against the utility of AI.

That common parlance is to call machine learning "AI" these days doesn't matter to me in the slightest. Do you have a definition of "intelligence"? Do you object when pathfinding is called AI? Or STRIPS? Or bots in a video game? Dare I say it, the main difference between those AIs and LLMs is their generality -- so why not just call it GAI at this point tbh. This is a question of semantics so it really doesn't matter to the deeper question. Doesn't matter if you call it AI or not, LLMs work the same way either way.
1 reply

0
jakeroxs@sh.itjust.works

wrote

#165

I would definitely bet it's made up and poorly designed.

I wish that weren't the case because having actual data would be nice, but these are almost always funded with some sort of intentional slant, for example nic vape safety where they clearly don't use the product sanely and then make wild claims about how there's lead in the vapes!

Homie you're fucking running the shit completely dry for longer than any human could possibly actually hit the vape, no shit it's producing carcinogens.

Go burn a bunch of paper and directly inhale the smoke and tell me paper is dangerous.
1 reply

2
surph_ninja@lemmy.world

wrote

#166

Agreed. 70% is astoundingly high for today’s models. Something stinks.
1 reply

1
B blackmist@feddit.uk

We have created the overconfident intern in digital form.
jumping_redditor@sh.itjust.works

wrote

#167

Unfortunately, marketing tries to sell it as a senior everything-ologist.
1 reply

14
T tja@programming.dev

DocumentDB is not for OneDrive documents (PDFs and such). It's for "documents" as in serialized objects (JSON or BSON).
shayeta@feddit.org

wrote

#168

That's even better, I can just jam something in before it and churn the documents through an embedding model, thanks!
1 reply

1
socialmediarefugee@lemmy.world

wrote, last edited by socialmediarefugee@lemmy.world

#169

I use it for very specific tasks and give it as much information as possible. I usually have to give it more feedback to get to the desired goal. For instance, I will ask it how to resolve an error message. I've even asked it for some short Python code. I almost always get good results when doing that. Asking it about basic facts, like science questions, works too.

One thing I have had problems with: if the error is sort of an oddball, it will give me suggestions that don't work with my OS/app version even though I gave it that info. Then I give it feedback and eventually it loops back to its original suggestions, so it couldn't come up with an answer.

I've also found differences between ChatGPT and MS Copilot, with ChatGPT usually giving better results.
1 reply

2
socialmediarefugee@lemmy.world

wrote

#170

And let it suck up 10% or so of all of the power in the region.
1 reply

12
