
AI agents wrong ~70% of time: Carnegie Mellon study

  • I've been in R&D forever, so at my level the question isn't "does the code work?"; we pretty much assume that will take care of itself, eventually. Our critical question is: "is the code trying to do something valuable, or not?" We make all kinds of stuff do what the requirements call for it to do, but so often those requirements are asking for worthless or even counterproductive things...

    Literally the opposite experience when I helped material scientists with their R&D. Breaking in production would mean people who get paid 2x more than me are suddenly unable to do their job. But then again, our requirements made sense because we would literally look at a manual process to automate with the engineers. What you describe sounds like hell to me. There are greener pastures.

  • Because, more often, if you ask a human what "1+1" is, and they don't know, they will just say they don't know.

    AI will confidently insist it's 3, and make up math algorithms to prove it.

    And every company is pushing AI out on everyone like it's always 10000% correct.

    It's also shown it's not intelligent. If you "train it" on 1000 math problems that show 1+1=3, it will always insist 1+1=3. It does not actually know how to add numbers, despite being a computer.

    Haha. Sure. Humans never make up bullshit to confidently sell a fake answer.

    Fucking ridiculous.

  • Literally the opposite experience when I helped material scientists with their R&D. Breaking in production would mean people who get paid 2x more than me are suddenly unable to do their job. But then again, our requirements made sense because we would literally look at a manual process to automate with the engineers. What you describe sounds like hell to me. There are greener pastures.

    Yeah, sometimes the requirements write themselves and in those cases successful execution is "on the critical path."

    Unfortunately, our requirements are filtered from our paying customers through an ever-rotating cast of Marketing and Sales characters who, nominally, are our direct customers, so we make product for them - but they rarely have any clear or consistent vision of what they want; they know they want new stuff, that's for sure.

  • Yeah, sometimes the requirements write themselves and in those cases successful execution is "on the critical path."

    Unfortunately, our requirements are filtered from our paying customers through an ever-rotating cast of Marketing and Sales characters who, nominally, are our direct customers, so we make product for them - but they rarely have any clear or consistent vision of what they want; they know they want new stuff, that's for sure.

    When requirements are "Whatever" then by all means use the "Whatever" machine: https://eev.ee/blog/2025/07/03/the-rise-of-whatever/

    And then look for a better gig because such an environment is going to be toxic to your skill set. The more exacting the shop, the better they pay.

  • I'd just like to point out that, from the perspective of somebody watching AI develop for the past 10 years, completing 30% of automated tasks successfully is pretty good! Ten years ago they could not do this at all. Overlooking all the other issues with AI, I think we are all irritated with the AI hype people for saying things like they can be right 100% of the time -- Amazon's new CEO actually said they would be able to achieve 100% accuracy this year, lmao. But being able to do 30% of tasks successfully is already useful.

    I think this comment made me finally understand the AI hate circlejerk on lemmy. If you have no clue how LLMs work and you have no idea where "AI" is coming from, it just looks like another crappy product that was thrown on the market half-ready. I guess you can only appreciate the absolutely incredible development of LLMs (and AI in general) that happened during the last ~5 years if you can actually see it in the first place.

  • I have been using AI to write (little, near trivial) programs. It's blindingly obvious that it could be feeding this code to a compiler and catching its mistakes before giving them to me, but it doesn't... yet.

    Agents do that loop pretty well now, and Claude now uses your IDE's LSP to help it code and catch errors in flow. I think Windsurf and Cursor do that too.

    The tooling has improved a ton in the last 3 months.
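
    The check-and-retry loop itself is simple to sketch. Here's a rough Python illustration, where generate() and fix() stand in for whatever LLM calls your agent makes (both are hypothetical names, not a real API):

    import os
    import pathlib
    import subprocess
    import tempfile

    def check_compiles(source: str) -> str | None:
        """Byte-compile the generated code; return the compiler's error text, or None if it's clean."""
        fd, name = tempfile.mkstemp(suffix=".py")
        os.close(fd)
        pathlib.Path(name).write_text(source)
        result = subprocess.run(["python", "-m", "py_compile", name],
                                capture_output=True, text=True)
        os.unlink(name)
        return result.stderr or None

    def generate_with_retries(prompt, generate, fix, max_attempts=3):
        """Ask the model for code, feed any compile error back to it, and retry a few times."""
        code = generate(prompt)             # hypothetical LLM call
        for _ in range(max_attempts):
            error = check_compiles(code)
            if error is None:
                return code                 # compiles cleanly, hand it over
            code = fix(code, error)         # hypothetical "here's the error, fix it" call
        return code                         # give up and return the last attempt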

  • When requirements are "Whatever" then by all means use the "Whatever" machine: https://eev.ee/blog/2025/07/03/the-rise-of-whatever/

    And then look for a better gig because such an environment is going to be toxic to your skill set. The more exacting the shop, the better they pay.

    The more exacting the shop, the better they pay.

    That hasn't been my experience, but it sounds like good advice anyway. My experience has been that the more profitable the parent company, the better the job security and the better the pay too. Once "in," tune in to the culture and align with the people at your level and above who seem like they'll be sticking around long term. If the company isn't financially secure, all bets are off and you should be seeking, and taking, a better offer when you can find one.

    I knocked around startups for 10/22 years (depending on how you characterize that one 12 year gig that ended with everybody laid off...) The pay was good enough, but job security just wasn't on the menu. Finally, one got bought by a big fish and I've been in the belly of the beast for 11 years now.

  • I think it's lemmy users. I see a lot more LLM skepticism here than in the news feeds.

    In my experience, LLMs are like the laziest, shittiest know-nothing bozo forced to complete a task with zero attention to detail and zero care about whether it's crap, just doing enough to sound convincing.

    😆 I can't believe how absolutely silly a lot of you sound with this.

    LLM is a tool. Its output is dependent on the input. If that's the quality of answer you're getting, then it's a user error. I guarantee you that LLM answers for many problems are definitely adequate.

    It's like if a carpenter said the cabinets turned out shit because his hammer only produces crap.

    Also, another person commented that seeing the pattern you also see means we're psychotic.

    All I'm trying to suggest is that Lemmy is getting seriously manipulated by the media attitude towards LLMs, and I feel these comments really highlight that.

  • LLMs are like a multitool, they can do lots of easy things mostly fine as long as it is not complicated and doesn't need to be exactly right. But they are being promoted as a whole toolkit as if they are able to be used to do the same work as effectively as a hammer, power drill, table saw, vise, and wrench.

    It is truly terrible marketing. It's been obvious to me for years the value is in giving it to people and enabling them to do more with less, not outright replacing humans, especially not expert humans.

    I use AI/LLMs pretty much every day now. I write MCP servers and automate things with them, and it's mind-blowing how productive they make me.

    Just today I used these tools in a highly supervised way to complete a task that would have been a full day of tedious work, all done in an hour. That is fucking fantastic; it means I get to spend that time on more important things.

    It's like giving an accountant excel. Excel isn't replacing them, but it's taking care of specific tasks so they can focus on better things.

    On the reliability and accuracy front there is still a lot to be desired, sure. But for supervised chats where it's calling my tools it's pretty damn good.
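
    For anyone curious what "writing an MCP server" looks like, here's a minimal sketch using the FastMCP helper from the official MCP Python SDK (the tool name and body are made up, and the exact API may differ between SDK versions):

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("demo-tools")

    @mcp.tool()
    def count_lines(path: str) -> int:
        """Count the lines in a local text file so the model can report on it."""
        with open(path, encoding="utf-8") as f:
            return sum(1 for _ in f)

    if __name__ == "__main__":
        mcp.run()  # serves the tool over stdio for a compatible client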

  • than reading an actual intro on an unfamiliar topic

    The LLM helps me know what to look for in order to find that unfamiliar topic.

    For example, I was tasked to support a file format that's common in a very niche field and never used elsewhere, and unfortunately shares an extension with a very common file format, so searching for useful data was nearly impossible. So I asked the LLM for details about the format and applications of it, provided what I knew, and it spat out a bunch of keywords that I then used to look up more accurate information about that file format. I only trusted the LLM output to the extent of finding related, industry-specific terms to search up better information.

    Likewise, when looking for libraries for a coding project, none really stood out, so I asked the LLM to compare the popular libraries for solving a given problem. The LLM spat out a bunch of details that were easy to verify (and some were inaccurate), which helped me narrow what I looked for in that library, and the end result was that my search was done in like 30 min (about 5 min dealing w/ LLM, and 25 min checking the projects and reading a couple blog posts comparing some of the libraries the LLM referred to).

    I think this use case is a fantastic use of LLMs, since they're really good at generating text related to a query.

    It’s going to say something plausible, and you tautologically are not in a position to verify it.

    I absolutely am though. If I am merely having trouble recalling a specific fact, asking the LLM to generate it is pretty reasonable. There are a ton of cases where I'll know the right answer when I see it, like it's on the tip of my tongue but I'm having trouble materializing it. The LLM might spit out two wrong answers along w/ the right one, but it's easy to recognize which is the right one.

    I'm not going to ask it facts that I know I don't know (e.g. some historical figure's birth or death date), that's just asking for trouble. But I'll ask it facts that I know that I know, I'm just having trouble recalling.

    The right use of LLMs, IMO, is to generate text related to a topic to help facilitate research. It's not great at doing the research though, but it is good at helping to formulate better search terms or generate some text to start from for whatever task.

    general search on the web?

    I agree, it's not great for general search. It's great for turning a nebulous question into better search terms.

    It's a bit frustrating that finding these tools useful is so often met with "it can't be useful for that," when it definitely is.

    More than any other tool in history, LLMs have a huge dose of luck involved and a learning curve for how to ask the right things the right way. And those methods change and differ between models too.

  • I agree, it's not great for general search. It's great for turning a nebulous question into better search terms.

    One word of caution with AI search is that it's weirdly vulnerable to SEO.

    If you search for "best X for Y" and a company has an article on their blog about how their product solves a problem, the AI can definitely summarize that into a "users don't like that foolib because of ...". At least that's been my experience looking for software vendors.

  • I tried to dictate some documents recently without paying the big bucks for specialized software, and was surprised just how bad Google and Microsoft's speech recognition still is. Then I tried getting Word to transcribe some audio talks I had recorded, and that resulted in unreadable stuff with punctuation in all the wrong places. You could just about make out what it meant to say, so I tried asking various LLMs to tidy it up. That resulted in readable stuff that was largely made up and wrong, which also left out large chunks of the source material. In the end I just had to transcribe it all by hand.

    It surprised me that these AI-ish products are still unable to transcribe speech coherently or tidy up a messy document without changing the meaning.

    I don't know of basic solutions that are super good, but Whisper and the Whisper derivatives are, I hear, decent for dictation these days.

    I have no idea how to run them though.
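
    For what it's worth, the open-source Whisper models are fairly easy to run locally. A minimal sketch with the openai-whisper Python package (assumes pip install openai-whisper, ffmpeg on the PATH, and "talk.mp3" as a placeholder file name):

    import whisper

    # Smaller models ("base", "small") are faster; larger ones ("medium", "large") are more accurate.
    model = whisper.load_model("small")
    result = model.transcribe("talk.mp3")
    print(result["text"])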

    It's a bit frustrating that finding these tools useful is so often met with "it can't be useful for that," when it definitely is.

    More than any other tool in history, LLMs have a huge dose of luck involved and a learning curve for how to ask the right things the right way. And those methods change and differ between models too.

    And that's the same with traditional search engines; the difference is that we're used to search engines and LLMs are new. Learn how to use the tool and decide for yourself when it's useful.

    One word of caution with AI search is that it's weirdly vulnerable to SEO.

    If you search for "best X for Y" and a company has an article on their blog about how their product solves a problem, the AI can definitely summarize that into a "users don't like that foolib because of ...". At least that's been my experience looking for software vendors.

    Oh sure, caution is always warranted w/ LLMs. But when it works, it can save a ton of time.

  • Were you prone to these weird leaps of logic before your brain was fried by talking to LLMs, or did you start being a fan of talking to LLMs because your ability to logic was...well...that?

    You see, I wanted to be petty and do another dismissive reply, but instead I fed our convo to copilot and asked it to explain. Here you go. As you can see, I have previously used it for coding tasks, so I didn't feed it any extra info. So there you go: even copilot can understand the huge "leap" I made in logic. Goddamn, the sweet taste of irony.

    Copilot reply:

    Certainly! Here’s an explanation Person B could consider:

    The implied logic in Person A’s argument is that if you distrust code written by Copilot (or any AI tool) simply because it wasn’t written by you, then by the same reasoning, you should also distrust code written by junior developers, since that code also isn’t written by you and may have mistakes or lack experience.

    However, in real-world software development, teams regularly review, test, and maintain code written by others—including juniors, seniors, and even AI tools. The quality of code depends on review processes, testing, and collaboration, not just on who wrote it. Dismissing Copilot-generated code outright is similar to dismissing the contributions of junior developers, which isn’t practical or productive in a collaborative environment.

  • 😆 I can't believe how absolutely silly a lot of you sound with this.

    LLM is a tool. Its output is dependent on the input. If that's the quality of answer you're getting, then it's a user error. I guarantee you that LLM answers for many problems are definitely adequate.

    It's like if a carpenter said the cabinets turned out shit because his hammer only produces crap.

    Also, another person commented that seeing the pattern you also see means we're psychotic.

    All I'm trying to suggest is that Lemmy is getting seriously manipulated by the media attitude towards LLMs, and I feel these comments really highlight that.

    If that’s the quality of answer you’re getting, then it’s a user error

    No, I know the data I gave it and I know how hard I tried to get it to use it truthfully.

    You have an irrational and wildly inaccurate belief in the infallibility of LLMs.

    You're also denying the evidence of my own experience. What on earth made you think I would believe you over what I saw with my own eyes?

  • and? we can understand 256 where AI can't, that's the point.

    The 256 thing was written by a person. AI doesn't have exclusive rights to being dumb, plenty of dumb people around.

  • So the chances of it being right ten times in a row are 2%.
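
    Presumably the arithmetic behind that figure, assuming a roughly 70% per-task success rate and independent attempts: 0.7^10 ≈ 0.028, i.e. about a 2-3% chance of ten successes in a row.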

  • For me as a software developer the accuracy is more in the 95%+ range.

    On one hand, the built-in copilot chat widget in IntelliJ basically replaces a lot of my google queries.

    On the other hand, it is rather fucking good at executing some rewrites that are a fucking chore to do manually, but can easily be done by copilot.

    Imagine you have a script that initializes your DB with some test data. You have an INSERT INTO statement with lots of columns and rows, so:

    INSERT INTO some_table (column_1, ..., column_n)
    VALUES (row_1),
           (row_2),
           (row_n);

    Adding a new column with test data for each row is a PITA, but copilot handles it without issue.

    Similarly, when writing unit tests you do a lot of edge-case testing, which is a bunch of almost-identical tests with maybe one variable changing. At most you write one of those tests, then copilot will auto-generate the rest after you name the next unit test; it's pretty good at guessing what you want to do in that test, at least with my naming scheme.
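
    As a made-up illustration of the kind of near-identical edge-case tests described above (parse_price and the values are hypothetical), this is exactly the pattern an autocomplete model is good at continuing:

    import pytest
    from pricing import parse_price  # hypothetical module and function under test

    def test_parse_price_plain():
        assert parse_price("19.99") == 19.99

    def test_parse_price_with_currency_symbol():
        assert parse_price("$19.99") == 19.99

    def test_parse_price_with_thousands_separator():
        assert parse_price("1,019.99") == 1019.99

    def test_parse_price_empty_string_raises():
        with pytest.raises(ValueError):
            parse_price("")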

    So yeah, it's way overrated for many-many things, but for programming it's a pretty awesome productivity tool.

    For your database test data, I usually write a helper that defaults those columns to base values, so I can pass in lists of dictionaries, then the test cases are easier to modify and read.

    It's also nice because you're only including the fields you use in your unit test, the rest are default valid you don't need to care about.
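
    That helper pattern is easy to sketch; something like this (table and column names are made up, and the SQL placeholder style depends on your DB driver), so each test only spells out the fields it cares about:

    DEFAULT_USER = {"name": "Test User", "email": "test@example.com", "active": True}

    def make_users(overrides: list[dict]) -> list[dict]:
        """Merge each override dict onto the defaults."""
        return [{**DEFAULT_USER, **o} for o in overrides]

    def insert_users(cursor, rows: list[dict]) -> None:
        """Build one parameterized INSERT for the merged rows (columns taken from the defaults)."""
        columns = list(DEFAULT_USER)
        placeholders = ", ".join(["%s"] * len(columns))
        sql = f"INSERT INTO users ({', '.join(columns)}) VALUES ({placeholders})"
        cursor.executemany(sql, [tuple(row[c] for c in columns) for row in rows])

    # Usage in a test: only the interesting fields are stated explicitly.
    rows = make_users([{"active": False}, {"email": "dupe@example.com"}])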

  • Oh sure, caution is always warranted w/ LLMs. But when it works, it can save a ton of time.

    Definitely, I'm just trying to share a foot gun I've accidentally triggered myself!
