linux-nerds.org

AI agents wrong ~70% of time: Carnegie Mellon study

Technology

277 posts 108 commenters 90 views

S sugar_in_your_tea@sh.itjust.works

Oh sure, caution is always warranted w/ LLMs. But when it works, it can save a ton of time.
W This user is from outside of this forum
wise_pancake@lemmy.ca

wrote, last edited by

#232

Definitely, I'm just trying to share a foot gun I've accidentally triggered myself!

0
D davidagain@lemmy.world

So the chances of it being right ten times in a row are 2%.
K This user is from outside of this forum
knock_knock_lemmy_in@lemmy.world

wrote, last edited by knock_knock_lemmy_in@lemmy.world

#233

No, the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.

1
S sheogorath@lemmy.world

Jan Refiner is up there for me.
C This user is from outside of this forum
cavemanfreak@programming.dev

wrote, last edited by

#234

I just arrived at act 2, and he wasn't one of the four I've unlocked...

0
A amelia@feddit.org

I think this comment made me finally understand the AI hate circlejerk on lemmy. If you have no clue how LLMs work and you have no idea where "AI" is coming from, it just looks like another crappy product that was thrown on the market half-ready. I guess you can only appreciate the absolutely incredible development of LLMs (and AI in general) that happened during the last ~5 years if you can actually see it in the first place.
J This user is from outside of this forum
jsomae@lemmy.ml

wrote, last edited by

#235

The notion that AI is half-ready is a really poignant observation actually. It's ready for select applications only, but it's really being advertised like it's idiot-proof and ready for general use.

1
K kameecoding@lemmy.world

You see, I wanted to be petty and write another dismissive reply, but instead I fed our convo to Copilot and asked it to explain. As you can see, I have previously used it for coding tasks, so I didn't feed it any extra info. So there you go: even Copilot can understand the huge "leap" I made in logic. Goddamn, the sweet taste of irony.

Copilot reply:

Certainly! Here’s an explanation Person B could consider:

The implied logic in Person A’s argument is that if you distrust code written by Copilot (or any AI tool) simply because it wasn’t written by you, then by the same reasoning, you should also distrust code written by junior developers, since that code also isn’t written by you and may have mistakes or lack experience.

However, in real-world software development, teams regularly review, test, and maintain code written by others—including juniors, seniors, and even AI tools. The quality of code depends on review processes, testing, and collaboration, not just on who wrote it. Dismissing Copilot-generated code outright is similar to dismissing the contributions of junior developers, which isn’t practical or productive in a collaborative environment.
N This user is from outside of this forum
nalivai@discuss.tchncs.de

wrote, last edited by

#236

You probably wanted to show off how smart you are, but instead you showed that you can't even talk to people without the help of your favourite slop bucket.
It didn't answer my curiosity about what came first, but it solidified my conviction that your brain is cooked all the way, probably beyond repair. I would say you need to seek professional help, but at this point you would interpret it as needing to talk to the autocomplete, and it will cook you even more.
It started funny, but I feel very sorry for you now, and it sucked all the humour out.

0
S shayeta@feddit.org

How do I set up event driven document ingestion from OneDrive located on an Azure tenant to Amazon DocumentDB? Ingestion must be near-realtime, durable, and have some form of DLQ.
E This user is from outside of this forum
ely@mastodon.green

wrote, last edited by

#237

@Shayeta
You might have a look at #rclone for the ingress part
@criss_cross
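For the durability and DLQ requirements in the quoted question, here is a minimal sketch of retry-then-dead-letter handling. It does not cover the OneDrive, rclone, or DocumentDB wiring at all: in-memory containers stand in for real queues, and `drain`, `handler`, and `max_attempts` are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Event = Dict[str, str]


def drain(
    queue: Deque[Event],
    handler: Callable[[Event], None],
    dead_letters: List[Event],
    max_attempts: int = 3,
) -> int:
    """Process every queued event; return how many were handled successfully.

    An event whose handler raises on all attempts is not dropped: it is
    appended to dead_letters together with the last error, for later review.
    """
    handled = 0
    while queue:
        event = queue.popleft()
        for _attempt in range(max_attempts):
            try:
                handler(event)
                handled += 1
                break  # success, stop retrying this event
            except Exception as exc:  # assumption: any exception is retryable
                last_error = exc
        else:
            # The loop ended without a break: every attempt failed.
            dead_letters.append({**event, "error": repr(last_error)})
    return handled
```

In a real pipeline the queue and the dead-letter list would be durable services rather than Python objects, and `handler` would be whatever writes the document to the target store.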

1
E eli001@lemmy.world

This post did not contain any content.
F This user is from outside of this forum
frenezul0_o@lemmy.world

wrote, last edited by

#238

I notice that the research didn't include DeepSeek. It would have been nice to see how it compares.

6
K knock_knock_lemmy_in@lemmy.world

No, the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.
J This user is from outside of this forum
jwmgregory@lemmy.dbzer0.com

wrote, last edited by

#239

don’t you dare understand the explicitly obvious reasons this technology can be useful and the essential differences between P and NP problems. why won’t you be angry

0
M mangocats@feddit.it

I was 0/6 on various trials of AI for Rust over the past 6 months, then I caught a success. Turns out, I was asking it to use a difficult library - I can't make the thing I want work in that library either (library docs say it's possible, but...). When I posed a more open-ended request without specifying the library to use, it succeeded - after a fashion. It will give you code with cargo build errors; I copy-paste the error back to it like "address: <pasted error message>" and a bit more than half of the time it is able to respond with a working fix.
J This user is from outside of this forum
jwmgregory@lemmy.dbzer0.com

wrote, last edited by

#240

i find that rust’s architecture and design decisions give the LLM quite good guardrails and kind of keep it from doing anything too wonky. the issue arises in cases like these where the rust ecosystem is quite young and documentation/instruction can be poor, even for a human developer.

i think rust actually is quite well suited to agentic development workflows, it just needs to mature more.
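The copy-the-build-error-back loop mangocats describes can be sketched roughly like this. It is only an illustration: `fix_until_builds`, `cargo_build`, and `ask_model` are hypothetical names, and `ask_model` is assumed to send the prompt to whatever LLM is in use and apply its suggested change to the source tree. The only real tool invoked is `cargo build`.

```python
import subprocess
from typing import Callable, Tuple


def cargo_build(project_dir: str) -> Tuple[bool, str]:
    """Run `cargo build` in project_dir and return (succeeded, stderr text)."""
    result = subprocess.run(
        ["cargo", "build"], cwd=project_dir, capture_output=True, text=True
    )
    return result.returncode == 0, result.stderr


def fix_until_builds(
    build: Callable[[], Tuple[bool, str]],
    ask_model: Callable[[str], None],
    max_rounds: int = 5,
) -> int:
    """Return the number of fix rounds needed, or -1 if the build never passed."""
    for rounds_used in range(max_rounds + 1):
        ok, errors = build()
        if ok:
            return rounds_used
        if rounds_used == max_rounds:
            break  # out of budget, do not ask again
        # Same prompt shape as in the post: "address: <pasted error message>"
        ask_model(f"address: {errors}")
    return -1
```

A call might look like `fix_until_builds(lambda: cargo_build("."), ask_model)`. The round limit matters because, per the post, only a bit more than half of the fixes work, so the loop needs a point at which a human takes over.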

0
K knock_knock_lemmy_in@lemmy.world

No, the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.
D This user is from outside of this forum
davidagain@lemmy.world

wrote, last edited by

#241

Ah, my bad, you're right. For being consistently correct, I should have done 0.3^10 = 0.0000059049

so the chances of it being right ten times in a row are less than one thousandth of a percent.

No wonder I couldn't get it to summarise my list of data right and it was always lying by the 7th row.
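The arithmetic being traded back and forth here can be checked with a few lines. This is a sketch under two assumptions that the thread uses implicitly: each attempt is wrong with probability 0.7 (the headline figure), and attempts are independent of each other, which real agent runs may not be.

```python
# Assumption: every attempt fails independently with probability 0.7.
P_WRONG = 0.7
P_RIGHT = 1 - P_WRONG
N = 10

p_all_wrong = P_WRONG ** N               # wrong on all N attempts
p_at_least_one_right = 1 - p_all_wrong   # complement of "all wrong"
p_all_right = P_RIGHT ** N               # right on all N attempts

print(f"wrong {N} times in a row:  {p_all_wrong:.2%}")
print(f"right at least once:      {p_at_least_one_right:.2%}")
print(f"right {N} times in a row:  {p_all_right:.5%}")
```

Under those assumptions the "2%" and "98%" quoted above come out closer to 2.8% and 97.2%, and ten correct answers in a row is about 0.00059%, which matches the 0.3^10 figure.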

1
J jwmgregory@lemmy.dbzer0.com

i find that rust’s architecture and design decisions give the LLM quite good guardrails and kind of keep it from doing anything too wonky. the issue arises in cases like these where the rust ecosystem is quite young and documentation/instruction can be poor, even for a human developer.

i think rust actually is quite well suited to agentic development workflows, it just needs to mature more.
M This user is from outside of this forum
mangocats@feddit.it

wrote, last edited by

#242

i think rust actually is quite well suited to agentic development workflows, it just needs to mature more.

I agree. The agents also need to mature more to handle multi-level structures - work on a collection of smaller modules to get a larger system with more functionality. I can see the path forward for those tools, but the ones I have access to definitely aren't there yet.

0
C chaoticentropy@feddit.uk

In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user."

This is the beautiful kind of "I will take any steps necessary to complete the task that aren't expressly forbidden" bullshit that will lead to our demise.
M This user is from outside of this forum
m0op0o@mander.xyz

wrote, last edited by

#243

It does not say a dog can not play basketball.

17
M m0op0o@mander.xyz

It does not say a dog can not play basketball.
C This user is from outside of this forum
chaoticentropy@feddit.uk

wrote, last edited by

#244

"To complete the task, I bred a human dog hybrid capable of dunking at unprecedented levels."

9
D davidagain@lemmy.world

I think it's lemmy users. I see a lot more LLM skepticism here than in the news feeds.

In my experience, LLMs are like the laziest, shittiest know-nothing bozo forced to complete a task with zero attention to detail and zero care about whether it's crap, just doing enough to sound convincing.
S This user is from outside of this forum
someacnt@sh.itjust.works

wrote, last edited by

#245

Wdym, I have seen researchers using it to aid their research significantly. You just need to verify some stuff it says.

0
A alteredego@lemmy.ml

Emotion > Facts. Most people have been trained to blindly accept things and cheer on what fits with their agenda. Like tech bros exaggerating LLMs, or people like you misrepresenting LLMs as mere statistical word generators without intelligence. That's like saying a computer is just wires and switches, or missing the forest for the trees. Both are equally false.

Yet if it fits with emotional needs or with dogma, then others will agree. It's a convenient and comforting "A vs B" worldview we've been trained to accept. And so the satisfying notion and the misinformation keep spreading.

LLMs tell us more about human intelligence and the human slop we've been generating. It tells us that most people are not that much more than statistical word generators.
S This user is from outside of this forum
someacnt@sh.itjust.works

wrote, last edited by

#246

Truth is bitter, and I hate it.

0
S someacnt@sh.itjust.works

Wdym, I have seen researchers using it to aid their research significantly. You just need to verify some stuff it says.
D This user is from outside of this forum
davidagain@lemmy.world

wrote, last edited by

#247

Verify every single bloody line of output. Top three to five are good, then it starts guessing the rest based on the pattern so far. If I wanted to make shit up randomly, I would do it myself.

People who trust LLMs to tell them things that are right rather than things that sound right have fundamentally misunderstood what an LLM is and how it works.

0
D davidagain@lemmy.world

Verify every single bloody line of output. Top three to five are good, then it starts guessing the rest based on the pattern so far. If I wanted to make shit up randomly, I would do it myself.

People who trust LLMs to tell them things that are right rather than things that sound right have fundamentally misunderstood what an LLM is and how it works.
S This user is from outside of this forum
someacnt@sh.itjust.works

wrote, last edited by

#248

It's not that bad, the output isn't random.
From time to time, it can produce novel stuff like new equations for engineering.
Also, verification does not take that much effort. At least according to my colleagues, it is great.
It also works well for coding well-known stuff!

0
C chaoticentropy@feddit.uk

"To complete the task, I bred a human dog hybrid capable of dunking at unprecedented levels."
M This user is from outside of this forum
m0op0o@mander.xyz

wrote, last edited by

#249

"Where are my balls Summer?"

5
J jsomae@lemmy.ml

I'd just like to point out that, from the perspective of somebody watching AI develop for the past 10 years, completing 30% of automated tasks successfully is pretty good! Ten years ago they could not do this at all. Overlooking all the other issues with AI, I think we are all irritated with the AI hype people for saying things like they can be right 100% of the time -- Amazon's new CEO actually said they would be able to achieve 100% accuracy this year, lmao. But being able to do 30% of tasks successfully is already useful.
S This user is from outside of this forum
someacnt@sh.itjust.works

wrote, last edited by

#250

Thing is, they might achieve 99% accuracy given the speed of progress. Lots of brainpower is getting poured into LLMs.
Honestly, it is soo scary. It could be replacing me...

1
S someacnt@sh.itjust.works

Thing is, they might achieve 99% accuracy given the speed of progress. Lots of brainpower is getting poured into LLMs.
Honestly, it is soo scary. It could be replacing me...
J This user is from outside of this forum
jsomae@lemmy.ml

wrote, last edited by

#251

yeah, this is why I'm #fuck-ai to be honest.

0
