linux-nerds.org

AI agents wrong ~70% of time: Carnegie Mellon study

Technology

277 posts, 108 commenters, 85 views

vanilla_puddinfudge@infosec.pub wrote:

America: "Good enough to handle 911 calls!"

decq@lemmy.world wrote (#266), in reply:

Is there really a plan to use this for 911 services?

davidagain@lemmy.world wrote:

Wow. 30% accuracy was the high score!
From the article:

Testing agents at the office

For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

The CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

Gemini-2.5-Pro (30.3 percent)
Claude-3.7-Sonnet (26.3 percent)
Claude-3.5-Sonnet (24 percent)
Gemini-2.0-Flash (11.4 percent)
GPT-4o (8.6 percent)
o3-mini (4.0 percent)
Gemini-1.5-Pro (3.4 percent)
Amazon-Nova-Pro-v1 (1.7 percent)
Llama-3.1-405b (7.4 percent)
Llama-3.3-70b (6.9 percent)
Qwen-2.5-72b (5.7 percent)
Llama-3.1-70b (1.7 percent)
Qwen-2-72b (1.1 percent)

"We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper.

upgrayedd1776@sh.itjust.works wrote (#267), in reply:

sounds like the fault of the researchers not to build better tests or understand the limits of the software to use it right

rekorse@sh.itjust.works wrote (#268), replying to #267:

Are you arguing they should have built a test that makes AI perform better? How are you offended on behalf of AI?

timeworntraveler@lemmy.dbzer0.com wrote:

you're right, the dumb of AI is completely comparable to the dumb of human, there's no difference worth talking about, sorry i even spoke the fuck up

tja@programming.dev wrote (#269), in reply:

No worries.

eli001@lemmy.world wrote:

This post did not contain any content.

sircac@lemmy.world wrote (#270), in reply:

Why would they be right beyond word sequence frequencies?

melvin_ferd@lemmy.world wrote:

There's a sleep button on my laptop. Doesn't mean I would use it.

I'm just trying to say you're saying the feature that everyone kind of knows doesn't work. ChatGPT is not trained to do calculations well.

I just like technology and I think and fully believe the left hatred of it is not logical. I believe it stems from a lot of media be and headlines. Why there's this push from media is a question I would like to know more about. But overall, I see a lot of the same makers of bullshit yellow journalism for this stuff on the left as I do for similar bullshit on the right wing spaces towards other things.

davidagain@lemmy.world wrote (#271), in reply:

Again with dismissing the evidence of my own eyes!

I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.

knock_knock_lemmy_in@lemmy.world wrote:

That looks better. Even with a fair coin, 10 heads in a row is almost impossible.

And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.

davidagain@lemmy.world wrote (#272), in reply:

Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.
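
For reference, 0.0000059049 is exactly 0.3 raised to the tenth power: the chance of ten successes in a row if each attempt independently succeeds 30 percent of the time, presumably taken from the roughly 30 percent top score in the CMU list. A minimal sketch of that arithmetic, with the fair-coin case from the quoted post for comparison; treating the attempts as independent and using 0.3 as the per-attempt success rate are assumptions made here for illustration, not figures the study reports in this form.

```python
def all_succeed(p: float, n: int) -> float:
    """Probability that n independent attempts all succeed,
    each with success probability p (independence is an assumption)."""
    return p ** n


# Assumed per-attempt success rate of 0.3, ten attempts in a row.
ai_streak = all_succeed(0.3, 10)

# Fair coin: ten heads in a row.
coin_streak = all_succeed(0.5, 10)

print(f"{ai_streak:.10f}")    # 0.0000059049
print(f"{coin_streak:.10f}")  # 0.0009765625
```

So ten heads in a row with a fair coin comes out at about 1 in 1,024, while ten straight successes at a 30 percent rate is about 1 in 169,000.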

melvin_ferd@lemmy.world wrote (#273, edited), replying to #271:

What does "I give it data to put in a formulaic sentence" mean here?

Why not just share the details? I often find a lot of people saying it's doing crazy things and never like to share the details. It's very similar to discussing things with Trump supporters who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that trumpets do. It's wild how similar it is.

And yes, asking it to do things like iterate over rows isn't how it works. It's getting better but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence but it's nowhere near what its strength is.

knock_knock_lemmy_in@lemmy.world wrote (#274), replying to #272:

Dunno. Ask 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

davidagain@lemmy.world wrote (#275), replying to #273:

I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

The solution to the problem was to realise that an LLM cannot be trusted for accuracy. Even if the first few results are completely accurate, the bullshit will creep in. Don't trust the LLM. Check every fucking thing.

In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.
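
A quick script of the kind described above, one that breaks each input row up on tab characters and slots the fields into a fixed sentence, could look roughly like the sketch below. The column names, the sentence template, and the sample rows are all hypothetical, since the real data was not shared.

```python
# Hypothetical sentence template; the real one was not shared in the thread.
TEMPLATE = "{name} ordered {quantity} units of {product}."


def rows_to_sentences(tsv_text: str) -> list[str]:
    """Split each non-empty line on tab characters and fill the fixed template."""
    sentences = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Exactly three fields are assumed; a malformed row raises ValueError
        # instead of silently producing a wrong sentence.
        name, quantity, product = line.split("\t")
        sentences.append(
            TEMPLATE.format(name=name, quantity=quantity, product=product)
        )
    return sentences


# Made-up sample rows, for illustration only.
sample = "Alice\t3\twidgets\nBob\t12\tgears\n"
for sentence in rows_to_sentences(sample):
    print(sentence)
```

Because the output is a pure function of the input, every row gets the same treatment and a bad row fails loudly, which is the property the LLM run lacked.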

davidagain@lemmy.world wrote (#276, edited), replying to #274:

You're better off asking one human to do the same task ten times. Humans get better and faster at things as they go along. Always slower than an LLM, but LLMs get more and more likely to veer off on some flight of fancy, further and further from reality, the more they say to you. The chances of one staying factual in the long term are really low.

It's a born bullshitter. It knows a little about a lot, but it has no clue what's real and what's made up, or it doesn't care.

If you want some text quickly, that sounds right, but you genuinely don't care whether it is right at all, go for it, use an LLM. It'll be great at that.

vane@lemmy.world wrote (#277), replying to eli001@lemmy.world:

Reading with CEO mindset. 3 out of 10 employees can be fired.
