linux-nerds.org

Your browser does not seem to support JavaScript. As a result, your viewing experience will be diminished, and you have been placed in read-only mode.

Please download a browser that supports JavaScript, or enable it if it's disabled (i.e. NoScript).

AI agents wrong ~70% of time: Carnegie Mellon study

Technology

278 Beiträge 108 Kommentatoren 123 Aufrufe

E eli001@lemmy.world

This post did not contain any content.
D This user is from outside of this forum
D This user is from outside of this forum
davidagain@lemmy.world

schrieb zuletzt editiert von davidagain@lemmy.world

#259

Wow. 30% accuracy was the high score!
From the article:

Testing agents at the office

For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

the CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

Gemini-2.5-Pro (30.3 percent)
Claude-3.7-Sonnet (26.3 percent)
Claude-3.5-Sonnet (24 percent)
Gemini-2.0-Flash (11.4 percent)
GPT-4o (8.6 percent)
o3-mini (4.0 percent)
Gemini-1.5-Pro (3.4 percent)
Amazon-Nova-Pro-v1 (1.7 percent)
Llama-3.1-405b (7.4 percent)
Llama-3.3-70b (6.9 percent),
Qwen-2.5-72b (5.7 percent),
Llama-3.1-70b (1.7 percent)
Qwen-2-72b (1.1 percent).

"We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper
U 1 Antwort Letzte Antwort

6
D davidagain@lemmy.world

Ah, my bad, you're right, for being consistently correct, I should have done 0.3^10=0.0000059049

so the chances of it being right ten times in a row are less than one thousandth of a percent.

No wonder I couldn't get it to summarise my list of data right and it was always lying by the 7th row.
K This user is from outside of this forum
K This user is from outside of this forum
knock_knock_lemmy_in@lemmy.world

schrieb zuletzt editiert von

#260

That looks better. Even with a fair coin, 10 heads in a row is almost impossible.

And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.
D 1 Antwort Letzte Antwort

0
K kameecoding@lemmy.world

You just can't talk to people, period, you are just a dick, you were also just proven to be stupider than a fucking LLM, have a nice day
N This user is from outside of this forum
N This user is from outside of this forum
nalivai@discuss.tchncs.de

schrieb zuletzt editiert von

#261

Did the autocomplete told you to answer this? Don't answer, actually, save some energy.
1 Antwort Letzte Antwort

0
E eli001@lemmy.world

This post did not contain any content.
I This user is from outside of this forum
I This user is from outside of this forum
iopq@lemmy.world

schrieb zuletzt editiert von

#262

Now I'm curious, what's the average score for humans?
1 Antwort Letzte Antwort

2
T tja@programming.dev

The 256 thing was written by a person. AI doesn't have exclusive rights to being dumb, plenty of dumb people around.
T This user is from outside of this forum
T This user is from outside of this forum
timeworntraveler@lemmy.dbzer0.com

schrieb zuletzt editiert von

#263

you're right, the dumb of AI is completely comparable to the dumb of human, there's no difference worth talking about, sorry i even spoke the fuck up
T 1 Antwort Letzte Antwort

0
E eli001@lemmy.world

This post did not contain any content.
G This user is from outside of this forum
G This user is from outside of this forum
gargle@lemmy.world

schrieb zuletzt editiert von

#264

I asked Claude 3.5 Haiku to write me a quine in COBOL in the bs2000 dialect. Claude does now that creating a perfect quine in COBOL is challenging due to the need to represent the self-referential nature of the code. After a few suggestions Claude restated its first draft, without proper BS2000 incantations, without a perform statement, and without any self-referential redefines. It's a lot of work. I stopped caring and moved on.

For those who wonder: https://sourceforge.net/p/gnucobol/discussion/lounge/thread/495d8008/ has an example.

Colour me unimpressed. I dread the day when they force the use of 'AI' on us at work.
1 Antwort Letzte Antwort

2
D davidagain@lemmy.world

Why are you giving it data

Because there's a button for that.

It’s output is dependent on the input

This thing that you said... It's false.
M This user is from outside of this forum
M This user is from outside of this forum
melvin_ferd@lemmy.world

schrieb zuletzt editiert von melvin_ferd@lemmy.world

#265

There's a sleep button on my laptop. Doesn't mean I would use it.

I'm just trying to say you're saying the feature that everyone kind of knows doesn't work. Chatgpt is not trained to do calculations well.

I just like technology and I think and fully believe the left hatred of it is not logical. I believe it stems from a lot of media be and headlines. Why there's this push From media is a question I would like to know more. But overall, I see a lot of the same makers of bullshit yellow journalism for this stuff on the left as I do for similar bullshit on the right wing spaces towards other things.
D 1 Antwort Letzte Antwort

0
V vanilla_puddinfudge@infosec.pub

America: "Good enough to handle 911 calls!"
D This user is from outside of this forum
D This user is from outside of this forum
decq@lemmy.world

schrieb zuletzt editiert von

#266

Is there really a plan to use this for 911 services??
1 Antwort Letzte Antwort

0
D davidagain@lemmy.world

Wow. 30% accuracy was the high score!
From the article:

Testing agents at the office

For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

the CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

Gemini-2.5-Pro (30.3 percent)
Claude-3.7-Sonnet (26.3 percent)
Claude-3.5-Sonnet (24 percent)
Gemini-2.0-Flash (11.4 percent)
GPT-4o (8.6 percent)
o3-mini (4.0 percent)
Gemini-1.5-Pro (3.4 percent)
Amazon-Nova-Pro-v1 (1.7 percent)
Llama-3.1-405b (7.4 percent)
Llama-3.3-70b (6.9 percent),
Qwen-2.5-72b (5.7 percent),
Llama-3.1-70b (1.7 percent)
Qwen-2-72b (1.1 percent).

"We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper
U This user is from outside of this forum
U This user is from outside of this forum
upgrayedd1776@sh.itjust.works

schrieb zuletzt editiert von

#267

sounds like the fault of the researchers not to build better tests or understand the limits of the software to use it right
R 1 Antwort Letzte Antwort

1
U upgrayedd1776@sh.itjust.works

sounds like the fault of the researchers not to build better tests or understand the limits of the software to use it right
R This user is from outside of this forum
R This user is from outside of this forum
rekorse@sh.itjust.works

schrieb zuletzt editiert von

#268

Are you arguing they should have built a test that makes AI perform better? How are you offended on behalf of AI?
1 Antwort Letzte Antwort

2
T timeworntraveler@lemmy.dbzer0.com

you're right, the dumb of AI is completely comparable to the dumb of human, there's no difference worth talking about, sorry i even spoke the fuck up
T This user is from outside of this forum
T This user is from outside of this forum
tja@programming.dev

schrieb zuletzt editiert von

#269

No worries.
1 Antwort Letzte Antwort

0
E eli001@lemmy.world

This post did not contain any content.
S This user is from outside of this forum
S This user is from outside of this forum
sircac@lemmy.world

schrieb zuletzt editiert von

#270

Why would they be right beyond word sequence frecuencies?
1 Antwort Letzte Antwort

0
M melvin_ferd@lemmy.world

There's a sleep button on my laptop. Doesn't mean I would use it.

I'm just trying to say you're saying the feature that everyone kind of knows doesn't work. Chatgpt is not trained to do calculations well.

I just like technology and I think and fully believe the left hatred of it is not logical. I believe it stems from a lot of media be and headlines. Why there's this push From media is a question I would like to know more. But overall, I see a lot of the same makers of bullshit yellow journalism for this stuff on the left as I do for similar bullshit on the right wing spaces towards other things.
D This user is from outside of this forum
D This user is from outside of this forum
davidagain@lemmy.world

schrieb zuletzt editiert von

#271

Again with dismissing the evidence of my own eyes!

I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.
M 1 Antwort Letzte Antwort

0
K knock_knock_lemmy_in@lemmy.world

That looks better. Even with a fair coin, 10 heads in a row is almost impossible.

And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.
D This user is from outside of this forum
D This user is from outside of this forum
davidagain@lemmy.world

schrieb zuletzt editiert von

#272

Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.
K 1 Antwort Letzte Antwort

0
D davidagain@lemmy.world

Again with dismissing the evidence of my own eyes!

I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.
M This user is from outside of this forum
M This user is from outside of this forum
melvin_ferd@lemmy.world

schrieb zuletzt editiert von melvin_ferd@lemmy.world

#273

What does "I give it data to put in a formulaic sentence." mean here

Why not just share the details. I often find a lot of people saying it's doing crazy things and never like to share the details. It's very similar to discussing things with Trump supporters who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that trumpets do. It's wild how similar it is.

And yes asking to do things like iterate over rows isn't how it works. It's getting better but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence but it's nowhere near what its strength is.
D 1 Antwort Letzte Antwort

0
D davidagain@lemmy.world

Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.
K This user is from outside of this forum
K This user is from outside of this forum
knock_knock_lemmy_in@lemmy.world

schrieb zuletzt editiert von

#274

Dunno. Asking 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.
D 1 Antwort Letzte Antwort

0
M melvin_ferd@lemmy.world

What does "I give it data to put in a formulaic sentence." mean here

Why not just share the details. I often find a lot of people saying it's doing crazy things and never like to share the details. It's very similar to discussing things with Trump supporters who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that trumpets do. It's wild how similar it is.

And yes asking to do things like iterate over rows isn't how it works. It's getting better but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence but it's nowhere near what its strength is.
D This user is from outside of this forum
D This user is from outside of this forum
davidagain@lemmy.world

schrieb zuletzt editiert von

#275

I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE are strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

The solution to the problem was to realise that an LLM cannot be trusted for accuracy even if the first few results are completely accurate, the bullshit well creep in. Don't trust the LLM. Check every fucking thing.

In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.
M 1 Antwort Letzte Antwort

0
K knock_knock_lemmy_in@lemmy.world

Dunno. Asking 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.
D This user is from outside of this forum
D This user is from outside of this forum
davidagain@lemmy.world

schrieb zuletzt editiert von davidagain@lemmy.world

#276

You're better off asking one human to do the same task ten times. Humans get better and faster at things as they go along. Always slower than an LLM, but LLMs get more and more likely to veer off on some flight of fancy, further and further from reality, the more it says to you. The chances of it staying factual in the long term are really low.

It's a born bullshitter. It knows a little about a lot, but it has no clue what's real and what's made up, or it doesn't care.

If you want some text quickly, that sounds right, but you genuinely don't care whether it is right at all, go for it, use an LLM. It'll be great at that.
1 Antwort Letzte Antwort

0
E eli001@lemmy.world

This post did not contain any content.
V This user is from outside of this forum
V This user is from outside of this forum
vane@lemmy.world

schrieb zuletzt editiert von

#277

Reading with CEO mindset. 3 out of 10 employees can be fired.
1 Antwort Letzte Antwort

0
D davidagain@lemmy.world

I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE are strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

The solution to the problem was to realise that an LLM cannot be trusted for accuracy even if the first few results are completely accurate, the bullshit well creep in. Don't trust the LLM. Check every fucking thing.

In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.
M This user is from outside of this forum
M This user is from outside of this forum
melvin_ferd@lemmy.world

schrieb zuletzt editiert von melvin_ferd@lemmy.world

#278

This is crazy. I've literally been saying they are fallible. You're saying your professional fed and LLM some type of dataset. So I can't really say what it was you're trying to accomplish but I'm just arguing that trying to have it process data is not what they're trained to do. LLM are incredible tools and I'm tired of trying to act like they're not because people keep using them for things they're not built to do. It's not a fire and forget thing. It does need to be supervised and verified. It's not exactly an answer machine. But it's so good at parsing text and documents, summarizing, formatting and acting like a search engine that you can communicate with rather than trying to grok some arcane sentence. Its power is in language applications.

It is so much fun to just play around with and figure out where it can help. I'm constantly doing things on my computer it's great for instructions. Especially if I get a problem that's kind of unique and needs a big of discussion to solve.
1 Antwort Letzte Antwort

0

Anmelden zum Antworten

A

Business Jet MRO Market: Soaring Demand and Evolving Technologies"
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
1

2

0 Stimmen

1 Beiträge

0 Aufrufe

Niemand hat geantwortet
E

Google faces EU antitrust complaint over AI Overviews
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
9

1

147 Stimmen

9 Beiträge

43 Aufrufe

S

That's not as clever as you think it is.
T

Twitter opens up to Community Notes written by AI bots
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
9

1

44 Stimmen

9 Beiträge

42 Aufrufe

G

Stop fucking using twitter. Stop posting about it, stop posting things that link to it. Delete your account like you should have already.
E

Last year China generated almost 3 times as much solar power as the EU did, and it's close to overtaking all OECD countries put together (whose combined population is 1.38 billion people)
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
149

2

455 Stimmen

149 Beiträge

99 Aufrufe

E

They will say something like solar went from 600gw to 1000 thats a 66% increase this year and coal only increased 40% except coal is 3600gw to 6400. Hrmmmm, maybe these numbers are outdated? Based on this coal and gas are down: In Q1 2025, solar generation rose 48% compared to the same period in 2024. Solar power reached 254 TWh, making up 10% of total electricity. This was the largest increase among all clean energy sources. Coal-fired electricity dropped by 4%, falling to 1,421 TWh. Gas-fired power also went down by 4%, reaching 67 TWh https://carboncredits.com/china-sets-clean-energy-record-in-early-2025-with-951-tw/ are no where close to what is required to meet their climate goals Which ones in particular are you talking about? Trump signs executive order directing US withdrawal from the Paris climate agreement — again https://apnews.com/article/trump-paris-agreement-climate-change-788907bb89fe307a964be757313cdfb0 China vowed on Tuesday to continue participating in two cornerstone multinational arrangements -- the World Health Organization and Paris climate accord -- after newly sworn-in US President Donald Trump ordered withdrawals from them. https://www.france24.com/en/live-news/20250121-china-says-committed-to-who-paris-climate-deal-after-us-pulls-out What's that saying? You hate it when the person you hate is doing good? I can't remember what it is I can't fault them for what they're doing at the moment, even if they are run by an evil dictatorship and do pollute the most I’m not sure how european defense spending is relevant It suggests there is money available in the bank to fund solar/wind/battery, but instead they are preparing for? something? what? who knows. France can make a fighter jet at home but not solar panels apparently. Prehaps they would be made in a country with environmental and labour laws if governments legislated properly to prevent companies outsourcing manufacturing. However this doesnt absolve china. China isnt being forced at Gunpoint to produce these goods with low labour regulation and low environmental regulation. You're right, it doesn't absolve china, and I avoid purchasing things from them wherever possible, my solar panels and EV were made in South Korea, my home battery was made in Germany, there are only a few things in my house made in China, most of them I got second hand but unfortunately there is no escaping the giant of manufacturing. With that said it's one thing for me to sit here and tut tut at China, but I realise I am not most people, the most clearest example is the extreme anti-ai, anti-billionaire bias on this platform, in real life most people don't give a fuck, they love Amazon/Microsoft/Google/Apple etc, they can't go a day without them. So I consider myself a realist, if you want people to buy your stuff then you will need to make the conditions possible for them to WANT to buy your stuff, not out of some moral lecture and Europe isn't doing that, if we look at energy prices: Can someone actually point out to me where this comes from? ... At the end of the day energy is a small % of EU household spending I was looking at corporate/business energy use: Major European companies are already moving to cut costs and retain their competitive edge. For example, Thyssenkrupp, Germany’s largest steelmaker, said on Monday it would slash 11,000 jobs in its steel division by 2030, in a major corporate reshuffle. https://oilprice.com/Latest-Energy-News/World-News/High-Energy-Costs-Continue-to-Plague-European-Industry.html Prices have since fallen but are still high compared to other countries. A poll by Germany's DIHK Chambers of Industry and Commerce of around 3,300 companies showed that 37% were considering cutting production or moving abroad, up from 31% last year and 16% in 2022. For energy-intensive industrial firms some 45% of companies were mulling slashing output or relocation, the survey showed. "The trust of the German economy in energy policy is severely damaged," Achim Dercks, DIHK deputy chief executive said, adding that the government had not succeeded in providing companies with a perspective for reliable and affordable energy supply. https://www.reuters.com/business/energy/more-german-companies-mull-relocation-due-high-energy-prices-survey-2024-08-01/ I've seen nothing to suggest energy prices in the EU are SO cheap that it's worth moving manufacturing TO Europe, and this is what annoys me the most. I've pointed this out before but they have an excellent report on the issues: https://commission.europa.eu/document/download/97e481fd-2dc3-412d-be4c-f152a8232961_en?filename=The+future+of+European+competitiveness+_+A+competitiveness+strategy+for+Europe.pdf Then they put out this Competitive Compass: https://commission.europa.eu/topics/eu-competitiveness/competitiveness-compass_en But tbh every week in the EU it seems like they are chasing after some other goal. This would be great, it would have been greater 10 years ago. Agreed
R

Trump extends TikTok ban deadline by another 90 days
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
6

1

24 Stimmen

6 Beiträge

28 Aufrufe

N

TikTacos
D

Musk's X sues New York state over social media hate speech law
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
1

1

1 Stimmen

1 Beiträge

10 Aufrufe

Niemand hat geantwortet
B

getoffpocket.com, my guide to Pocket alternatives, just got a redesign
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
23

85 Stimmen

23 Beiträge

102 Aufrufe

B

I've made some updates. There are many perspectives to view a guide like this. I hope there are some improvements to the self-hosting perspective. https://getoffpocket.com/
J

Palantir Exposed: The New Deep State [27:27 | JUN 10 2025 | Glenn Greenwald]
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
60

100 Stimmen

60 Beiträge

196 Aufrufe

J

We all get emotional on certain topics; it is understandable. All is well, peace.