linux-nerds.org

AI agents wrong ~70% of time: Carnegie Mellon study

Technology

277 posts, 108 commenters, 90 views

nalivai@discuss.tchncs.de

You probably wanted to show off how smart you are, but instead you showed that you can't even talk to people without the help of your favourite slop bucket.
It didn't answer my curiosity about what came first, but it solidified my conviction that your brain is cooked all the way, probably beyond repair. I would say you need to seek professional help, but at this point you would interpret it as needing to talk to the autocomplete, and it will cook you even more.
It started funny, but I feel very sorry for you now, and it sucked all the humour out.
kameecoding@lemmy.world

last edited by kameecoding@lemmy.world

#256

You just can't talk to people, period. You are just a dick, and you were also just proven to be stupider than a fucking LLM. Have a nice day.

0
szczuroarturo@programming.dev

I actually have a fairly positive experience with AI (Copilot using Claude, specifically). Is it wrong a lot if you give it a huge task? Yes, so I don't do that, and I use it as a very targeted solution when I'm feeling very lazy. Is it fast? Also no. I could actually be faster than the AI in some cases.
But is it good if you have been working for 6 hours and just don't have enough mental capacity left for the rest of the day? Yes. You can prompt it specifically enough to get the desired result and just accept the correct responses. Is it always good? Not really, but good enough. Do I also suck after 3pm? Yes.
My main issue is actually that it saves the generated code first and then asks whether you want to use it. Usually that's not a problem, but if it crashes, the generated code stays, so that part sucks.
jcg@halubilo.social

last edited by

#257

You should give Claude Code a shot if you have a Claude subscription. I'd say this is where AI actually does a decent job: picking up human slack, under supervision, not replacing humans at anything. AI tools won't suddenly be productive enough to employ, but I as a professional can use them to accelerate my own workflow. That's actually where the risk of them taking jobs is real: for example, instead of 10 support people you can have 2 who just supervise the responses of an AI.

But of course, the devil's in the detail. The only reason this is cost-effective is that VC money is subsidizing and hiding the real cost of running these models.

0
melvin_ferd@lemmy.world

Why are you giving it data? It's a chat and language tool. It's not data-based. You need something trained to work for that specific use. I think Wolfram Alpha has better tools for that.

I wouldn't trust it to calculate how many patio stones I need to build a project. But I trust it to tell me where to find a good source on a topic, or whether a quote was really said by whoever, or to help when I need to remember something but only have vague pieces, like an old-timey historical witch-burning factoid about villagers who pulled people through a hole in the church wall, or which princess it was who was a skeptic and sent her scientist to villages to try to calm a superstitious panic.

Other uses include digging around my computer and seeing which processes do what, or explaining how concepts work in whatever I'm currently learning. So many excellent uses. But I fucking wouldn't trust it to do any kind of calculation.
davidagain@lemmy.world

last edited by

#258

Why are you giving it data

Because there's a button for that.

Its output is dependent on the input

This thing that you said... It's false.

0
eli001@lemmy.world

This post did not contain any content.
davidagain@lemmy.world

last edited by davidagain@lemmy.world

#259

Wow. 30% accuracy was the high score!
From the article:

Testing agents at the office

For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

the CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

Gemini-2.5-Pro (30.3 percent)
Claude-3.7-Sonnet (26.3 percent)
Claude-3.5-Sonnet (24 percent)
Gemini-2.0-Flash (11.4 percent)
GPT-4o (8.6 percent)
o3-mini (4.0 percent)
Gemini-1.5-Pro (3.4 percent)
Amazon-Nova-Pro-v1 (1.7 percent)
Llama-3.1-405b (7.4 percent)
Llama-3.3-70b (6.9 percent)
Qwen-2.5-72b (5.7 percent)
Llama-3.1-70b (1.7 percent)
Qwen-2-72b (1.1 percent)

"We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper.

6
D davidagain@lemmy.world

Ah, my bad, you're right, for being consistently correct, I should have done 0.3^10=0.0000059049

so the chances of it being right ten times in a row are less than one thousandth of a percent.

No wonder I couldn't get it to summarise my list of data right and it was always lying by the 7th row.
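
The arithmetic above can be checked in a few lines of Python. This is only a minimal sketch: it assumes every attempt is independent and succeeds with the same probability, which the post takes for granted but which real agent runs may not satisfy. The function name `streak_probability` is made up for illustration, the 0.3 is the roughly 30 percent top score quoted in the thread, and the fair-coin value is included purely for comparison.

```python
from fractions import Fraction


def streak_probability(p: Fraction, n: int) -> Fraction:
    """Probability of n successes in a row, assuming independent attempts
    that each succeed with probability p."""
    return p ** n


# Assumption: a flat 30% per-task success rate, taken from the thread.
agent = streak_probability(Fraction(3, 10), 10)
# A fair coin landing heads ten times in a row, for comparison.
coin = streak_probability(Fraction(1, 2), 10)

print(f"agent, 10 in a row: {float(agent):.10f} ({float(agent) * 100:.6f} percent)")
print(f"coin, 10 heads:     {float(coin):.10f} ({float(coin) * 100:.6f} percent)")
```

Using `Fraction` keeps the result exact, so the comparison with "one thousandth of a percent" does not depend on floating-point rounding.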
knock_knock_lemmy_in@lemmy.world

last edited by

#260

That looks better. Even with a fair coin, 10 heads in a row is almost impossible.

And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.

0
nalivai@discuss.tchncs.de

last edited by

#261

Did the autocomplete tell you to answer this? Don't answer, actually, save some energy.

0
iopq@lemmy.world

last edited by

#262

Now I'm curious, what's the average score for humans?

2
tja@programming.dev

The 256 thing was written by a person. AI doesn't have exclusive rights to being dumb; there are plenty of dumb people around.
timeworntraveler@lemmy.dbzer0.com

last edited by

#263

you're right, the dumb of AI is completely comparable to the dumb of human, there's no difference worth talking about, sorry i even spoke the fuck up

0
gargle@lemmy.world

last edited by

#264

I asked Claude 3.5 Haiku to write me a quine in COBOL in the BS2000 dialect. Claude does know that creating a perfect quine in COBOL is challenging due to the need to represent the self-referential nature of the code. After a few suggestions, Claude restated its first draft, without proper BS2000 incantations, without a PERFORM statement, and without any self-referential REDEFINES. It's a lot of work. I stopped caring and moved on.

For those who wonder: https://sourceforge.net/p/gnucobol/discussion/lounge/thread/495d8008/ has an example.

Colour me unimpressed. I dread the day when they force the use of 'AI' on us at work.
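
For anyone unsure what a quine is: it is a program whose output is exactly its own source code. The classic two-line Python version below is only meant to illustrate the self-reference trick, where a string holds a template of the whole program and is printed with its own representation substituted in. It is not COBOL and says nothing about BS2000 specifics.

```python
s = 's = %r\nprint(s %% s)'
print(s % s)
```

The block carries no comments on purpose: any extra line would also have to appear in the output, or the program would stop being a quine.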

2
melvin_ferd@lemmy.world

last edited by melvin_ferd@lemmy.world

#265

There's a sleep button on my laptop. Doesn't mean I would use it.

I'm just trying to say you're talking about a feature that everyone kind of knows doesn't work. ChatGPT is not trained to do calculations well.

I just like technology, and I fully believe the left's hatred of it is not logical. I believe it stems from a lot of media and headlines. Why there's this push from the media is something I would like to know more about. But overall, I see a lot of the same makers of bullshit yellow journalism for this stuff on the left as I do for similar bullshit in right-wing spaces towards other things.

0
vanilla_puddinfudge@infosec.pub

America: "Good enough to handle 911 calls!"
decq@lemmy.world

last edited by

#266

Is there really a plan to use this for 911 services??

0
upgrayedd1776@sh.itjust.works

last edited by

#267

Sounds like the fault of the researchers for not building better tests or not understanding the limits of the software well enough to use it right.

1
rekorse@sh.itjust.works

last edited by

#268

Are you arguing they should have built a test that makes AI perform better? How are you offended on behalf of AI?

1
tja@programming.dev

last edited by

#269

No worries.

0
sircac@lemmy.world

last edited by

#270

Why would they be right beyond word sequence frequencies?

0
davidagain@lemmy.world

last edited by

#271

Again with dismissing the evidence of my own eyes!

I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.

0
davidagain@lemmy.world

last edited by

#272

Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.

0
melvin_ferd@lemmy.world

last edited by melvin_ferd@lemmy.world

#273

What does "I give it data to put in a formulaic sentence" mean here?

Why not just share the details? I often find that a lot of people say it's doing crazy things and never like to share the details. It's very similar to discussing things with Trump supporters, who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that trumpets do. It's wild how similar it is.

And yes, asking it to do things like iterate over rows isn't how it works. It's getting better, but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence, but it's nowhere near what its strength is.

0
knock_knock_lemmy_in@lemmy.world

last edited by

#274

Dunno. Ask 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

0
davidagain@lemmy.world

last edited by

#275

I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

The solution to the problem was to realise that an LLM cannot be trusted for accuracy: even if the first few results are completely accurate, the bullshit will creep in. Don't trust the LLM. Check every fucking thing.

In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.
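
The "quick script" approach described above (split each row on tab characters, then fill in a fixed sentence) could look roughly like the sketch below. The post does not disclose the real data or sentence, so `TEMPLATE`, `FIELDS`, the function name `rows_to_sentences`, and the sample rows are all hypothetical placeholders.

```python
# Hypothetical sentence pattern and column layout; the real ones are not given in the post.
TEMPLATE = "{name} recorded {count} {unit} on {date}."
FIELDS = ["name", "count", "unit", "date"]


def rows_to_sentences(tsv_text: str) -> list[str]:
    """Turn each tab-separated row into one sentence by filling TEMPLATE."""
    sentences = []
    for line_no, line in enumerate(tsv_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        values = line.split("\t")
        if len(values) != len(FIELDS):
            # Fail loudly instead of guessing, so a bad row never becomes a wrong sentence.
            raise ValueError(
                f"line {line_no}: expected {len(FIELDS)} fields, got {len(values)}"
            )
        sentences.append(TEMPLATE.format(**dict(zip(FIELDS, values))))
    return sentences


# Made-up sample input, for illustration only.
sample = "Alice\t3\tsamples\t2025-01-07\nBob\t12\tsamples\t2025-01-08\n"
for sentence in rows_to_sentences(sample):
    print(sentence)
```

Because the mapping is deterministic, row 700 is handled exactly like row 1, and a malformed row raises an error rather than silently producing a plausible but wrong sentence.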

0
