AI agents wrong ~70% of time: Carnegie Mellon study
-
Ah, my bad, you're right. For it to be consistently correct I should have done 0.3^10 = 0.0000059049,
so the chances of it being right ten times in a row are less than one thousandth of a percent.
No wonder I couldn't get it to summarise my list of data right and it was always lying by the 7th row.
That looks better. Even with a fair coin, 10 heads in a row happens less than 0.1% of the time.
And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.
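The arithmetic in this thread checks out; a quick sketch, using the 30% figure from the headline and the fair-coin comparison from the comments above:

```python
# Probability of an agent that is right 30% of the time being right
# 10 times in a row, compared with a fair coin landing heads 10 times.
p_agent = 0.3
p_coin = 0.5

agent_streak = p_agent ** 10  # about 5.9e-6, under a thousandth of a percent
coin_streak = p_coin ** 10    # 1/1024, just under 0.1%

print(f"agent streak: {agent_streak * 100:.6f}%")
print(f"coin streak:  {coin_streak * 100:.4f}%")
```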
-
You just can't talk to people, period, you are just a dick, you were also just proven to be stupider than a fucking LLM, have a nice day
Did the autocomplete tell you to answer this? Don't answer, actually, save some energy.
-
This post did not contain any content.
Now I'm curious, what's the average score for humans?
-
The 256 thing was written by a person. AI doesn't have exclusive rights to being dumb, plenty of dumb people around.
you're right, the dumb of AI is completely comparable to the dumb of human, there's no difference worth talking about, sorry i even spoke the fuck up
-
I asked Claude 3.5 Haiku to write me a quine in COBOL in the BS2000 dialect. Claude does know that creating a perfect quine in COBOL is challenging due to the need to represent the self-referential nature of the code. After a few suggestions, Claude restated its first draft: without the proper BS2000 incantations, without a PERFORM statement, and without any self-referential REDEFINES. It's a lot of work. I stopped caring and moved on.
For those who wonder: https://sourceforge.net/p/gnucobol/discussion/lounge/thread/495d8008/ has an example.
Colour me unimpressed. I dread the day when they force the use of 'AI' on us at work.
-
Why are you giving it data
Because there's a button for that.
Its output is dependent on the input.
This thing that you said... It's false.
There's a sleep button on my laptop. Doesn't mean I would use it.
I'm just trying to say you're using a feature that everyone kind of knows doesn't work. ChatGPT is not trained to do calculations well.
I just like technology, and I fully believe the left's hatred of it is not logical. I believe it stems from a lot of media bias and headlines. Why there's this push from the media is a question I would like to know more about. But overall, I see a lot of the same makers of bullshit yellow journalism for this stuff on the left as I do for similar bullshit in right-wing spaces towards other things.
-
America: "Good enough to handle 911 calls!"
Is there really a plan to use this for 911 services??
-
Wow. 30% accuracy was the high score!
From the article:
Testing agents at the office
For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.
They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.
The CMU boffins put the following models through their paces and evaluated them based on their task success rates. The results were underwhelming:
Gemini-2.5-Pro (30.3 percent)
Claude-3.7-Sonnet (26.3 percent)
Claude-3.5-Sonnet (24 percent)
Gemini-2.0-Flash (11.4 percent)
GPT-4o (8.6 percent)
o3-mini (4.0 percent)
Gemini-1.5-Pro (3.4 percent)
Amazon-Nova-Pro-v1 (1.7 percent)
Llama-3.1-405b (7.4 percent)
Llama-3.3-70b (6.9 percent)
Qwen-2.5-72b (5.7 percent)
Llama-3.1-70b (1.7 percent)
Qwen-2-72b (1.1 percent)
"We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper
Sounds like the fault of the researchers for not building better tests, or not understanding the limits of the software well enough to use it right.
-
Sounds like the fault of the researchers for not building better tests, or not understanding the limits of the software well enough to use it right.
Are you arguing they should have built a test that makes AI perform better? How are you offended on behalf of AI?
-
you're right, the dumb of AI is completely comparable to the dumb of human, there's no difference worth talking about, sorry i even spoke the fuck up
No worries.
-
Why would they be right, beyond word-sequence frequencies?
-