linux-nerds.org

Your browser does not seem to support JavaScript. As a result, your viewing experience will be diminished, and you have been placed in read-only mode.

Please download a browser that supports JavaScript, or enable it if it's disabled (i.e. NoScript).

AI agents wrong ~70% of time: Carnegie Mellon study

Technology

272 Beiträge 107 Kommentatoren 79 Aufrufe

M mangocats@feddit.it

I was 0/6 on various trials of AI for Rust over the past 6 months, then I caught a success. Turns out, I was asking it to use a difficult library - I can't make the thing I want work in that library either (library docs say it's possible, but...) when I posed a more open ended request without specifying the library to use, it succeeded - after a fashion. It will give you code with cargo build errors, I copy-paste the error back to it like "address: <pasted error message>" and a bit more than half of the time it is able to respond with a working fix.
J This user is from outside of this forum
J This user is from outside of this forum
jwmgregory@lemmy.dbzer0.com

schrieb zuletzt editiert von

#240

i find that rust’s architecture and design decisions give the LLM quite good guardrails and kind of keep it from doing anything too wonky. the issue arises in cases like these where the rust ecosystem is quite young and documentation/instruction can be poor, even for a human developer.

i think rust actually is quite well suited to agentic development workflows, it just needs to mature more.
M 1 Antwort Letzte Antwort

0
K knock_knock_lemmy_in@lemmy.world

No the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.
D This user is from outside of this forum
D This user is from outside of this forum
davidagain@lemmy.world

schrieb zuletzt editiert von

#241

Ah, my bad, you're right, for being consistently correct, I should have done 0.3^10=0.0000059049

so the chances of it being right ten times in a row are less than one thousandth of a percent.

No wonder I couldn't get it to summarise my list of data right and it was always lying by the 7th row.
K 1 Antwort Letzte Antwort

1
J jwmgregory@lemmy.dbzer0.com

i find that rust’s architecture and design decisions give the LLM quite good guardrails and kind of keep it from doing anything too wonky. the issue arises in cases like these where the rust ecosystem is quite young and documentation/instruction can be poor, even for a human developer.

i think rust actually is quite well suited to agentic development workflows, it just needs to mature more.
M This user is from outside of this forum
M This user is from outside of this forum
mangocats@feddit.it

schrieb zuletzt editiert von

#242

i think rust actually is quite well suited to agentic development workflows, it just needs to mature more.

I agree. The agents also need to mature more to handle multi-level structures - work on a collection of smaller modules to get a larger system with more functionality. I can see the path forward for those tools, but the ones I have access to definitely aren't there yet.
1 Antwort Letzte Antwort

0
C chaoticentropy@feddit.uk

In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user."

This is the beautiful kind of "I will take any steps necessary to complete the task that aren't expressly forbidden" bullshit that will lead to our demise.
M This user is from outside of this forum
M This user is from outside of this forum
m0op0o@mander.xyz

schrieb zuletzt editiert von

#243

It does not say a dog can not play basketball.
C 1 Antwort Letzte Antwort

17
M m0op0o@mander.xyz

It does not say a dog can not play basketball.
C This user is from outside of this forum
C This user is from outside of this forum
chaoticentropy@feddit.uk

schrieb zuletzt editiert von

#244

"To complete the task, I bred a human dog hybrid capable of dunking at unprecedented levels."
M 1 Antwort Letzte Antwort

9
D davidagain@lemmy.world

I think it's lemmy users. I see a lot more LLM skepticism here than in the news feeds.

In my experience, LLMs are like the laziest, shittiest know-nothing bozo forced to complete a task with zero attention to detail and zero care about whether it's crap, just doing enough to sound convincing.
S This user is from outside of this forum
S This user is from outside of this forum
someacnt@sh.itjust.works

schrieb zuletzt editiert von

#245

Wdym, I have seen researchers using it to aid their research significantly. You just need to verify some stuff it says.
D 1 Antwort Letzte Antwort

0
A alteredego@lemmy.ml

Emotion > Facts. Most people have been trained to blindly accept things and cheer on what fits with their agenda. Like technbro's exaggerating LLMs, or people like you misrepresenting LLMs as mere statistical word generators without intelligence. That's like saying a computer is just wires and switches, or missing the forest for the trees. Both is equally false.

Yet if it fits with the emotional needs or with dogma, then other will agree. It's a convenient and comforting "A vs B" worldview we've been trained to accept. And so the satisfying notion and misinformation keeps spreading.

LLMs tell us more about human intelligence and the human slop we've been generating. It tells us that most people are not that much more than statistical word generators.
S This user is from outside of this forum
S This user is from outside of this forum
someacnt@sh.itjust.works

schrieb zuletzt editiert von

#246

Truth is bitter, and I hate it.
1 Antwort Letzte Antwort

0
S someacnt@sh.itjust.works

Wdym, I have seen researchers using it to aid their research significantly. You just need to verify some stuff it says.
D This user is from outside of this forum
D This user is from outside of this forum
davidagain@lemmy.world

schrieb zuletzt editiert von

#247

Verify every single bloody line of output. Top three to five are good, then it starts guessing the rest based on the pattern so far. If I wanted to make shit up randomly, I would do it myself.

People who trust LLMs to tell them things that are right rather than things that sound right have fundamentally misunderstood what an LLM is and how it works.
S 1 Antwort Letzte Antwort

0
D davidagain@lemmy.world

Verify every single bloody line of output. Top three to five are good, then it starts guessing the rest based on the pattern so far. If I wanted to make shit up randomly, I would do it myself.

People who trust LLMs to tell them things that are right rather than things that sound right have fundamentally misunderstood what an LLM is and how it works.
S This user is from outside of this forum
S This user is from outside of this forum
someacnt@sh.itjust.works

schrieb zuletzt editiert von

#248

It's not that bad, the output isn't random.
Time to time, it can produce novel stuffs like new equations for engineering.
Also, verification does not take that much effort. At least according to my colleagues, it is great.
Also works well for coding well-known stuffs, as well!
D 1 Antwort Letzte Antwort

0
C chaoticentropy@feddit.uk

"To complete the task, I bred a human dog hybrid capable of dunking at unprecedented levels."
M This user is from outside of this forum
M This user is from outside of this forum
m0op0o@mander.xyz

schrieb zuletzt editiert von

#249

"Where are my balls Summer?"
C 1 Antwort Letzte Antwort

5
J jsomae@lemmy.ml

I'd just like to point out that, from the perspective of somebody watching AI develop for the past 10 years, completing 30% of automated tasks successfully is pretty good! Ten years ago they could not do this at all. Overlooking all the other issues with AI, I think we are all irritated with the AI hype people for saying things like they can be right 100% of the time -- Amazon's new CEO actually said they would be able to achieve 100% accuracy this year, lmao. But being able to do 30% of tasks successfully is already useful.
S This user is from outside of this forum
S This user is from outside of this forum
someacnt@sh.itjust.works

schrieb zuletzt editiert von

#250

Thing is, they might achieve 99% accuracy given the speed of progress. Lots of brainpower is getting poured into LLMs.
Honestly, it is soo scary. It could be replacing me...
J 1 Antwort Letzte Antwort

1
S someacnt@sh.itjust.works

Thing is, they might achieve 99% accuracy given the speed of progress. Lots of brainpower is getting poured into LLMs.
Honestly, it is soo scary. It could be replacing me...
J This user is from outside of this forum
J This user is from outside of this forum
jsomae@lemmy.ml

schrieb zuletzt editiert von

#251

yeah, this is why I'm #fuck-ai to be honest.
1 Antwort Letzte Antwort

0
M m0op0o@mander.xyz

"Where are my balls Summer?"
C This user is from outside of this forum
C This user is from outside of this forum
chaoticentropy@feddit.uk

schrieb zuletzt editiert von

#252

The first dunk is the hardest
1 Antwort Letzte Antwort

3
S someacnt@sh.itjust.works

It's not that bad, the output isn't random.
Time to time, it can produce novel stuffs like new equations for engineering.
Also, verification does not take that much effort. At least according to my colleagues, it is great.
Also works well for coding well-known stuffs, as well!
D This user is from outside of this forum
D This user is from outside of this forum
davidagain@lemmy.world

schrieb zuletzt editiert von

#253

It's not completely random, but I'm telling you it fucked up, it fucked up badly, time after time, and I had to check every single thing manually. It's correctness run never lasted beyond a handful. If you build something using some equation it invented you're insane and should quit engineering before you hurt someone.
1 Antwort Letzte Antwort

0
E eli001@lemmy.world

This post did not contain any content.
D This user is from outside of this forum
D This user is from outside of this forum
dan69@lemmy.world

schrieb zuletzt editiert von

#254

And it won’t be until humans can agree on what’s a fact and true vs not.. there is always someone or some group spreading mis/dis-information
1 Antwort Letzte Antwort

0
D davidagain@lemmy.world

If that’s the quality of answer you’re getting, then it’s a user error

No, I know the data I gave it and I know how hard I tried to get it to use it truthfully.

You have an irrational and wildly inaccurate belief in the infallibility of LLMs.

You're also denying the evidence of my own experience. What on earth made you think I would believe you over what I saw with my own eyes?
M This user is from outside of this forum
M This user is from outside of this forum
melvin_ferd@lemmy.world

schrieb zuletzt editiert von melvin_ferd@lemmy.world

#255

Why are you giving it data. It's a chat and language tool. It's not data based. You need something trained to work for that specific use. I think Wolfram Alpha has better tools for that.

I wouldn't trust it to calculate how many patio stones I need to build a project. But I trust it to tell me where a good source is on a topic or if a quote was said by who ever or if I need to remember something but I only have vague pieces like old timey historical witch burning related factoid about villagers who pulled people through a hole in the church wall or what was a the princess who was skeptic and sent her scientist to villages to try to calm superstitious panic .

Other uses are like digging around my computer and seeing what processes do what. How concepts work regarding the think I'm currently learning. So many excellent users. But I fucking wouldn't trust it to do any kind of calculation.
D 1 Antwort Letzte Antwort

0
N nalivai@discuss.tchncs.de

You probably wanted to show off how smart you are, but instead you showed that you can't even talk to people without help of your favourite slop bucket.
It didn't answer my curiosity about what came first, but it solidified my conviction that your brain is cooked all the way, probably beyond repair. I would say you need to seek professional help, but at this point you would interpret it as needing to talk to the autocomplete, and it will cook you even more.
It started funny, but I feel very sorry for you now, and it sucked all the humour out.
K This user is from outside of this forum
K This user is from outside of this forum
kameecoding@lemmy.world

schrieb zuletzt editiert von kameecoding@lemmy.world

#256

You just can't talk to people, period, you are just a dick, you were also just proven to be stupider than a fucking LLM, have a nice day
N 1 Antwort Letzte Antwort

0
S szczuroarturo@programming.dev

I actually have a fairly positive experience with ai ( copilot using claude specificaly ). Is it wrong a lot if you give it a huge task yes, so i dont do that and using as a very targeted solution if i am feeling very lazy today . Is it fast . Also not . I could actually be faster than ai in some cases.
But is it good if you are working for 6h and you just dont have enough mental capacity for the rest of the day. Yes . You can just prompt it specificaly enough to get desired result and just accept correct responses. Is it always good ,not really but good enough. Do i also suck after 3pm . Yes.
My main issue is actually the fact that it saves first and then asks you to pick if you want to use it. Not a problem usualy but if it crashes the generated code stays so that part sucks
J This user is from outside of this forum
J This user is from outside of this forum
jcg@halubilo.social

schrieb zuletzt editiert von

#257

You should give Claude Code a shot if you have a Claude subscription. I'd say this is where AI actually does a decent job: picking up human slack, under supervision, not replacing humans at anything. AI tools won't suddenly be productive enough to employ, but I as a professional can use it to accelerate my own workflow. It's actually where the risk of them taking jobs is real: for example, instead of 10 support people you can have 2 who just supervise the responses of an AI.

But of course, the Devil's in the detail. The only reason this is cost effective is because of VC money subsidizing and hiding the real cost of running these models.
1 Antwort Letzte Antwort

0
M melvin_ferd@lemmy.world

Why are you giving it data. It's a chat and language tool. It's not data based. You need something trained to work for that specific use. I think Wolfram Alpha has better tools for that.

I wouldn't trust it to calculate how many patio stones I need to build a project. But I trust it to tell me where a good source is on a topic or if a quote was said by who ever or if I need to remember something but I only have vague pieces like old timey historical witch burning related factoid about villagers who pulled people through a hole in the church wall or what was a the princess who was skeptic and sent her scientist to villages to try to calm superstitious panic .

Other uses are like digging around my computer and seeing what processes do what. How concepts work regarding the think I'm currently learning. So many excellent users. But I fucking wouldn't trust it to do any kind of calculation.
D This user is from outside of this forum
D This user is from outside of this forum
davidagain@lemmy.world

schrieb zuletzt editiert von

#258

Why are you giving it data

Because there's a button for that.

It’s output is dependent on the input

This thing that you said... It's false.
M 1 Antwort Letzte Antwort

0
E eli001@lemmy.world

This post did not contain any content.
D This user is from outside of this forum
D This user is from outside of this forum
davidagain@lemmy.world

schrieb zuletzt editiert von davidagain@lemmy.world

#259

Wow. 30% accuracy was the high score!
From the article:

Testing agents at the office

For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

the CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

Gemini-2.5-Pro (30.3 percent)
Claude-3.7-Sonnet (26.3 percent)
Claude-3.5-Sonnet (24 percent)
Gemini-2.0-Flash (11.4 percent)
GPT-4o (8.6 percent)
o3-mini (4.0 percent)
Gemini-1.5-Pro (3.4 percent)
Amazon-Nova-Pro-v1 (1.7 percent)
Llama-3.1-405b (7.4 percent)
Llama-3.3-70b (6.9 percent),
Qwen-2.5-72b (5.7 percent),
Llama-3.1-70b (1.7 percent)
Qwen-2-72b (1.1 percent).

"We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper
U 1 Antwort Letzte Antwort

6

Anmelden zum Antworten

A

VMware’s rivals ramp efforts to create alternative stacks
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
7

1

60 Stimmen

7 Beiträge

7 Aufrufe

B

I do the same in Proxmox VMs, in my homelab, which is... fine. I was talking more about native support, manageable via an API or something. Say I need to increase the number of nodes in my cluster. I spin up a new VM using the template I have, adjust the network configuration, update the packages, add it to the cluster. Oh, maybe I should also do an update on all of them while I'm there, because now the new machine runs a different docker version. I have some Ansible and bash scripts that automates most of this. It works for my homelab. At work however, I have a handful of clusters, with dozens of nodes. The method above can become tedious fast and it's prone to human errors. We use external Kubernetes as a service platforms (think DOKS, EKS, etc), who have Terraform providers available. So I open my Terraform config and increase the number of nodes in one of my pre-production clusters from 9 to 11. I also change the version from 1.32 to 1.33. I then push my changes to a new merge request, my Gitlab CI spins up, who calls Atlantis to run a terraform plan, I check the results and ask it to apply. It takes 2 minutes. I would love to see this work with Proxmox.
P

Russian Internet users are unable to access the open Internet
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
30

1

360 Stimmen

30 Beiträge

119 Aufrufe

Z

Also don't forget all the suicides happening with hard to obtain poisons and shooting oneself in the back of the head three times.
P

The Career Calamity: Monster. com and CareerBuilder, Two of the most prominent legacy job application sites file for Chapter 11 bankruptcy. Together. Maybe they lost their edge.
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
15

1

73 Stimmen

15 Beiträge

18 Aufrufe

L

same, i however dont subscribe to thier "contact you by recruiters, since you get flooded with indian recruiters of questionable positions, and jobs im not eligible for. unfortunately for the field i was trying to get into, wasnt helping so i found just a regular job in the mean time.
D

The Guardian and Cambridge University's Department of Computer Science unveil new secure technology to protect sources
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
64

1

327 Stimmen

64 Beiträge

227 Aufrufe

B

I get that, but it's more logical to me that of I'm going to whistleblow on a company to not use one of their devices to do it. That way it doesn't matter what apps are or are not secure, you're not using their device that can potentially track you.
P

Spyware and state abuse: The case for an EU-wide ban
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
2

1

54 Stimmen

2 Beiträge

16 Aufrufe

M

I'm surprised it isn't already illegal to install software on someone's phone without their consent or knowledge. Sounds like a form of property damage.
M

The hidden time bomb in the tax code that's fueling mass tech layoffs
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
31

1

210 Stimmen

31 Beiträge

118 Aufrufe

T

In 2025 it would be anything above 3.6 million. It's a ton of money but here's a list of a few people that hit it. https://aflcio.org/paywatch/highest-paid-ceos Now if they added in a progressive tax rate for corporate taxes as well.... Say anything over 500 million in net profit is taxed at a 90+% rate. That would solve all sorts of issues. Suddenly investors of all these mega corps would be pushing hard to divide up the companies into smaller entities. Wealth tax in the modern age could be an inheritance tax. Anything over the median life earnings of individuals could be taxed at 100%. So median earnings in my area is $65K * 45 years (20-65k) = $2.93 million.
T

Telegram partners with xAI to bring Grok to over a billion users
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
36

1

38 Stimmen

36 Beiträge

125 Aufrufe

R

So you pay taxes to Putin. Good to know who actually helps funding the regime. I suggest you go someplace else. I won't take this from a jerk from likely one of the countries buying fossil fuels from said regime, that have also supported it after a few falsified elections starting in 1996, which is also the year I was born. And of course "paying taxes to Putin" can't be even compared to what TG is doing, so just shut up and go do something you know how to do, like I dunno what.
A

Palantir’s Idea of Peace
Beobachtet Ignoriert Geplant Angeheftet Gesperrt Verschoben Technology technology
12

22 Stimmen

12 Beiträge

50 Aufrufe

A

"Totally not a narc, inc."