AI agents wrong ~70% of time: Carnegie Mellon study
-
Were you prone to these weird leaps of logic before your brain was fried by talking to LLMs, or did you start being a fan of talking to LLMs because your ability to logic was...well...that?
You see, I wanted to be petty and do another dismissive reply, but instead I fed our convo to copilot and asked it to explain. Here you go. As you can see, I have previously used it for coding tasks, so I didn't feed it any extra info. So there you go, even copilot can understand the huge "leap" I made in logic. Goddamn, the sweet taste of irony.
Copilot reply:
Certainly! Here’s an explanation Person B could consider:
The implied logic in Person A’s argument is that if you distrust code written by Copilot (or any AI tool) simply because it wasn’t written by you, then by the same reasoning, you should also distrust code written by junior developers, since that code also isn’t written by you and may have mistakes or lack experience.
However, in real-world software development, teams regularly review, test, and maintain code written by others—including juniors, seniors, and even AI tools. The quality of code depends on review processes, testing, and collaboration, not just on who wrote it. Dismissing Copilot-generated code outright is similar to dismissing the contributions of junior developers, which isn’t practical or productive in a collaborative environment.
-
I can't believe how absolutely silly a lot of you sound with this.
An LLM is a tool. Its output is dependent on the input. If that's the quality of answer you're getting, then it's a user error. I guarantee you that LLM answers for many problems are definitely adequate.
It's like if a carpenter said the cabinets turned out shit because his hammer only produces crap.
Also, another person commented that seeing the pattern you also see means we're psychotic.
All I'm trying to suggest is that Lemmy is getting seriously manipulated by the media attitude towards LLMs, and I feel these comments really highlight that.
If that’s the quality of answer you’re getting, then it’s a user error
No, I know the data I gave it and I know how hard I tried to get it to use it truthfully.
You have an irrational and wildly inaccurate belief in the infallibility of LLMs.
You're also denying the evidence of my own experience. What on earth made you think I would believe you over what I saw with my own eyes?
-
and? we can understand 256 where AI can't, that's the point.
The 256 thing was written by a person. AI doesn't have exclusive rights to being dumb, plenty of dumb people around.
-
About 0.02
So the chances of it being right ten times in a row are 2%.
-
For me as a software developer the accuracy is more in the 95%+ range.
On one hand, the built-in copilot chat widget in Intellij basically replaces a lot of my google queries.
On the other hand, it is rather fucking good at executing some rewrites that are a fucking chore to do manually, but can easily be done by copilot.
Imagine you have a script that initializes your DB with some test data. You have an INSERT INTO statement with lots of columns and rows, so:
Insert into (column1, ..., column_n)
Values (row1),
       (row2),
       ...
       (row_n)
Adding a new column with test data for each row is a PITA, but copilot handles it without issue.
Similarly, when writing unit tests you do a lot of edge case testing, which means a bunch of almost identical tests with maybe one variable changing. At most you write one of those tests, then copilot will auto-generate the rest after you name the next unit test; it's pretty good at guessing what you want to do in that test, at least with my naming scheme.
So yeah, it's way overrated for many, many things, but for programming it's a pretty awesome productivity tool.
For your database test data, I usually write a helper that defaults those columns to base values, so I can pass in lists of dictionaries, then the test cases are easier to modify and read.
It's also nice because you're only including the fields you use in your unit test; the rest get valid defaults you don't need to care about.
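Something like this is what I mean - a minimal sketch, where the users table, its columns, and the cursor-based insert are made-up placeholders rather than anything from an actual project:

# Hypothetical defaults helper (Python); table and column names are illustrative only.
DEFAULTS = {
    "name": "Test User",
    "email": "test@example.com",
    "active": True,
    "credits": 0,
}

def make_rows(overrides_list):
    # Merge each override dict onto the defaults, so a test only
    # spells out the fields it actually cares about.
    return [{**DEFAULTS, **overrides} for overrides in overrides_list]

def insert_users(cursor, rows):
    # Build one parameterized INSERT covering every default column.
    columns = list(DEFAULTS)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO users ({', '.join(columns)}) VALUES ({placeholders})"
    cursor.executemany(sql, [tuple(row[c] for c in columns) for row in rows])

# A test only writes out the interesting fields:
rows = make_rows([
    {"credits": 100},
    {"active": False, "email": "inactive@example.com"},
])

Adding a new column later is then a one-line change to DEFAULTS instead of touching every row.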
-
Oh sure, caution is always warranted w/ LLMs. But when it works, it can save a ton of time.
Definitely, I'm just trying to share a foot gun I've accidentally triggered myself!
-
So the chances of it being right ten times in a row are 2%.
No, the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.
-
Jan Refiner is up there for me.
I just arrived at act 2, and he wasn't one of the four I've unlocked...
-
I think this comment made me finally understand the AI hate circlejerk on lemmy. If you have no clue how LLMs work and you have no idea where "AI" is coming from, it just looks like another crappy product that was thrown on the market half-ready. I guess you can only appreciate the absolutely incredible development of LLMs (and AI in general) that happened during the last ~5 years if you can actually see it in the first place.
The notion that AI is half-ready is a really poignant observation actually. It's ready for select applications only, but it's really being advertised like it's idiot-proof and ready for general use.
-
You see, I wanted to be petty and do another dismissive reply, but instead I fed our convo to copilot and asked it to explain. Here you go. As you can see, I have previously used it for coding tasks, so I didn't feed it any extra info. So there you go, even copilot can understand the huge "leap" I made in logic. Goddamn, the sweet taste of irony.
Copilot reply:
Certainly! Here’s an explanation Person B could consider:
The implied logic in Person A’s argument is that if you distrust code written by Copilot (or any AI tool) simply because it wasn’t written by you, then by the same reasoning, you should also distrust code written by junior developers, since that code also isn’t written by you and may have mistakes or lack experience.
However, in real-world software development, teams regularly review, test, and maintain code written by others—including juniors, seniors, and even AI tools. The quality of code depends on review processes, testing, and collaboration, not just on who wrote it. Dismissing Copilot-generated code outright is similar to dismissing the contributions of junior developers, which isn’t practical or productive in a collaborative environment.
You probably wanted to show off how smart you are, but instead you showed that you can't even talk to people without the help of your favourite slop bucket.
It didn't answer my curiosity about what came first, but it solidified my conviction that your brain is cooked all the way, probably beyond repair. I would say you need to seek professional help, but at this point you would interpret it as needing to talk to the autocomplete, and it will cook you even more.
It started funny, but I feel very sorry for you now, and it sucked all the humour out.
-
How do I set up event driven document ingestion from OneDrive located on an Azure tenant to Amazon DocumentDB? Ingestion must be near-realtime, durable, and have some form of DLQ.
@Shayeta
You might have a look at #rclone for the ingress part
@criss_cross
-
This post did not contain any content.
I notice that the research didn't include DeepSeek. It would have been nice to see how it compares.
-
No, the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.
don’t you dare understand the explicitly obvious reasons this technology can be useful and the essential differences between P and NP problems. why won’t you be angry
-
I was 0/6 on various trials of AI for Rust over the past 6 months, then I caught a success. Turns out, I was asking it to use a difficult library - I can't make the thing I want work in that library either (library docs say it's possible, but...). When I posed a more open-ended request without specifying the library to use, it succeeded - after a fashion. It will give you code with cargo build errors; I copy-paste the error back to it like "address: <pasted error message>" and a bit more than half of the time it is able to respond with a working fix.
i find that rust’s architecture and design decisions give the LLM quite good guardrails and kind of keep it from doing anything too wonky. the issue arises in cases like these where the rust ecosystem is quite young and documentation/instruction can be poor, even for a human developer.
i think rust actually is quite well suited to agentic development workflows, it just needs to mature more.
-
No, the chances of being wrong 10x in a row are 2%. So the chances of being right at least once are 98%.
Ah, my bad, you're right. For it being consistently correct, I should have done 0.3^10 = 0.0000059049,
so the chances of it being right ten times in a row are less than one thousandth of a percent.
No wonder I couldn't get it to summarise my list of data right and it was always lying by the 7th row.
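For anyone checking the arithmetic, here's a quick sketch - it assumes the headline ~30% success rate and treats the ten attempts as independent, which is obviously a simplification:

# Back-of-the-envelope check (Python); assumes 30% per-task accuracy and independent attempts.
p_right = 0.3
p_wrong = 1 - p_right

print(p_wrong ** 10)      # ~0.028     -> wrong ten times in a row (roughly the 2% mentioned above)
print(1 - p_wrong ** 10)  # ~0.972     -> right at least once
print(p_right ** 10)      # ~0.0000059 -> right all ten times in a row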
-
i find that rust’s architecture and design decisions give the LLM quite good guardrails and kind of keep it from doing anything too wonky. the issue arises in cases like these where the rust ecosystem is quite young and documentation/instruction can be poor, even for a human developer.
i think rust actually is quite well suited to agentic development workflows, it just needs to mature more.
I agree. The agents also need to mature more to handle multi-level structures - work on a collection of smaller modules to get a larger system with more functionality. I can see the path forward for those tools, but the ones I have access to definitely aren't there yet.
-
In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user."
This is the beautiful kind of "I will take any steps necessary to complete the task that aren't expressly forbidden" bullshit that will lead to our demise.
It does not say a dog can not play basketball.
-
It does not say a dog can not play basketball.
"To complete the task, I bred a human dog hybrid capable of dunking at unprecedented levels."
-
I think it's lemmy users. I see a lot more LLM skepticism here than in the news feeds.
In my experience, LLMs are like the laziest, shittiest know-nothing bozo forced to complete a task with zero attention to detail and zero care about whether it's crap, just doing enough to sound convincing.
Wdym, I have seen researchers using it to aid their research significantly. You just need to verify some stuff it says.
-
Emotion > Facts. Most people have been trained to blindly accept things and cheer on what fits with their agenda. Like technobros exaggerating LLMs, or people like you misrepresenting LLMs as mere statistical word generators without intelligence. That's like saying a computer is just wires and switches, or missing the forest for the trees. Both are equally false.
Yet if it fits with the emotional needs or with dogma, then others will agree. It's a convenient and comforting "A vs B" worldview we've been trained to accept. And so the satisfying notion and the misinformation keep spreading.
LLMs tell us more about human intelligence and the human slop we've been generating. It tells us that most people are not that much more than statistical word generators.
Truth is bitter, and I hate it.
-