
AI agents wrong ~70% of time: Carnegie Mellon study

Technology
  • This is the same kind of short-sighted dismissal I see a lot in the religion vs science argument. When they hinge their pro-religion stance on the things science can’t explain, they’re defending an ever diminishing territory as science grows to explain more things. It’s a stupid strategy with an expiration date on your position.

    All of the anti-AI positions that hinge on the low quality or reliability of the output are defending an increasingly diminished stance as the AIs are further refined. And I simply don’t believe that the majority of the people making this argument actually care about the quality of the output. Even when it gets to the point of producing better output than humans across the board, these folks are still going to oppose it regardless. Why not just openly oppose it in general, instead of pinning your position to an argument that grows increasingly irrelevant by the day?

    DeepSeek exposed the same issue with the anti-AI people dedicated to the environmental argument. We were shown proof that there’s significant progress in the development of efficient models, and it still didn’t change any of their minds. Because most of them don’t actually care about the environmental impacts. It’s just an anti-AI talking point that resonated with them.

    The more baseless these anti-AI stances get, the more it seems to me that it’s a lot of people afraid of change and afraid of the fundamental economic shifts this will require, but they’re embarrassed or unable to articulate that stance. And it doesn’t help that the luddites haven’t been able to predict a single development. Just constantly flailing to craft a new argument to criticize the current models and tech. People are learning not to take these folks seriously.

    Maybe the marketers should be a bit more picky about what they slap "AI" on, and maybe decision makers should be a little less eager to follow whatever Better Autocomplete spits out, but maybe that's just me and we really should be pretending that all these algorithms really have made humans obsolete and that generating convincing language is better than correspondence with reality.

  • Maybe the marketers should be a bit more picky about what they slap "AI" on, and maybe decision makers should be a little less eager to follow whatever Better Autocomplete spits out, but maybe that's just me and we really should be pretending that all these algorithms really have made humans obsolete and that generating convincing language is better than correspondence with reality.

    I’m not sure the anti-AI marketing stance is any more solid of a position. Though it’s probably easier to defend, since it’s so vague and not based on anything measurable.

  • I’m not sure the anti-AI marketing stance is any more solid of a position. Though it’s probably easier to defend, since it’s so vague and not based on anything measurable.

    Calling AI measurable is somewhat unfounded. Between not having a coherent, agreed-upon definition of what does and does not constitute an AI (we are, after all, discussing LLMs as though they were AGI), and the difficulty that exists in discussing the qualifications of human intelligence, saying that a given metric covers how well a thing is an AI isn't really founded on anything but preference. We could, for example, say that mathematical ability is indicative of intelligence, but claiming FLOPS is a proxy for intelligence falls rather flat. We can measure things about the various algorithms, but that's an awful long ways off from talking about AI itself (unless we've bought into the marketing hype).

  • Calling AI measurable is somewhat unfounded. Between not having a coherent, agreed-upon definition of what does and does not constitute an AI (we are, after all, discussing LLMs as though they were AGI), and the difficulty that exists in discussing the qualifications of human intelligence, saying that a given metric covers how well a thing is an AI isn't really founded on anything but preference. We could, for example, say that mathematical ability is indicative of intelligence, but claiming FLOPS is a proxy for intelligence falls rather flat. We can measure things about the various algorithms, but that's an awful long ways off from talking about AI itself (unless we've bought into the marketing hype).

    So you’re saying the article’s measurements about AI agents being wrong 70% of the time is made up? Or is AI performance only measurable when the results help anti-AI narratives?

  • This post did not contain any content.

    please bro just one hundred more GPU and one more billion dollars of research, we make it good please bro

  • It’s usually vastly easier to verify an answer than posit one, if you have the patience to do so.

    I usually write 3x the code to test the code itself. Verification is often harder than implementation.

    It really depends on the context. Sometimes there are domains which require solving problems in NP, but where it turns out that most of these problems are actually not hard to solve by hand with a bit of tinkering. SAT solvers might completely fail, but humans can do it. Often it turns out that this means there's a better algorithm that can exploit commonalities in the data. But a brute force approach might just be to give it to an LLM and then verify its answer. Verifying solutions to NP problems is easy, as sketched below.

    (This is speculation.)
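
    A minimal sketch of that last point, assuming a CNF formula represented as a list of clauses (each clause a list of signed integers, e.g. 3 for x3 and -3 for NOT x3); the names are illustrative, not from any particular SAT library. Checking a candidate assignment is one linear pass, while finding one is the part that can blow up:

        def verify_assignment(clauses, assignment):
            """Check a candidate assignment in time linear in the formula size."""
            for clause in clauses:
                # A clause is satisfied if any of its literals evaluates to true.
                if not any((lit > 0) == assignment[abs(lit)] for lit in clause):
                    return False
            return True

        # (x1 OR NOT x2) AND (x2 OR x3), with a candidate answer, e.g. from an LLM.
        formula = [[1, -2], [2, 3]]
        candidate = {1: True, 2: False, 3: True}
        print(verify_assignment(formula, candidate))  # True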

  • being able to do 30% of tasks successfully is already useful.

    If you have a good testing program, it can be.

    If you use AI to write the test cases...? I wouldn't fly on that airplane.

    obviously

  • Run something with a 70% failure rate 10x and you get to a cumulative ~97% pass rate.
    LLMs don't get tired and they can be run in parallel.

    The problem is they are not i.i.d., so this doesn't really work. It works a bit, which is in my opinion why chain-of-thought is effective (it gives the LLM a chance to posit a couple answers first). However, we're already looking at "agents," so they're probably already doing chain-of-thought.
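
    For what it's worth, the arithmetic in the quoted claim only works under the independence assumption this reply disputes; a quick back-of-the-envelope check in plain Python:

        fail_rate = 0.7
        for attempts in (1, 3, 5, 10):
            # P(at least one success) if every run is an independent draw
            p_any_pass = 1 - fail_rate ** attempts
            print(f"{attempts:>2} attempts: {p_any_pass:.1%}")
        # 10 attempts: 97.2% -- but if the model keeps making the same correlated
        # mistake, extra runs add far less than this.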

  • I have actually been doing this lately: iteratively prompting AI to write software and fix its errors until something useful comes out. It's a lot like machine translation. I speak fluent C++ but I don't speak Rust, yet I can hammer away on the AI (with English-language prompts) until it produces passable Rust for something I could write for myself in C++ in half the time and effort.

    I also don't speak Finnish, but Google Translate can take what I say in English and put it into at least somewhat comprehensible Finnish without egregious translation errors most of the time.

    Is this useful? When C++ is getting banned for "security concerns" and Rust is the required language, it's at least a little helpful.

    I'm impressed you can make strides with Rust with AI. I am in a similar boat, except I've found LLMs are terrible with Rust.

  • No, it matters. You're pushing the lie they want pushed.

    Hitler liked to paint; that doesn't make painting wrong. The fact that big tech is pushing AI isn't evidence against the utility of AI.

    The fact that common parlance these days is to call machine learning "AI" doesn't matter to me in the slightest. Do you have a definition of "intelligence"? Do you object when pathfinding is called AI? Or STRIPS? Or bots in a video game? Dare I say it, the main difference between those AIs and LLMs is their generality -- so why not just call it GAI at this point, tbh. This is a question of semantics, so it really doesn't matter to the deeper question. Doesn't matter if you call it AI or not; LLMs work the same way either way.

  • So you’re saying the article’s measurements about AI agents being wrong 70% of the time is made up? Or is AI performance only measurable when the results help anti-AI narratives?

    I would definitely bet it's made up and poorly designed.

    I wish that weren't the case because having actual data would be nice, but these are almost always funded with some sort of intentional slant, for example nic vape safety where they clearly don't use the product sanely and then make wild claims about how there's lead in the vapes!

    Homie, you're fucking running the shit completely dry for longer than any human could possibly actually hit the vape; no shit it's producing carcinogens.

    Go burn a bunch of paper and directly inhale the smoke and tell me paper is dangerous.

  • I would definitely bet it's made up and poorly designed.

    I wish that weren't the case, because having actual data would be nice, but these studies are almost always funded with some sort of intentional slant. For example, nic vape safety studies where they clearly don't use the product sanely and then make wild claims about how there's lead in the vapes!

    Homie, you're fucking running the shit completely dry for longer than any human could possibly actually hit the vape; no shit it's producing carcinogens.

    Go burn a bunch of paper and directly inhale the smoke and tell me paper is dangerous.

    Agreed. 70% is astoundingly high for today’s models. Something stinks.

  • We have created the overconfident intern in digital form.

    Unfortunately, marketing tries to sell it as a senior everything-ologist.

  • DocumentDB is not for OneDrive documents (PDFs and such). It's for "documents" as in serialized objects (JSON or BSON).

    That's even better, I can just jam something in before it and churn the documents through an embedding model, thanks!
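
    A rough sketch of that "jam something in before it" idea, assuming the pypdf package for text extraction; embed_text() is a stand-in for whatever embedding model gets used, and the document shape is only an illustration:

        import json
        from pypdf import PdfReader   # assumes pypdf is available for PDF text extraction

        def embed_text(text: str) -> list[float]:
            """Placeholder: call your embedding model of choice here."""
            raise NotImplementedError

        def pdf_to_document(path: str) -> str:
            # Pull the text out of the PDF, embed it, and package both as one
            # serialized "document" ready for a DocumentDB-style store.
            text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
            doc = {"source": path, "text": text, "embedding": embed_text(text)}
            return json.dumps(doc)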

  • This post did not contain any content.

    I use it for very specific tasks and give as much information as possible. I usually have to give it more feedback to get to the desired goal. For instance, I will ask it how to resolve an error message. I've even asked it for some short Python code. I almost always get good feedback when doing that. Asking it about basic facts works too, like science questions.

    One thing I have had problems with: if the error is sort of an oddball, it will give me suggestions that don't work with my OS/app version, even though I gave it that info. Then I give it feedback, and eventually it will loop back to its original suggestions, so it couldn't come up with an answer.

    I've also found differences between ChatGPT and MS Copilot, with ChatGPT usually giving better results.

  • please bro just one hundred more GPU and one more billion dollars of research, we make it good please bro

    And let it suck up 10% or so of all of the power in the region.

  • The first half dozen times I tried AI for code, across the past year or so, it failed pretty much as you describe.

    Finally, I hit on some things it can do. For me, keeping the instructions more general - not specifying certain libraries, for instance - was the key to getting something that actually does something. Also, if it doesn't show you the whole program, get it to show you the whole thing, and make it fix its own mistakes so you can build on working code with later requests.

    I've had good results being very specific, like "Generate some python 3 code for me that converts X to Y, recursively through all subdirectories, and converts the files in place."
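
    For illustration, the sort of script a prompt like that might produce; the concrete conversion here (CRLF line endings to LF in .txt files) is a stand-in for the "X to Y" placeholder, not what the commenter actually asked for:

        from pathlib import Path

        def convert_in_place(root: str) -> None:
            for path in Path(root).rglob("*.txt"):      # recurse through all subdirectories
                data = path.read_bytes()
                converted = data.replace(b"\r\n", b"\n")
                if converted != data:
                    path.write_bytes(converted)         # overwrite the file in place

        if __name__ == "__main__":
            convert_in_place(".")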

  • It's absolutely dangerous, but it doesn't have to work even a little to do damage; hell, it already has. Your thing just makes it sound much more capable than it is. And it is not.

    Also, it's not AI.

    Edit: and in a comment replying to this one, one of your fellow fanboys proved "everyone knows how they work" wrong.

    The industrial revolution could be seen as dangerous, yet it brought the largest increase in standard of living in centuries.

  • So you’re saying the article’s measurements about AI agents being wrong 70% of the time is made up? Or is AI performance only measurable when the results help anti-AI narratives?

    I mean, sure, in that the expectation is that the article is talking about AI in general. The cited paper is discussing LLMs and their ability to complete tasks. So we have to agree that LLMs are what we mean by AI, and that their ability to complete tasks is a valid metric for AI. If we accept the marketing hype, then of course LLMs are exactly what we've been talking about with AI, and we've accepted LLMs' features and limitations as what AI is. And if LLMs are prone to filling in whatever best fits the model without regard to accuracy, then by accepting LLMs as what we mean by AI, we accept that AI fills things in without regard to accuracy.

  • I'm impressed you can make strides with Rust with AI. I am in a similar boat, except I've found LLMs are terrible with Rust.

    I was 0/6 on various trials of AI for Rust over the past 6 months, then I caught a success. Turns out, I was asking it to use a difficult library - I can't make the thing I want work in that library either (library docs say it's possible, but...). When I posed a more open-ended request without specifying the library to use, it succeeded - after a fashion. It will give you code with cargo build errors; I copy-paste the error back to it like "address: <pasted error message>", and a bit more than half of the time it is able to respond with a working fix.
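
    The copy-paste loop described here is simple enough to script. A rough sketch, where ask_llm() is a stand-in for whatever chat interface is in use (not a real API), and a person still reviews and applies each suggested fix by hand:

        import subprocess

        def ask_llm(prompt: str) -> str:
            """Placeholder: send the prompt to the LLM of choice and return its reply."""
            raise NotImplementedError

        def build_fix_loop(max_rounds: int = 5) -> bool:
            for _ in range(max_rounds):
                result = subprocess.run(["cargo", "build"], capture_output=True, text=True)
                if result.returncode == 0:
                    return True                              # it compiles; still review it yourself
                # Same move as above: hand the compiler output straight back to the model.
                print(ask_llm(f"address: {result.stderr}"))  # apply the fix by hand, then loop again
            return False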

  • 26 votes
    11 posts
    34 views
    Absolute horseshit. Bulbs don't have microphones. If they did, any junior security hacker could sniff out the traffic and post about it for cred. The article quickly pivots to TP-Link and other devices exposing certificates. That has nothing to do with surveillance and everything to do with incompetent programming. Then it swings over to Matter and makes a bunch of incorrect assertions I don't even care to correct. Also, all the links are to articles on the same site, every single one of which is easily refutable crap. Yes, there are privacy tradeoffs with connected devices, but this article is nothing but hot clickbait garbage.
  • 336 votes
    19 posts
    76 views
    What I'm speaking about is that it should be impossible to do some things. If it's possible, they will be done, and there's nothing you can do about it. To solve the problem of twiddled social media (and moderation used to assert dominance) we need a decentralized system of 90s Web reimagined, and the Fediverse doesn't deliver it - if Facebook and Reddit are feudal states, then the Fediverse is a confederation of smaller feudal entities.

    A post, a person, a community, a reaction, and a change (by a moderator or by the user) should be global entities (with global identifiers, so that the object with id #0000001a2b3c4d6e7f890 would be the same object today or 10 years later on every server storing it), replicated over a network of servers similarly to Usenet (and to an IRC network, but in an IRC network servers are trusted, so it's not a good example for a global system). Really bad posts (or those by persons with a history of posting such) should be banned at the server level by everyone. The rest should be moderated by moderator reactions/changes of a certain type.

    Ideally, for pooling of resources and resilience, servers would be separated by type into storage nodes (I think the name says it; FTP servers can do the job, but no need to be limited by that), index nodes (scraping many storage nodes, giving out results in a structured format fit for any user representation, say, as a sequence of posts in one community, or a list of communities found by tag, or ..., and possibly being connected into one DHT for Kademlia-like search, since no single index node will have everything), and (like in torrents?) tracker nodes for these and for identities. I think a torrent-like announce-retrieve service is enough: to return a list of storage nodes storing, say, a specified partition (a subspace of object identifiers, to make looking for something at least possibly efficient), or return a list of index nodes, or return a bunch of certificates and keys for an identity (somehow cryptographically connected to the global identifier of a person). So when a storage node comes online, it announces itself to a bunch of such trackers; similarly with index nodes, similarly with a user. One can also have a NOSTR-like service for real-time notifications by users. This way you'd have a global untrusted pooled infrastructure, allowing you to replace many platforms, with common data, identities, and services.

    Objects in storage and index services can be, say, in a format including a set of tags and then the body. So a specific application needing to show only data related to it would just search on index services and display only objects with tags of, say, "holo_ns:talk.bullshit.starwars" and "holo_t:post", like a sequence of posts with the ability to comment; or maybe it would search objects with tags "holo_name:My 1999-like Star Wars holopage" and "holo_t:page" and display the links like search results in Google, and then clicking on one you'd see something presented like a webpage, except links would lead to global identifiers (or tag expressions interpreted by the particular application, who knows). (An index service may return, say, an array of objects, each with an identifier, tags, a list of locations on storage nodes where it's found or even BitTorrent magnet links, and possibly a free-form description; then the user application can unify responses from a few such services to avoid repetitions, maybe sort them, represent them as needed, and so on.)

    The user applications for that common infrastructure can be different at the same time: some like Facebook, some like ICQ, some like a web browser, some like a newsreader. (Star Wars is not a random reference; my whole habit of imagining tech stuff comes from trying to imagine a science-fiction world of the future, so yeah, this may seem like passive dreaming, and it is.)
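
    To make that object format concrete, a minimal sketch of a globally identified, tagged record; the "holo_" tags and the identifier come from the comment above, while the field names and everything else are illustrative assumptions, not an existing protocol:

        from dataclasses import dataclass, field

        @dataclass
        class StoredObject:
            global_id: str                                        # stable forever, on every server
            tags: set[str] = field(default_factory=set)
            body: bytes = b""
            locations: list[str] = field(default_factory=list)    # storage nodes or magnet links

        post = StoredObject(
            global_id="#0000001a2b3c4d6e7f890",
            tags={"holo_ns:talk.bullshit.starwars", "holo_t:post"},
            body=b"...",
        )
        # An index node would answer queries like "all objects tagged holo_t:post in
        # this namespace" by returning such records gathered from many storage nodes.
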
  • Understanding the Debate on AI in Electronic Health Records

    Technology
    23 votes
    5 posts
    26 views
    Well yeah, exactly why I said "the same risk". Ideally it's going to be in the same systems... and assuming no one is stupid enough (or the laws don't let them) to attach it to the publicly accessible forms of existing AIs, it's not a new additional risk, just the same one. (Though those assumptions are largely their own risks.)
  • YouTube’s new anti-adblock measures

    Technology
    217 votes
    57 posts
    214 views
    I wish I could create playlists on Nebula.
  • IRS tax filing software released to the people as free software

    Technology
    288 votes
    14 posts
    39 views
    Only if you're a scumbag/useful idiot.
  • 33 votes
    2 posts
    20 views
    rooki@lemmy.world
    Whoa, in 2 years? That will definitely not be forgotten until then...
  • 168 votes
    11 posts
    46 views
    Law enforcement officer
  • 0 votes
    2 posts
    9 views
    It's a shame. AI has potential but most people just want to exploit its development for their own gain.