AI agents wrong ~70% of time: Carnegie Mellon study
-
Did you make it? Or did you prompt it? They ain't quite the same.
It calls Ollama with a prompt. It's a bit complex because it also renames, moves, and sorts stuff.
-
It's absolutely dangerous, but it doesn't have to work even a little to do damage; hell, it already has. Your thing just makes it sound much more capable than it is. And it is not.
Also, it's not AI.
semantics.
-
semantics.
No, it matters. You're pushing the lie they want pushed.
-
I won't tolerate Jan slander here. I know he's just a builder, but his life path has the highest probability of producing a great person!
I'd say Jan Botanist is also up there as being a pretty great person.
-
I'd say Jan Botanist is also up there as being a pretty great person.
Jan Refiner is up there for me.
-
For me as a software developer the accuracy is more in the 95%+ range.
On one hand, the built-in Copilot chat widget in IntelliJ basically replaces a lot of my Google queries.
On the other hand, it is rather fucking good at executing some rewrites that are a fucking chore to do manually but can easily be done by Copilot.
Imagine you have a script that initializes your DB with some test data. You have an INSERT INTO statement with lots of columns and rows, like so:
INSERT INTO (column1, ..., column_n)
VALUES row_1,
row_2,
row_n
Adding a new column with test data for each row is a PITA, but Copilot handles it without issue.
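To make the shape of that chore concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `users`, the columns, and the sample rows are all hypothetical and not from the comment; the only point is that adding one column means touching the column list and every single row.

```python
import sqlite3

# Hypothetical seed data: the table, columns, and rows are illustrative only.
COLUMNS = ("id", "name", "email")
ROWS = [
    (1, "Alice", "alice@example.com"),
    (2, "Bob", "bob@example.com"),
    (3, "Carol", "carol@example.com"),
]


def add_column(columns, rows, new_column, new_values):
    """Return the column list and rows extended by one column of test data."""
    if len(new_values) != len(rows):
        raise ValueError("need exactly one new value per row")
    new_rows = [row + (value,) for row, value in zip(rows, new_values)]
    return columns + (new_column,), new_rows


def seed(conn, table, columns, rows):
    """Create the table and load the rows with a parameterized INSERT."""
    # Identifiers are trusted constants here; never build them from user input.
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE {table} ({column_list})")
    conn.executemany(
        f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})", rows
    )


# The "new column for every row" edit, done once in code instead of by hand.
columns, rows = add_column(COLUMNS, ROWS, "role", ["admin", "user", "user"])
conn = sqlite3.connect(":memory:")
seed(conn, "users", columns, rows)
```

Keeping the seed data as a structure like this is one way to avoid the manual edit entirely, though it is a different setup from the hand-written SQL script the comment describes.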
Similarly, when writing unit tests you do a lot of edge-case testing, which means a bunch of almost identical tests with maybe one variable changing. At most you write one of those tests; after you name the next unit test, Copilot will auto-generate the rest. It's pretty good at guessing what you want to do in that test, at least with my naming scheme.
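For anyone who hasn't seen what "almost the same test with one variable changing" looks like, here is a rough sketch in table-driven form using Python's standard unittest module. This is a hand-written alternative layout, not what Copilot generates, and the `clamp` function and the case names are hypothetical examples.

```python
import unittest


def clamp(value, low, high):
    """Hypothetical function under test: limit value to the range [low, high]."""
    return max(low, min(value, high))


class ClampEdgeCases(unittest.TestCase):
    # Each case is the same test with only the input and expected value changing.
    CASES = [
        ("below_range", -5, 0, 10, 0),
        ("at_lower_bound", 0, 0, 10, 0),
        ("inside_range", 5, 0, 10, 5),
        ("at_upper_bound", 10, 0, 10, 10),
        ("above_range", 15, 0, 10, 10),
    ]

    def test_clamp_edge_cases(self):
        for name, value, low, high, expected in self.CASES:
            # subTest reports each failing case separately instead of stopping at the first.
            with self.subTest(name=name):
                self.assertEqual(clamp(value, low, high), expected)
```

Saved in a test module, this can be run with `python -m unittest`. Whether the cases live in a table like this or as separate generated test methods, a human still has to check that each expected value is actually right.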
So yeah, it's way overrated for many, many things, but for programming it's a pretty awesome productivity tool.
Keep doing what you do. Your company will pay me handsomely to throw out all your bullshit and write working code you can trust when you're done. If your company wants to have a product in the future that is.
-
It doesn't matter if you need a human to review. AI has no way of distinguishing between success and failure. Either way, a human will have to review 100% of those tasks.
A human can review something close to correct a lot better than starting the task from zero.
-
Keep doing what you do. Your company will pay me handsomely to throw out all your bullshit and write working code you can trust when you're done. If your company wants to have a product in the future that is.
Lmao, okay buddy, based on how many interviews I have sat in on, the chances that you are a worse programmer than me are much higher than you being better than me.
Being a pompous ass dismissive of new tooling makes your chances even worse.
-
OK, but what about the tech journalists who produced articles with those misunderstandings? Surely they know better, yet they still produce articles like this. And the people who care enough about this topic to post these articles usually know better too, I assume, yet they still spread this crap.
-
Yeah, it (in my case, ChatGPT) has been great for helping me along with functions I'm only passingly familiar with / trying to use in new ways.
One thing that really surprised me was that it gave me a robust, sensible, and (seemingly) well tuned-to-my-case checklist of things to inspect for a used car I intend to buy. I'm already mostly familiar with what I'm doing there, but it pointed to some things I might've overlooked / didn't know were points of concern for the specific vehicle I'm looking at.
-
They've done studies, you know. 30% of the time, it works every time.
-
I don't know why, but I am reminded of this clip about an eggless omelette: https://youtu.be/9Ah4tW-k8Ao
-
A human can review something close to correct a lot better than starting the task from zero.
It is a lot harder to notice incorrect information in review than to make sure it is correct while writing it.
-
Lmao, okay buddy, based on how many interviews I have sat in on, the chances that you are a worse programmer than me are much higher than you being better than me.
Being a pompous ass dismissive of new tooling makes your chances even worse.
I’ve been in the industry a while and your assessment is dead on.
As long as you’re not blindly committing the code, it’s a huge time saver for a number of mundane tasks.
It’s especially fantastic for writing throwaway tooling. Need data massaged a specific way? Ez pz. Need a script to execute an API call on each entry in a spreadsheet? No problem.
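As a rough idea of the kind of throwaway script meant here, this is a minimal sketch that loops over a spreadsheet exported as CSV and makes one call per row. The `call_api` function is a hypothetical stand-in that does no network I/O, and the sample data is made up; a real script would swap in an actual client and open a real file.

```python
import csv
import io


def call_api(entry):
    """Hypothetical stand-in for a real API call; replace with your own client code."""
    return {"id": entry["id"], "status": "ok"}


def process_rows(csv_file, call=call_api):
    """Call `call` once per spreadsheet row and collect results and failures."""
    results, failures = [], []
    for row in csv.DictReader(csv_file):
        try:
            results.append(call(row))
        except Exception as exc:  # throwaway script: record the failure and keep going
            failures.append((row, str(exc)))
    return results, failures


# Hypothetical spreadsheet contents; real use would pass an open() file handle instead.
SAMPLE = "id,name\n1,Alice\n2,Bob\n"
results, failures = process_rows(io.StringIO(SAMPLE))
```

Collecting failures instead of crashing on the first bad row is a design choice that suits one-off jobs, where you usually want to rerun only the rows that failed.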
The guy above you is a nutter. Not sure if people haven’t tried leveraging LLMs or what. It has a ton of faults, but it really does speed up the mundane work. Also, clearly the person is either brand new to the field or doesn’t even work in it. Otherwise they would have seen the barely functional shite that actual humans churn out.
Part of me wonders if code organization is going to start optimizing for interpretation by these models rather than humans.
-
Lmao, okay buddy, based on how many interviews I have sat in on, the chances that you are a worse programmer than me are much higher than you being better than me.
Being a pompous ass dismissive of new tooling makes your chances even worse.
The person who uses fancy autocomplete to write their code will be exactly the person who thinks they're better than everyone. Those traits are correlated.
-
Yeah, it (in my case, ChatGPT) has been great for helping me along with functions I'm only passingly familiar with / trying to use in new ways.
One thing that really surprised me was that it gave me a robust, sensible, and (seemingly) well tuned-to-my-case checklist of things to inspect for a used car I intend to buy. I'm already mostly familiar with what I'm doing there, but it pointed to some things I might've overlooked / didn't know were points of concern for the specific vehicle I'm looking at.
Pepperidge Farm remembers when you could just do a web search and get it answered in the first couple of results. Then the SEO wars happened...
-
When LLMs get it right, it's because they're summarizing a Stack Overflow or GitHub snippet they were trained on. But you lose all the benefits of other humans commenting on the context, pitfalls, and alternatives.
-
Yes, that's generally useless. It should not be shoved down people's throats. 30% accuracy still has its uses, especially if the result can be programmatically verified.
Run something with a 70% failure rate 10 times and, assuming the attempts are independent, you get a cumulative pass rate of about 97%.
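The arithmetic behind that retry argument is easy to check. Here is a small sketch that assumes every attempt is independent and that the verifier is itself reliable, neither of which is guaranteed in practice. The function names are hypothetical, not from any library.

```python
def cumulative_pass_rate(success_rate, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - success_rate) ** attempts


def retry_until_verified(task, verify, max_attempts):
    """Run `task` up to max_attempts times; return the first result `verify` accepts."""
    for attempt in range(1, max_attempts + 1):
        result = task()
        if verify(result):
            return result, attempt
    # Every attempt was rejected by the verifier.
    return None, max_attempts


# 30% success per try, 10 tries: 1 - 0.7 ** 10
rate = cumulative_pass_rate(0.30, 10)
print(f"{rate:.1%}")  # prints 97.2%
```

The whole scheme leans on `verify`: if a check can wrongly accept a bad result, rerunning the task more often just gives it more chances to slip one through.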
LLMs don't get tired and they can be run in parallel.
-
At least AI won't fire you.
DOGE has entered the chat
-
"...for multi-step tasks"
It's about agents, which implies multi-step work, since agents are meant to execute a series of tasks, as opposed to studies looking at base LLM performance.
-