AI agents wrong ~70% of time: Carnegie Mellon study
-
We have created the overconfident intern in digital form.
Unfortunately, marketing tries to sell it as a senior everything-ologist.
-
DocumentDB is not for OneDrive documents (PDFs and such). It's for "documents" as in serialized objects (JSON or BSON).
That's even better, I can just jam something in before it and churn the documents through an embedding model, thanks!
-
I use it for very specific tasks and give as much information as possible. I usually have to give it more feedback to get to the desired goal. For instance I will ask it how to resolve an error message. I've even asked it for some short python code. I almost always get good feedback when doing that. Asking it about basic facts works too like science questions.
One thing I have had problems with is if the error is sort of an oddball it will give me suggestions that don't work with my OS/app version even though I gave it that info. Then I give it feedback and eventually it will loop back to its original suggestions, so it couldn't come up with an answer.
I've also found differences between ChatGPT and MS Copilot, with ChatGPT usually giving better results.
-
please bro just one hundred more GPU and one more billion dollars of research, we make it good please bro
And let it suck up 10% or so of all of the power in the region.
-
The first half dozen times I tried AI for code, across the past year or so, it failed pretty much as you describe.
Finally, I hit on some things it can do. For me: keeping the instructions more general, not specifying certain libraries for instance, was the key to getting something that actually does something. Also, if it doesn't show you the whole program, get it to show you the whole thing, and make it fix its own mistakes so you can build on working code with later requests.
I've had good results being very specific, like "Generate some python 3 code for me that converts X to Y, recursively through all subdirectories, and converts the files in place."
-
It's absolutely dangerous but it doesn't have to work even a little to do damage; hell, it already has. Your thing just makes it sound much more capable than it is. And it is not.
Also, it's not AI.
Edit: and in a comment replying to this one, one of your fellow fanboys proved
everyone knows how they work
Wrong
the industrial revolution could be seen as dangerous, yet it brought the highest standard of living increase in centuries
-
So you’re saying the article’s measurements about AI agents being wrong 70% of the time are made up? Or is AI performance only measurable when the results help anti-AI narratives?
I mean, sure, in that the expectation is that the article is talking about AI in general. The cited paper is discussing LLMs and their ability to complete tasks. So, we have to agree that LLMs are what we mean by AI, and that their ability to complete tasks is a valid metric for AI. If we accept the marketing hype, then of course LLMs are exactly what we've been talking about with AI, and we've accepted LLMs' features and limitations as what AI is. If LLMs are prone to filling in with whatever closest fits the model without regard to accuracy, by accepting LLMs as what we mean by AI, then AI fits to its model without regard to accuracy.
-
I'm impressed you can make strides with Rust with AI. I am in a similar boat, except I've found LLMs are terrible with Rust.
I was 0/6 on various trials of AI for Rust over the past 6 months, then I caught a success. Turns out, I was asking it to use a difficult library - I can't make the thing I want work in that library either (library docs say it's possible, but...) when I posed a more open ended request without specifying the library to use, it succeeded - after a fashion. It will give you code with cargo build errors, I copy-paste the error back to it like "address: <pasted error message>" and a bit more than half of the time it is able to respond with a working fix.
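The paste-the-error-back routine described above can be written down as a loop. This is only a rough sketch of that workflow, not any tool's actual behavior: `fix_until_it_builds`, `ask_for_fix`, and `max_rounds` are hypothetical names, and `ask_for_fix` stands in for however you send a message to the model and apply its answer. The only real command used is `cargo build`, called through Python's `subprocess`.

```python
import subprocess
from typing import Callable, Tuple


def cargo_build() -> Tuple[bool, str]:
    """Run `cargo build` in the current directory; return (succeeded, stderr text)."""
    result = subprocess.run(["cargo", "build"], capture_output=True, text=True)
    return result.returncode == 0, result.stderr


def fix_until_it_builds(
    build: Callable[[], Tuple[bool, str]],
    ask_for_fix: Callable[[str], None],  # hypothetical: sends the prompt, applies the reply
    max_rounds: int = 5,
) -> bool:
    """Build, feed any error back as "address: <error>", and repeat.

    Gives up after `max_rounds` fix requests, or as soon as an error text
    repeats exactly, which suggests the model is going in circles.
    """
    seen_errors = set()
    for round_number in range(max_rounds + 1):
        ok, error_text = build()
        if ok:
            return True
        if round_number == max_rounds or error_text in seen_errors:
            return False
        seen_errors.add(error_text)
        ask_for_fix(f"address: {error_text}")
    return False
```

You would call it as `fix_until_it_builds(cargo_build, my_ask_function)`. The exact-match check on the error text is crude (a changed line number makes the error look new), so treat the repeat detection as a starting point at best.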
-
I mean, sure, in that the expectation is that the article is talking about AI in general. The cited paper is discussing LLMs and their ability to complete tasks. So, we have to agree that LLMs are what we mean by AI, and that their ability to complete tasks is a valid metric for AI. If we accept the marketing hype, then of course LLMs are exactly what we've been talking about with AI, and we've accepted LLMs' features and limitations as what AI is. If LLMs are prone to filling in with whatever closest fits the model without regard to accuracy, by accepting LLMs as what we mean by AI, then AI fits to its model without regard to accuracy.
Except you yourself just stated that it was impossible to measure performance of these things. When it’s favorable to AI, you claim it can’t be measured. When it’s unfavorable for AI, you claim of course it’s measurable. Your argument is so flimsy and your understanding so limited that you can’t even stick to a single idea. You’re all over the place.
-
I've had good results being very specific, like "Generate some python 3 code for me that converts X to Y, recursively through all subdirectories, and converts the files in place."
I have been more successful with baby steps like: "Write a python 3 program that converts X to Y." Tweak prompt until that's working as desired, then: "make it work recursively through all subdirectories" - and again tweak with specifics like converting the files in place, etc. Always very specific, also - force it to fix its own bugs so you can move forward with a clean example as you add complexity. Complexity seems to cap out at a couple of pages of code, at which point "Ooops, something went wrong."
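For reference, the end state of those baby steps tends to look something like the sketch below. "X to Y" is left open in the prompt, so the conversion here is a made-up stand-in (`tabs_to_spaces`), and `convert_tree_in_place` is likewise a hypothetical name, not something an LLM produced. It assumes the files are UTF-8 text.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def tabs_to_spaces(text: str) -> str:
    """Stand-in for the real "X to Y" conversion (purely illustrative)."""
    return text.replace("\t", "    ")


def convert_tree_in_place(root: Path, suffix: str, convert: Callable[[str], str]) -> int:
    """Convert every file under `root` whose name ends in `suffix`.

    Walks all subdirectories, rewrites each file in place, and returns
    the number of files whose content actually changed.
    """
    changed = 0
    # sorted() materializes the listing before any files are created below.
    for path in sorted(root.rglob(f"*{suffix}")):
        if not path.is_file():
            continue
        original = path.read_text(encoding="utf-8")
        converted = convert(original)
        if converted == original:
            continue
        # Write to a temp file in the same directory, then swap it in,
        # so a crash mid-write does not leave a half-converted file.
        fd, tmp_name = tempfile.mkstemp(dir=path.parent)
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(converted)
        os.replace(tmp_name, path)
        changed += 1
    return changed
```

Something like `convert_tree_in_place(Path("my_project"), ".txt", tabs_to_spaces)` would run it. Note that the swapped-in file gets the temp file's permissions, which is one of those details worth checking whether the code comes from an LLM or not.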
-
A junior developer actually learns from doing the job, an LLM only learns when they update the training corpus and develop an updated model.
an llm costs less, and won't complain when yelled at
-
The comparison is about the correctness of their work.
Their lives have nothing to do with it.
Human lives are the most important thing of all. Profits are irrelevant compared to human lives. I get that that's not how Bezos sees the world, but he's a monstrous outlier.
-
Run something with a 70% failure rate 10x and you get to a cumulative 98% pass rate.
LLMs don't get tired and they can be run in parallel.
What's 0.7^10?
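To put a number on it: 0.7^10 is about 0.028, so the chance of at least one pass in ten tries is about 97.2%. A minimal sketch of the arithmetic, assuming every run fails independently with probability 0.7 and that you can tell a passing run from a failing one when you see it (both are assumptions, not claims from the study; `cumulative_pass_rate` is just an illustrative name):

```python
def cumulative_pass_rate(failure_rate: float, runs: int) -> float:
    """Probability that at least one of `runs` independent attempts passes."""
    return 1.0 - failure_rate ** runs


# 0.7 ** 10 is roughly 0.0282, so this prints 0.9718
print(f"{cumulative_pass_rate(0.7, 10):.4f}")
```

So the quoted 98% is in the right neighborhood, but only under those assumptions. If the same model tends to fail the same task the same way every time, the runs are not independent and the real figure would likely be lower.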
-
Search AI in Lemmy and check out every article on it. It definitely is media spreading all the hate. And like this article is often some money yellow journalism
I think it's lemmy users. I see a lot more LLM skepticism here than in the news feeds.
In my experience, LLMs are like the laziest, shittiest know-nothing bozo forced to complete a task with zero attention to detail and zero care about whether it's crap, just doing enough to sound convincing.
-
That's not really helping though. The fact that you were transferred to them in the first place instead of directly to a human was an impediment.
Oh absolutely, nothing was gained, time was wasted. My wording was too charitable.
-
Except you yourself just stated that it was impossible to measure performance of these things. When it’s favorable to AI, you claim it can’t be measured. When it’s unfavorable for AI, you claim of course it’s measurable. Your argument is so flimsy and your understanding so limited that you can’t even stick to a single idea. You’re all over the place.
It's questionable to measure these things as being reflective of AI, because what AI is changes based on what piece of tech is being hawked as AI, because we're really bad at defining what intelligence is and isn't. You want to claim LLMs as AI? Go ahead, but you also adopt the problems of LLMs as the problems of AIs. Defining AI and thus its metrics is a moving target. When we can't agree on what it is, we can't agree on what it can do.
-
I actually have a fairly positive experience with AI (Copilot using Claude, specifically). Is it wrong a lot if you give it a huge task? Yes, so I don't do that, and I use it as a very targeted solution when I'm feeling very lazy. Is it fast? Also no. I could actually be faster than the AI in some cases.
But is it good when you've been working for 6h and you just don't have enough mental capacity for the rest of the day? Yes. You can prompt it specifically enough to get the desired result and just accept the correct responses. Is it always good? Not really, but good enough. Do I also suck after 3pm? Yes.
My main issue is actually the fact that it saves first and then asks you to pick whether you want to use it. Not usually a problem, but if it crashes the generated code stays, so that part sucks.
-
And let it suck up 10% or so of all of the power in the region.
And water
-
It's questionable to measure these things as being reflective of AI, because what AI is changes based on what piece of tech is being hawked as AI, because we're really bad at defining what intelligence is and isn't. You want to claim LLMs as AI? Go ahead, but you also adopt the problems of LLMs as the problems of AIs. Defining AI and thus its metrics is a moving target. When we can't agree on what it is, we can't agree on what it can do.
Again, you only say it’s a moving target to dispel anything favorable towards AI. Then you do a complete 180 when it’s negative reporting on AI. Makes your argument meaningless, if you can’t even stick to your own point.
-
an llm costs less, and won't complain when yelled at
Why would you ever yell at an employee unless you're bad at managing people? And you think you can manage an LLM better because it doesn't complain when you're obviously wrong?
-