Skip to content

AI agents wrong ~70% of time: Carnegie Mellon study

Technology
285 108 722
  • Again with dismissing the evidence of my own eyes!

    I wasn't asking it to do calculations, I was asking it to put the data into a super formulaic sentence. It was good at the first couple of rows then it would get stuck in a rut and start lying. It was crap. A seven year old would have done it far better, and if I'd told a seven year old that they had made a couple of mistakes and to check it carefully, they would have done.

    Again, I didn't read it in a fucking article, I read it on my fucking computer screen, so if you'd stop fucking telling me I'm stupid for using it the way it fucking told me I could use it, or that I'm stupid for believing what the media tell me about LLMs, when all I'm doing is telling you my own experience, you'd sound a lot less like a desperate troll or someone who is completely unable to assimilate new information that differs from your dogma.

    What does "I give it data to put in a formulaic sentence." mean here

    Why not just share the details. I often find a lot of people saying it's doing crazy things and never like to share the details. It's very similar to discussing things with Trump supporters who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that trumpets do. It's wild how similar it is.

    And yes asking to do things like iterate over rows isn't how it works. It's getting better but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence but it's nowhere near what its strength is.

  • Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.

    Dunno. Asking 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

  • What does "I give it data to put in a formulaic sentence." mean here

    Why not just share the details. I often find a lot of people saying it's doing crazy things and never like to share the details. It's very similar to discussing things with Trump supporters who do the same shit when pressed on details about stuff they say occurs. Like the same "you're a troll for asking for evidence of my claim" that trumpets do. It's wild how similar it is.

    And yes asking to do things like iterate over rows isn't how it works. It's getting better but that's not what it's primarily used for. It could be but isn't. It only catches so many tokens. It's getting better and has some persistence but it's nowhere near what its strength is.

    I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE are strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

    The solution to the problem was to realise that an LLM cannot be trusted for accuracy even if the first few results are completely accurate, the bullshit well creep in. Don't trust the LLM. Check every fucking thing.

    In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

    The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.

  • Dunno. Asking 10 humans at random to do a task and probably one will do it better than AI. Just not as fast.

    You're better off asking one human to do the same task ten times. Humans get better and faster at things as they go along. Always slower than an LLM, but LLMs get more and more likely to veer off on some flight of fancy, further and further from reality, the more it says to you. The chances of it staying factual in the long term are really low.

    It's a born bullshitter. It knows a little about a lot, but it has no clue what's real and what's made up, or it doesn't care.

    If you want some text quickly, that sounds right, but you genuinely don't care whether it is right at all, go for it, use an LLM. It'll be great at that.

  • This post did not contain any content.

    Reading with CEO mindset. 3 out of 10 employees can be fired.

  • I would be in breach of contract to tell you the details. How about you just stop trying to blame me for the clear and obvious lies that the LLM churned out and start believing that LLMs ARE are strikingly fallible, because, buddy, you have your head so far in the sand on this issue it's weird.

    The solution to the problem was to realise that an LLM cannot be trusted for accuracy even if the first few results are completely accurate, the bullshit well creep in. Don't trust the LLM. Check every fucking thing.

    In the end I wrote a quick script that broke the input up on tab characters and wrote the sentence. That's how formulaic it was. I regretted deeply trying to get an LLM to use data.

    The frustrating thing is that it is clearly capable of doing the task some of the time, but drifting off into FANTASY is its strong suit, and it doesn't matter how firmly or how often you ask it to be accurate or use the input carefully. It's going to lie to you before long. It's an LLM. Bullshitting is what it does. Get it to do ONE THING only, then check the fuck out of its answer. Don't trust it to tell you the truth any more than you would trust Donald J Trump to.

    This is crazy. I've literally been saying they are fallible. You're saying your professional fed and LLM some type of dataset. So I can't really say what it was you're trying to accomplish but I'm just arguing that trying to have it process data is not what they're trained to do. LLM are incredible tools and I'm tired of trying to act like they're not because people keep using them for things they're not built to do. It's not a fire and forget thing. It does need to be supervised and verified. It's not exactly an answer machine. But it's so good at parsing text and documents, summarizing, formatting and acting like a search engine that you can communicate with rather than trying to grok some arcane sentence. Its power is in language applications.

    It is so much fun to just play around with and figure out where it can help. I'm constantly doing things on my computer it's great for instructions. Especially if I get a problem that's kind of unique and needs a big of discussion to solve.

  • This is crazy. I've literally been saying they are fallible. You're saying your professional fed and LLM some type of dataset. So I can't really say what it was you're trying to accomplish but I'm just arguing that trying to have it process data is not what they're trained to do. LLM are incredible tools and I'm tired of trying to act like they're not because people keep using them for things they're not built to do. It's not a fire and forget thing. It does need to be supervised and verified. It's not exactly an answer machine. But it's so good at parsing text and documents, summarizing, formatting and acting like a search engine that you can communicate with rather than trying to grok some arcane sentence. Its power is in language applications.

    It is so much fun to just play around with and figure out where it can help. I'm constantly doing things on my computer it's great for instructions. Especially if I get a problem that's kind of unique and needs a big of discussion to solve.

    it’s so good at parsing text and documents, summarizing

    No. Not when it matters. It makes stuff up. The less you carefully check every single fucking thing it says, the more likely you are to believe some lies it subtly slipped in as it went along. If truth doesn't matter, go ahead and use LLMs.

    If you just want some ideas that you're going to sift through, independently verify and check for yourself with extreme skepticism as if Donald Trump were telling you how to achieve world peace, great, you're using LLMs effectively.

    But if you're trusting it, you're doing it very, very wrong and you're going to get humiliated because other people are going to catch you out in repeating an LLM's bullshit.

  • it’s so good at parsing text and documents, summarizing

    No. Not when it matters. It makes stuff up. The less you carefully check every single fucking thing it says, the more likely you are to believe some lies it subtly slipped in as it went along. If truth doesn't matter, go ahead and use LLMs.

    If you just want some ideas that you're going to sift through, independently verify and check for yourself with extreme skepticism as if Donald Trump were telling you how to achieve world peace, great, you're using LLMs effectively.

    But if you're trusting it, you're doing it very, very wrong and you're going to get humiliated because other people are going to catch you out in repeating an LLM's bullshit.

    If it's so bad as if you say, could you give an example of a prompt where it'll tell you incorrect information.

  • If it's so bad as if you say, could you give an example of a prompt where it'll tell you incorrect information.

    It's like you didn't listen to anything I ever said, or you discounted everything I said as fiction, but everything your dear LLM said is gospel truth in your eyes. It's utterly irrational. You have to be trolling me now.

  • It's like you didn't listen to anything I ever said, or you discounted everything I said as fiction, but everything your dear LLM said is gospel truth in your eyes. It's utterly irrational. You have to be trolling me now.

    Should be easy if it's that bad though

  • Should be easy if it's that bad though

    I already told you my experience of the crapness of LLMs and even explained why I can't share the prompt etc. You clearly weren't listening or are incapable of taking in information.

    There's also all the testing done by the people talked about in the article we're discussing which you're also irrationally dismissing.

    You have extreme confirmation bias.

    Everything you hear that disagrees with your absurd faith in the accuracy of the extreme blagging of LLMs gets dismissed for any excuse you can come up with.

  • I already told you my experience of the crapness of LLMs and even explained why I can't share the prompt etc. You clearly weren't listening or are incapable of taking in information.

    There's also all the testing done by the people talked about in the article we're discussing which you're also irrationally dismissing.

    You have extreme confirmation bias.

    Everything you hear that disagrees with your absurd faith in the accuracy of the extreme blagging of LLMs gets dismissed for any excuse you can come up with.

    You're projecting here. I'm asking you to give an example of any prompt. You're saying it's so bad that it needs to be babysat because it's errors. I'll only asking for your to give an example and you're saying that's confirmation bias and acting like I'm being religiously ignorant

  • You're projecting here. I'm asking you to give an example of any prompt. You're saying it's so bad that it needs to be babysat because it's errors. I'll only asking for your to give an example and you're saying that's confirmation bias and acting like I'm being religiously ignorant

    This is you

  • Windows 11 finally overtakes Windows 10 [in marketshare]

    Technology technology
    32
    1
    63 Stimmen
    32 Beiträge
    132 Aufrufe
    H
    Yeah, and its most likely only due to them killing Windows 10 in the fall, which means a lot of companies have been working hard this year to replace a ton of computers before October. Anyone who has been down this road with 7 to 10 knows it will just cost more money if you need to continue support after that. They sell you a new license thats good for a year that will allow updates to continue. It doubles in cost every year after.
  • 111 Stimmen
    24 Beiträge
    87 Aufrufe
    O
    Ingesting all the artwork you ever created by obtaining it illegally and feeding it into my plagarism remix machine is theft of your work, because I did not pay for it. Separately, keeping a copy of this work so I can do this repeatedly is also stealing your work. The judge ruled the first was okay but the second was not because the first is "transformative", which sadly means to me that the judge despite best efforts does not understand how a weighted matrix of tokens works and that while they may have some prevention steps in place now, early models showed the tech for what it was as it regurgitated text with only minor differences in word choice here and there. Current models have layers on top to try and prevent this user input, but escaping those safeguards is common, and it's also only masking the fact that the entire model is built off of the theft of other's work.
  • How the Rubin Observatory Will Reinvent Astronomy

    Technology technology
    2
    1
    53 Stimmen
    2 Beiträge
    19 Aufrufe
    M
    Giant twice-reflecting mirror of low-expansion borrosilicate covered in pure silver and a giant digital camera with filters.
  • 288 Stimmen
    46 Beiträge
    318 Aufrufe
    G
    Just for the record, even in Italy the winter tires are required for the season (but we can just have chains on board and we are good). Double checking and it doesn’t seem like it? Then again I don’t live in Italy. Here in Sweden you’ll face a fine of ~2000kr (roughly 200€) per tire on your vehicle that is out of spec. https://www.europe-consommateurs.eu/en/travelling-motor-vehicles/motor-vehicles/winter-tyres-in-europe.html Well, I live in Italy and they are required at least in all the northern regions and over a certain altitude in all the others from 15th November to 15th April. Then in some regions these limits are differents as you have seen. So we in Italy already have a law that consider a different situation for the same rule. Granted that you need to write a more complex law, but in the end it is nothing impossible. …and thus it is much simpler to handle these kinds of regulations at a lower level. No need for everyone everywhere to agree, people can have rules that work for them where they live, folks are happier and don’t have to struggle against a system run by bureaucrats so far away they have no idea what reality on the ground is (and they can’t, it’s impossible to account for every scenario centrally). Even on a municipal level certain regulations differ, and that’s completely ok! So it is not that difficult, just write a directive that say: "All the member states should make laws that require winter tires in every place it is deemed necessary". I don't really think that making EU more integrated is impossibile
  • How a Spyware App Compromised Assad’s Army

    Technology technology
    2
    1
    41 Stimmen
    2 Beiträge
    21 Aufrufe
    S
    I guess that's why you pay your soldiers. In the early summer of 2024, months before the opposition launched Operation Deterrence of Aggression, a mobile application began circulating among a group of Syrian army officers. It carried an innocuous name: STFD-686, a string of letters standing for Syria Trust for Development. ... The STFD-686 app operated with disarming simplicity. It offered the promise of financial aid, requiring only that the victim fill out a few personal details. It asked innocent questions: “What kind of assistance are you expecting?” and “Tell us more about your financial situation.” ... Determining officers’ ranks made it possible for the app’s operators to identify those in sensitive positions, such as battalion commanders and communications officers, while knowing their exact place of service allowed for the construction of live maps of force deployments. It gave the operators behind the app and the website the ability to chart both strongholds and gaps in the Syrian army’s defensive lines. The most crucial point was the combination of the two pieces of information: Disclosing that “officer X” was stationed at “location Y” was tantamount to handing the enemy the army’s entire operating manual, especially on fluid fronts like those in Idlib and Sweida.
  • No Frills PCB Brings USB-C Power To The Breadboard

    Technology technology
    3
    1
    0 Stimmen
    3 Beiträge
    8 Aufrufe
    coelacanthus@lemmy.kde.socialC
    and even without it, you'll get 150mA @ 5V by default out of the USB 3 host upstream and up to 900mA with some pretty basic USB negotiation in a protocol that dates from USB 1.0 That's wrong. With USB Type-C, you can get the power up to 3A @ 5V with just two 5.1kΩ resistor on CC pins.
  • 0 Stimmen
    2 Beiträge
    12 Aufrufe
    A
    I bet that information was already available to business owners. In other words, they totally knew it was you complaining about the toilet paper they used for example.
  • 0 Stimmen
    2 Beiträge
    9 Aufrufe
    T
    Wow, that's really concerning! It's crazy how these breaches can lead to such massive losses. If anyone's dealing with crypto fraud, I’ve heard Segev LLP is a solid firm that helps people and companies navigate these situations. Stay safe, everyone!