ChatGPT 'got absolutely wrecked' by Atari 2600 in beginner's chess match — OpenAI's newest model bamboozled by 1970s logic
-
An LLM is a poor computational/predictive paradigm for playing chess.
Actually, a very specific model (gpt-3.5-turbo-instruct) was pretty good at chess (around 1700 Elo, if I remember correctly).
-
Fair point.
I liked the "upgraded autocompletion", you know, a completion based on the context, just before they pushed it too far with 20 lines of nonsense...
Now I'm thinking of a way of doing the thing, then I receive a 20-line suggestion.
So I'm checking whether that makes sense, losing my momentum, only to realize the suggestion is calling shit that doesn't exist...
Screw that.
The amount of garbage it spits out in autocomplete is distracting. If it's constantly making me 5-10% less productive the many times it's wrong, it needs to save me a lot of time when it is right, and generally, I haven't found that it does.
Yesterday I tried to prompt it to change around 20 call sites for a function whose signature I had changed. Easy, boring and repetitive, something a junior could easily do. And all the models were absolutely clueless about it (using Copilot).
-
Does the author think ChatGPT is in fact an AGI? It's a chatbot. Why would it be good at chess? It's like saying an Atari 2600 running a dedicated chess program can beat Google Maps at chess.
OpenAI has been talking about AGI for years, implying that they are getting closer to it with their products.
Not to even mention all the hype created by the techbros around it.
-
All AIs are the same. They're just scraping content from GitHub, Stack Overflow etc. with a bunch of guardrails slapped on to spew out sentences that conform to their training data, but there is no intelligence. They're super handy for basic code snippets, but anyone using them for anything remotely complex or nuanced will regret it.
I've used agents for implementing entire APIs and front-ends from the ground up with my own customizations and nuances.
I will say that, for my pedantic needs, it typically only gets about 80-90% of the way there so I still have to put fingers to code, but it definitely saves a boat load of time in those instances.
-
OpenAI has been talking about AGI for years, implying that they are getting closer to it with their products.
Not to even mention all the hype created by the techbros around it.
Hey, I didn't say anywhere that corporations don't lie to promote their product, did I?
-
All AIs are the same. They're just scraping content from GitHub, Stack Overflow etc. with a bunch of guardrails slapped on to spew out sentences that conform to their training data, but there is no intelligence. They're super handy for basic code snippets, but anyone using them for anything remotely complex or nuanced will regret it.
One of my mates generated an entire website using Gemini. It was a React web app that tracks inventory for trading card dealers. It actually did come out functional and well-polished. That being said, the AI really struggled with several aspects of the project that humans would not:
- It left database secrets in the code
- The design of the website meant that it was impossible to operate securely
- The quality of the code itself was hot garbage—unreadable and undocumented nonsense that somehow still worked
- It did not break the code into multiple files. It piled everything into a single file
-
You're not wrong, but keep in mind ChatGPT advocates, including the company itself are referring to it as AI, including in marketing. They're saying it's a complete, self-learning, constantly-evolving Artificial Intelligence that has been improving itself since release... And it loses to a 4KB video game program from 1979 that can only "think" 2 moves ahead.
That's totally fair, the company is obviously lying, excuse me "marketing", to promote their product, that's absolutely true.
-
There are custom GPTs which claim to play at Stockfish level, or to literally be Stockfish under the hood (I assume the former is still the latter, just not stated explicitly). I haven't tested them, but if they work, I'd say yes. An LLM by itself will never be able to play chess or do anything similar, unless it outsources that task to another tool that can. And there seem to be GPTs that do exactly that.
As for why we need ChatGPT then when the result comes from Stockfish anyway, it's for the natural language prompts and responses.
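That delegation pattern is easy to sketch: the language-model layer only handles the conversation, and anything chess-shaped gets routed to an engine. Everything below (the function names, the stub engine) is made up for illustration; a real setup would talk to Stockfish over the UCI protocol instead of returning a canned move:

```python
def stub_engine_best_move(fen):
    # Placeholder standing in for a real engine call (e.g., Stockfish over UCI).
    # Always answers 1. e4 regardless of position, which is the point:
    # the "LLM" part never computes chess at all.
    return "e2e4"

def answer(user_message):
    # The conversational layer: parse intent, delegate, wrap the result
    # back into natural language.
    if "best move" in user_message.lower():
        move = stub_engine_best_move("startpos")
        return f"The engine suggests {move}."
    return "I can only help with chess questions in this sketch."

print(answer("What is the best move here?"))  # The engine suggests e2e4.
```

All the chess strength lives in the delegated function; the wrapper only adds the natural-language interface, which is exactly the division of labor described above.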
It's not an LLM, but Stockfish does use AI under the hood and has since 2020: a classical alpha-beta search strategy (if I recall correctly) combined with a neural network (NNUE) for smarter position evaluation.
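The alpha-beta part is plain classical search, nothing neural about it. A minimal sketch over a toy game tree, where leaves are static evaluations and internal nodes are lists of children; this is a textbook illustration, not Stockfish's actual code:

```python
def alphabeta(node, alpha, beta, maximizing):
    # Leaves are static evaluation scores; internal nodes are lists of children.
    if isinstance(node, int):
        return node
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # beta cutoff: the opponent will never allow this branch
        return best
    else:
        best = float("inf")
        for child in node:
            best = min(best, alphabeta(child, alpha, beta, True))
            beta = min(beta, best)
            if alpha >= beta:
                break  # alpha cutoff: we already have a better option elsewhere
        return best

# Depth-3 toy tree: max -> min -> max -> leaves.
tree = [[[3, 5], [6, 9]], [[1, 2], [0, -1]]]
print(alphabeta(tree, float("-inf"), float("inf"), True))  # 5
```

Alpha-beta returns exactly the same value as plain minimax; the cutoffs only skip branches that provably cannot change the result, which is why the NNUE evaluation can be bolted onto it without touching the search logic.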
There are some engines of comparable strength that are primarily neural-network based.
lc0 comes to mind. lc0 placed 2nd in the Top Chess Engine Championship in 9 out of the past 10 seasons. By comparison, Stockfish is currently on a 10-season win streak in the TCEC.
-
GothamChess has a video of making ChatGPT play chess against Stockfish. Spoiler: ChatGPT does not do well. It plays okay for a few moves, but the moment it gets in trouble it straight up cheats. Telling it to follow the rules of chess doesn't help.
This sort of gets to the heart of LLM-based "AI". That one example to me really shows that there's no actual reasoning happening inside. It's producing answers that statistically look like answers that might be given based on that input.
For some things it even works. But calling this intelligence is dubious at best.
Because it doesn't have any understanding of the rules of chess, or even an internal model of the game state. It just has the text of chess games in its training data and can reproduce the notation, but it has nothing to prevent it from making illegal moves, trying to move or capture pieces that don't exist, incorrectly declaring check/checkmate, or any number of other nonsensical things.
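For contrast, even a toy engine has to carry exactly the state an LLM emitting notation lacks. A minimal sketch (hypothetical helper, rook moves only, ignoring checks, pins, castling and everything else):

```python
# Board state: square name -> (color, piece). A text predictor has no
# equivalent of this dict; it only has notation that looked plausible before.
board = {"a1": ("white", "rook"), "a5": ("black", "pawn")}

def rook_move_legal(board, src, dst, color):
    """Right piece on the source square, move along a rank or file,
    clear path, no capture of your own piece. Everything else ignored."""
    if board.get(src) != (color, "rook"):
        return False  # no such rook there: the move references a ghost piece
    if src[0] != dst[0] and src[1] != dst[1]:
        return False  # rooks move along a rank or a file only
    files = "abcdefgh"
    if src[0] == dst[0]:  # same file, walk the ranks strictly in between
        lo, hi = sorted((int(src[1]), int(dst[1])))
        path = [src[0] + str(r) for r in range(lo + 1, hi)]
    else:  # same rank, walk the files strictly in between
        lo, hi = sorted((files.index(src[0]), files.index(dst[0])))
        path = [files[f] + src[1] for f in range(lo + 1, hi)]
    if any(sq in board for sq in path):
        return False  # path is blocked by another piece
    return board.get(dst, (None,))[0] != color  # cannot capture own piece

print(rook_move_legal(board, "a1", "a4", "white"))  # True
print(rook_move_legal(board, "a1", "a8", "white"))  # False: pawn on a5 blocks
print(rook_move_legal(board, "b1", "b4", "white"))  # False: no rook on b1
```

Even this crude validator rejects the "move a piece that doesn't exist" and "move through an occupied square" failures described above, because it consults state instead of predicting text.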
-
GothamChess has a video of making ChatGPT play chess against Stockfish. Spoiler: ChatGPT does not do well. It plays okay for a few moves, but the moment it gets in trouble it straight up cheats. Telling it to follow the rules of chess doesn't help.
This sort of gets to the heart of LLM-based "AI". That one example to me really shows that there's no actual reasoning happening inside. It's producing answers that statistically look like answers that might be given based on that input.
For some things it even works. But calling this intelligence is dubious at best.
I think the biggest problem is its very low "test-time adaptability". Even when combined with a reasoning model outputting into its context, the weights do not learn beyond the immediate context.
I think the solution might be to train a LoRA overlay on the fly against the weights, run inference with that AND the unmodified weights, and then have an overseer model self-evaluate and recompose the raw outputs.
Humans are way better at answering stuff when it's a collaboration of more than one person. I suspect the same is true of LLMs.
-
GPTs which claim to use a stockfish API
Then the actual chess isn't LLM. If you are going through Stockfish, then the LLM doesn't add anything; Stockfish is doing everything.
The whole point of the marketing rage is that LLMs can do all kinds of stuff, doubling down with the branding of some approaches as "reasoning" models, which are roughly "similar to 'pre-reasoning', but forcing use of more tokens on disposable intermediate generation steps". With this facet of LLM marketing, the promise would be that the LLM can "reason" itself through a chess game without particular enablement. In practice, people trying to feed gobs of chess data into an LLM end up with an LLM that doesn't even comply with the rules of the game, let alone provide reasonable competitive responses to an opponent.
Then the actual chess isn't LLM.
And neither did the Atari 2600 win against ChatGPT. Whatever game they ran on it did.
That's my point here. The fact that neither Atari 2600 nor ChatGPT are capable of playing chess on their own. They can only do so if you provide them with the necessary tools. Which applies to both of them. Yet only one of them was given those tools here.
-
An LLM is a poor computational/predictive paradigm for playing chess.
Yeah, a lot of them hallucinate illegal moves.
-
The Atari 2600 is just hardware. The software came on plug-in cartridges. Video Chess was released for it in 1979.
-
Well, so much hype has been generated around ChatGPT being close to AGI that it now makes sense to ask questions like "can ChatGPT prove the Riemann hypothesis".
Even the models that pretend to be AGI are not. It's been proven.
-
Then the actual chess isn't LLM.
And neither did the Atari 2600 win against ChatGPT. Whatever game they ran on it did.
That's my point here. The fact that neither Atari 2600 nor ChatGPT are capable of playing chess on their own. They can only do so if you provide them with the necessary tools. Which applies to both of them. Yet only one of them was given those tools here.
Fine: a chess engine capable of running on 1970s electronics that were affordable even at the time will best what marketing folks would have you think is an arbitrarily capable "reasoning" model running on top-of-the-line 2025 hardware.
You can split hairs about "well actually, the 2600 is hardware and the chess engine is software", but everyone gets the point.
As for assertions that no one should expect an LLM to be a chess engine: well, tell that to the industry that is asserting LLMs are now "reasoning" and provide a basis to replace most of the labor pool. We need stories like this to calibrate expectations in a way common people can understand.
-
I swear every single article critical of current LLMs is like, "The square got BLASTED by the triangle shape when it completely FAILED to go through the triangle shaped hole."
Well, the first and most obvious way to show that AI is bad is to show that AI is bad. If it offers that much low-hanging fruit for the demonstration... that just further emphasizes the point.
-
Sometimes it seems like most of these AI articles are written by AIs with bad prompts.
Human journalists would hopefully do a little research. A quick search would reveal that researchers have been publishing about this for over a year, so there's no need to sensationalize it. Perhaps the human journalist could have spent a little time talking about why LLMs are bad at chess and how researchers are approaching the problem.
LLMs on the other hand, are very good at producing clickbait articles with low information content.
In this case it's not even bad prompts, it's a problem domain ChatGPT wasn't designed to be good at. It's like saying modern medicine is clearly bullshit because a doctor loses a basketball game.
-
To be fair, a decent chunk of coding is stupid boilerplate/minutia that varies environment to environment, language to language, library to library.
So LLMs can do some code completion: filling out blatantly obvious boilerplate, generating the redundant text mandated by certain patterns, and keeping straight details between languages like "does this language want join as a method on a list with a string argument, or vice versa?"
Problem is, this can sometimes be more trouble than it's worth, as miscompletions are annoying.
a decent chunk of coding is stupid boilerplate/minutia that varies
...according to a logic, which means LLMs are bad at it.
-
Ah, you used logic. That's the issue. They don't do that.
-
a decent chunk of coding is stupid boilerplate/minutia that varies
...according to a logic, which means LLMs are bad at it.
I'd say that those details that vary tend not to vary within a language and ecosystem, so a fairly dumb correlative relationship is generally enough to be fine. There's no way to use logic to infer that in language X you need to do mylist.join(string) but in language Y you need to do string.join(mylist), but it's super easy to recognize tokens that suggest those things and correlate them to the vocabulary that matches the context.
Rinse and repeat for things like: do I need to specify a type, and what is the name of the best type for a numeric value? This variable is missing a declaration; does it actually look like a new distinct variable, or just a typo of one that was already declared?
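The join example is real, and it's exactly the kind of per-language vocabulary a correlative model can keep straight:

```python
words = ["a", "b", "c"]

# Python hangs join off the separator string, not the list:
print("-".join(words))  # a-b-c

# words.join("-") would raise AttributeError: Python lists have no .join.
# JavaScript flips it: words.join("-") is the idiomatic form there.
```

Nothing about either placement is logically derivable; it's pure convention, which is why pattern-matching on surrounding tokens is often enough to get it right.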
But again, I'm mostly thinking of what kind of sort of can work; my experience personally is that it's wrong so often as to be annoying, and it gets in the way of more traditional completion behaviors that play it safe, though those give less help, particularly for languages like Python or JavaScript.