
As mentioned previously, I’m preparing course materials and practical exercises around AI/LLM/GPT models for the upcoming term, and I talked to coders (game engineering, mostly) who tried out ChatGPT’s coding assistance abilities. The gist: ChatGPT gets well-documented routines right, though even then its answers are blissfully oblivious to best-practice considerations; it gets more taxing challenges wrong or, worse, subtly wrong; and it begins to fall apart and/or make shit up down the road. (As a disclaimer, I’m not a programmer; I’ve always restricted myself to scripting, as there are too many fields and subjects in my life I need to keep up with already.)
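To make that first point concrete, here’s a purely hypothetical sketch of my own (not actual ChatGPT output, and the `users` table and function names are made up) of the pattern these coders described: a routine that does the well-documented thing correctly, next to the version that observes the usual precautions.

```python
from contextlib import closing
import sqlite3

def find_user_naive(db_path, username):
    # Correct on the happy path, but builds SQL by string concatenation
    # (injection-prone) and never closes the connection.
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT id, name FROM users WHERE name = '" + username + "'")
    return cur.fetchone()

def find_user_better(db_path, username):
    # The same routine with the usual precautions: a parameterized query and
    # an explicit close via contextlib.closing (sqlite3's own context manager
    # only handles transactions, not closing the connection).
    with closing(sqlite3.connect(db_path)) as conn:
        cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (username,))
        return cur.fetchone()
```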

Against this backdrop, here’s a terrific post by Tyler Glaiel on Substack: “Can GPT-4 *Actually* Write Code?” Starting with an example of not setting the cat on fire, his post is quite technical and goes deep into the weeds—but that’s exactly what makes it interesting. Glaiel’s summary:

Would this have helped me back in 2020? Probably not. I tried to take its solution and use my mushy human brain to modify it into something that actually worked, but the path it was going down was not quite correct, so there was no salvaging it. […] I tried this again with a couple of other “difficult” algorithms I’ve written, and it’s the same thing pretty much every time. It will often just propose solutions to similar problems and miss the subtleties that make your problem different, and after a few revisions it will often just fall apart. […]

The crescent example is a bit damning here. ChatGPT doesn’t know the answer, there was no example for this in its training set and it can’t find that in its model. The useful thing to do would be to just say “I do not know of an algorithm that does this.” But instead it’s overconfident in its own capabilities, and just makes shit up. It’s the same problem it has with plenty of other fields, though its strange competence in writing simple code sorta hides that fact a bit.
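For anyone who hasn’t read the post: the crescent in question is the lune-shaped region inside one circle but outside a second, overlapping one. The snippet below is merely my own illustration of that shape via a trivial point-in-crescent test; the task Glaiel actually posed about it is far harder, which is precisely why there was no ready-made solution for GPT-4 to retrieve.

```python
import math

def point_in_crescent(p, c1, r1, c2, r2):
    """Hypothetical illustration only: is point p inside the crescent formed by
    circle 1 minus circle 2, i.e. inside the first circle but outside the second?
    Glaiel's actual problem involving this shape is not this simple test."""
    d1 = math.hypot(p[0] - c1[0], p[1] - c1[1])
    d2 = math.hypot(p[0] - c2[0], p[1] - c2[1])
    return d1 <= r1 and d2 > r2

# A unit circle at the origin with a second circle biting into its right half:
print(point_in_crescent((-0.5, 0.0), (0.0, 0.0), 1.0, (0.8, 0.0), 0.7))  # True
print(point_in_crescent((0.8, 0.0), (0.0, 0.0), 1.0, (0.8, 0.0), 0.7))   # False
```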

Curiously, he found that GPT-3.5 came closer to one specific solution than GPT-4 did:

When I asked GPT-3.5 accidentally it got much much closer. This is actually a “working solution, but with some bugs and edge cases.” It can’t handle a cycle of objects moving onto each other in a chain, but yeah this is much better than the absolute nothing GPT-4 gave… odd…
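To give a sense of what “a cycle of objects moving onto each other in a chain” involves, here’s a sketch of that problem class, my own reconstruction under assumptions rather than Glaiel’s algorithm or either model’s answer: every object declares a destination cell, a move succeeds only if the destination is empty or being vacated by a move that itself succeeds, and a closed loop of objects rotating into each other’s cells should resolve cleanly.

```python
def resolve_simultaneous_moves(positions, targets):
    """positions: {obj_id: (x, y)} current cells, one object per cell.
    targets: {obj_id: (x, y)} requested destinations for objects that want to move.
    Returns {obj_id: (x, y)} with the new cell of every object after one step."""
    occupant = {cell: obj for obj, cell in positions.items()}

    # 1. Two objects requesting the same cell block each other: drop both moves.
    claims = {}
    for obj, cell in targets.items():
        claims.setdefault(cell, []).append(obj)
    movers = {obj: cell
              for cell, objs in claims.items() if len(objs) == 1
              for obj in objs if cell != positions[obj]}

    # 2. Fixpoint: a move succeeds if the destination is empty or will be vacated
    #    by a mover already known to succeed; it fails if the destination is held
    #    by something known to stay put.
    status = {obj: None for obj in movers}          # None = undecided
    changed = True
    while changed:
        changed = False
        for obj, cell in movers.items():
            if status[obj] is not None:
                continue
            blocker = occupant.get(cell)
            if blocker is None or status.get(blocker) is True:
                status[obj] = True
            elif blocker not in movers or status.get(blocker) is False:
                status[obj] = False
            else:
                continue
            changed = True

    # 3. Anything still undecided is part of a closed cycle (each one waits only
    #    on another undecided mover), so the whole cycle moves together.
    for obj, st in status.items():
        if st is None:
            status[obj] = True

    return {obj: (movers[obj] if status.get(obj) else positions[obj])
            for obj in positions}


# A three-object cycle rotating through each other's cells resolves cleanly:
pos = {"a": (0, 0), "b": (1, 0), "c": (1, 1)}
tgt = {"a": (1, 0), "b": (1, 1), "c": (0, 0)}
print(resolve_simultaneous_moves(pos, tgt))
# {'a': (1, 0), 'b': (1, 1), 'c': (0, 0)}
```

The cycle handling at the end is exactly the edge case the quoted GPT-3.5 attempt reportedly missed.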

Generally, we shouldn’t automatically expect GPT to become enormously better with each version: besides the law of diminishing returns, there is no reason to assume that simply making LLMs bigger and bigger will make them better and better. But we should at least be able to expect that updated versions won’t perform worse.

And then there’s this illuminating remark by Glaiel, buried in the comments:

GPT-4 can’t reason about a hard problem if it doesn’t have example references of the same problem in its training set. That’s the issue. No amount of tweaking the prompt or overexplaining the invariants (without just writing the algorithm in English, which if you can get to that point then you already solved the hard part of the problem) will get it to come to a proper solution, because it doesn’t know one. You’re welcome to try it yourself with the problems I posted here.

That’s the point. LLMs do not think and cannot reason. They can only find and deliver solutions that already exist.

Finally, don’t miss ChatGPT’s self-assessment on these issues, after Glaiel fed the entire conversation back to it and asked it to write the “final paragraph” for his post!
