Study finds AI mostly useless at solving problems for coders

Princeton University researchers expose limits of ChatGPT and Claude

8 February 2024

ChatGPT and other language-based artificial intelligence models are often used to generate pieces of code for computer programs: a computer science student stuck on a problem can simply turn to ChatGPT for a solution. However, according to a Princeton University study, the coding solutions AI produces almost never work.

To test numerous AI large language models, the researchers gave each one a piece of computer code containing a problem. In each case, they described what the problem was and asked whether the AI program could write out a solution.

The output was usually requested as a patch: a file telling a computer exactly which lines of code to modify. The researchers then applied the patches in a test environment and checked whether the problem was actually solved.
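A patch of this kind is typically a unified diff. As a hypothetical illustration (the buggy function below is invented, not taken from the study), Python's standard `difflib` module can produce one from a before-and-after pair:

```python
import difflib

# Hypothetical buggy code and its one-line fix (invented for illustration).
buggy = """def average(values):
    return sum(values) / len(values) + 1
""".splitlines(keepends=True)

fixed = """def average(values):
    return sum(values) / len(values)
""".splitlines(keepends=True)

# A unified diff names the file and the exact lines to remove (-) and
# add (+), which is the form in which the models' answers were applied.
patch = "".join(difflib.unified_diff(buggy, fixed,
                                     fromfile="stats.py", tofile="stats.py"))
print(patch)
```

Running this prints the `---`/`+++` file headers followed by the changed lines, one prefixed with `-` (the old line) and one with `+` (the fix).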

The idea was that the AI models should find the problem in the code themselves. That is not easy given the limits on the number of characters you can feed a model, so two retrieval methods were used. The sparse method mathematically ranks the code ('BM25') so that only the most relevant pieces remain. The oracle method instead looks at a pre-existing solution to the problem and feeds the AI bots only the code that solution touched. The oracle method gave better results, but is less realistic.
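The sparse retrieval idea can be sketched in a few lines. The snippet below is a toy, self-contained BM25 ranker (the file snippets and issue text are invented for illustration; real systems use tuned libraries and proper tokenisers): it scores each source file against the issue description and returns the files in order of relevance, so only the top hits need to be sent to the model.

```python
import math
import re
from collections import Counter

def _tokens(text):
    """Crude tokeniser: lowercase words and identifiers."""
    return re.findall(r"[a-z_]+", text.lower())

def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Return document indices ranked by BM25 relevance to the query."""
    docs = [_tokens(d) for d in documents]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many files each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))

    def score(d):
        tf = Counter(d)
        s = 0.0
        for term in _tokens(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avg_len))
        return s

    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)

# Invented example: three file snippets and an issue description.
files = [
    "def parse_date(s): return s.split('-')",
    "class Logger: pass",
    "def parse_date_iso(value): ...  # date parsing helpers",
]
issue = "bug in parse_date date string split"
print(bm25_rank(issue, files))  # most relevant file index first
```

The file mentioning both `parse_date` and `split` ranks first; the unrelated logger file ranks last. The oracle method skips this step entirely by reading the known fix's file list.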

The results were staggering: with the second (oracle) method, Claude 2 got the best marks, with a meager 4.8% correct solutions. GPT-4, the paid version of ChatGPT, achieved 1.74%. And across all programming problems, GPT-3.5 came up with a working solution in only 0.52% of cases.

The conclusion: AI can’t see the forest for the trees.

According to the researchers, the models struggled more when longer code was entered. AI is not capable enough to distinguish what is relevant from what is not, and got confused by pieces of code it did not need to solve the problem.

Asking for a patch as the solution works better than asking for fully rewritten code, which has to do with how the models are trained. The generated patches are often poorly composed, though, which is why they almost never work. The researchers thought having the full code generated might help, but that worked even worse.

Moreover, all the models modified only a minimal amount of code, even when more extensive modifications would have worked better.

They also found little difference between models developed before or after 2023, with the exception of GPT-4.

The scientists saw this as a hopeful sign that newer GPT models may become better at finding and fixing non-working code.

News Wires
