How AI Systems Fail the Human Test

Economists have a game that reveals how deeply individuals reason. Known as the 11-20 money request game, it is played between two players who each request an amount of money between 11 and 20 shekels, knowing that both will receive the amount they ask for.

But there’s a twist: if one player asks for exactly one shekel less than the other, that player earns a bonus of 20 shekels. This tests each player’s ability to think about what their opponent might do — a classic challenge of strategic reasoning.
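To make the payoff rule concrete, here is a minimal sketch in Python of the rules just described (the function name and layout are purely illustrative, not taken from the paper):

```python
def payoff(my_request: int, opponent_request: int) -> int:
    """Payoff in the 11-20 money request game, in shekels.

    Each player receives what they request (11-20). A player who asks
    for exactly one shekel less than their opponent also earns a
    20-shekel bonus.
    """
    assert 11 <= my_request <= 20 and 11 <= opponent_request <= 20
    bonus = 20 if my_request == opponent_request - 1 else 0
    return my_request + bonus


# Requesting 19 against an opponent who requests 20 pays 19 + 20 = 39.
print(payoff(19, 20))  # 39
print(payoff(20, 20))  # 20
```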

The 11-20 game is an example of level-k reasoning in game theory, where each player tries to anticipate the other’s thought process and adjust their own choices accordingly. For example, a player using level-1 reasoning might pick 19 shekels, assuming the other will pick 20. But a level-2 thinker might ask for 18, predicting that their opponent will go for 19. This kind of thinking gets layered, creating an intricate dance of strategy and second-guessing.
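One way to picture this chain of second-guessing is as a sequence of best responses, starting from a naive level-0 player who simply asks for the maximum. The following is a simplified sketch, assuming level 0 always requests 20 and ignoring what happens once the chain reaches the minimum request:

```python
def level_k_choice(k: int) -> int:
    """Request made by a level-k player in the 11-20 game.

    Level 0 naively asks for the maximum (20). Each higher level
    best-responds by undercutting the level below by one shekel,
    capped at the minimum allowed request of 11 (a simplification).
    """
    return max(11, 20 - k)


for k in range(4):
    print(f"level {k}: requests {level_k_choice(k)}")
# level 0: requests 20
# level 1: requests 19
# level 2: requests 18
# level 3: requests 17
```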

Human Replacements?

In recent years, various researchers have suggested that large language models (LLMs) like ChatGPT and Claude can behave like humans in a wide range of tasks. That has raised the possibility that LLMs could stand in for humans in tasks such as testing opinion of new products and adverts before they are released to the market, an approach that would be significantly cheaper than current methods.

But that raises the important question of whether LLM behavior really is similar to humans’. Now we get an answer thanks to the work of Yuan Gao and colleagues at Boston University, who have used a wide range of advanced LLMs to play the 11-20 game. They found that none of these AI systems produced results similar to human players and say that extreme caution is needed when it comes to using LLMs as surrogates for humans.

The team’s approach is straightforward. They explained the rules of the game to LLMs, including several models from the ChatGPT, Claude, and Llama families. They asked each to choose a number and then explain its reasoning. And they repeated the experiment a thousand times for each LLM.
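The paper’s exact prompts and model settings are not reproduced here, but the structure of such an experiment might look like the sketch below, where `query_llm` is a hypothetical stand-in for whatever API call reaches each model:

```python
import re
from collections import Counter

# Illustrative prompt paraphrasing the game rules; not the authors' wording.
GAME_PROMPT = (
    "You and another player each request an amount between 11 and 20 shekels. "
    "Each of you receives the amount you request. If your request is exactly "
    "one shekel less than the other player's, you receive a 20-shekel bonus. "
    "Which amount do you request, and why?"
)


def query_llm(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM API; replace with a real client."""
    raise NotImplementedError


def run_experiment(model: str, trials: int = 1000) -> Counter:
    """Ask the model to play the game `trials` times and tally its requests."""
    choices = Counter()
    for _ in range(trials):
        reply = query_llm(model, GAME_PROMPT)
        match = re.search(r"\b(1[1-9]|20)\b", reply)  # first number in 11-20
        if match:
            choices[int(match.group())] += 1
    return choices
```

The resulting tally can then be compared against the distribution of choices that human players typically produce.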

But Gao and co were not impressed with the results. Human players typically use sophisticated strategies that reflect deeper reasoning levels. For example, a common human choice might be 17, reflecting an assumption that their opponent will select a higher value like 18 or 19. But the LLMs showed a starkly different pattern: many simply chose 20 or 19, reflecting basic level-0 or level-1 reasoning.

The researchers also tried to improve the performance of the LLMs with techniques like writing more suitable prompts and fine-tuning the models. GPT-4 showed more human-like responses as a result, but the others did not.

The LLMs’ behavior was also highly inconsistent, varying with irrelevant factors such as the language in which they were prompted.

Gao and co say the reason LLMs fail to reproduce human behavior is that they don’t reason like humans. Human behavior is complex, driven by emotions, biases, and varied interpretations of incentives, like the desire to beat an opponent. LLMs produce their answers by using patterns in language to predict the next word in a sentence, a process that is fundamentally different from human thinking.

Sobering Result

That’s likely to be a sobering result for social scientists, for whom the idea that LLMs could replace humans in certain types of experiments is tempting.

But Gao and co say: “Expecting to gain insights into human behavioral patterns through experiments on LLMs is like a psychologist interviewing a parrot to understand the mental state of its human owner.” The parrot might use words and phrases similar to its owner’s, but manifestly without insight.

“These LLMs are human-like in appearance yet fundamentally and unpredictably different in behavior,” they say.

Social scientists: you have been warned!


Ref: Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina: arxiv.org/abs/2410.19599