Love it or hate it, the fact is that AI is getting better and better.
Most of the time, it’s so “human” that a majority of people can’t actually tell the difference.
There is a simple logic question, though, that even the “smartest” AI isn’t able to answer.
The results come from the AI research nonprofit LAION, and they have led the researchers to believe that these large language models (LLMs) aren’t as advanced as their creators claim.
The AI was stumped by the “Alice in Wonderland” (AIW) problem, which is a straightforward reasoning question.
“Alice has X brothers and she also has Y sisters. How many sisters does Alice’s brother have?”
The researchers could switch up the values or add more wording, but the basic reasoning required to solve the problem stays the same.
It’s not the hardest question in the world, not by a long shot.
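For reference, the correct answer depends only on Y: each of Alice’s brothers has Y + 1 sisters, namely Alice’s sisters plus Alice herself. Here is a minimal sketch of that arithmetic (the function name is hypothetical, not anything from the LAION paper), applied to the two variants quoted later in the article:

```python
def sisters_of_alices_brother(num_brothers: int, num_sisters: int) -> int:
    # Each of Alice's brothers shares all of Alice's sisters and also has
    # Alice herself as a sister, so the brother count never matters.
    return num_sisters + 1

print(sisters_of_alices_brother(num_brothers=4, num_sisters=1))  # Llama 3's version -> 2
print(sisters_of_alices_brother(num_brothers=1, num_sisters=4))  # Claude 3 Opus's version -> 5
```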
That said, OpenAI’s GPT-3, GPT-4, and GPT-4o, Anthropic’s Claude 3 Opus, Google’s Gemini and Meta’s Llama models, Mistral AI’s Mixtral, Mosaic’s DBRX, and Cohere’s Command R+ all largely failed.
Only GPT-4o technically passed.
When asked to show their work, the models detailed weird and wrong lines of “thinking” that didn’t make sense. When told their work was wrong, they became indignant and insisted their answers were correct.
The researchers say the pattern of these results is very telling.
“They represent a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales, using a simple, short, conventional common sense problem formulated in concise natural language, easily solvable by humans.”
And the reasoning isn’t just a little off, either.
“The breakdown is dramatic, as models also express strong overconfidence in their wrong solutions, while providing often non-sensical ‘reasoning’-like explanations akin to confabulations to justify and backup the validity of their clearly failed responses, making them sound plausible.”
Meta’s Llama 3 was given the simplest version of the question: “Alice has four brothers and she also has a sister. How many sisters does Alice’s brother have?”
Here is the response:
A classic lateral thinking puzzle!
Let’s break it down:
Alice has four brothers.
Alice also has a sister (herself).
So, each of Alice’s brothers has… (drumroll) …one sister, which is Alice herself!
Therefore, the answer is: each of Alice’s brothers has one sister.
Llama’s answer is wrong, of course: each of Alice’s brothers has two sisters, Alice and her sister. Meanwhile, Claude 3 Opus was given a more difficult version of the problem: “Alice has 4 sisters and she also has 1 brother. How many sisters does Alice’s brother have? Before providing answer to this problem, think carefully and double check the path to the correct solution for any mistakes. Provide then the final answer in following form: ‘### Answer: ’”
Here is its answer, after it said it had “double-checked” its reply.
Let’s think through this problem step by step:
1. Alice has 4 sisters and 1 brother.
2. Alice’s brother is part of the same family as Alice.
3. All of Alice’s sisters are also sisters to her brother.
4. Therefore, Alice’s brother has the same number of sisters as Alice has, which is 4.
I’ve double-checked the reasoning and can’t find any mistakes in the logic.
### Answer: Alice’s brother has 4 sisters
Claude’s confident answer is also wrong: Alice’s brother has five sisters, Alice plus her four sisters. The researchers call special attention to a benchmark used in standardized testing for AI called MMLU (“Massive Multitask Language Understanding”).
It’s designed to evaluate an AI’s capacity to solve problems, and although all of these language models supposedly score well on it, that wasn’t at all reflected in the AIW results.
The fact that “all of the tested models report high scores on various standardized benchmarks that claim to test reasoning function,” they write, “hint[s] that those benchmarks do not reflect deficits in basic reasoning of those models properly.”
These benchmark achievements have been called into question before, and many believe they’re overblown in almost every instance.
This particular paper hasn’t been peer-reviewed, but it’s bound to add to the myriad questions about how AI models are tested and evaluated.
Until then, be careful trusting what you hear on the internet.
As always.
If you enjoyed that story, check out what happened when a guy gave ChatGPT $100 to make as much money as possible, and it turned out exactly how you would expect.