January 23, 2025

Q-tips

The real research behind the wild rumors about OpenAI's Q* project

OpenAI hasn't said what Q* is, but it has revealed plenty of clues.

Timothy B. Lee – Dec 8, 2023 12:45 pm UTC

On November 22, a few days after OpenAI fired (and then re-hired) CEO Sam Altman, The Information reported that OpenAI had made a technical breakthrough that would allow it to develop far more powerful artificial intelligence models. Dubbed Q* (and pronounced "Q star"), the new model was able to solve math problems that it hadn't seen before.

Reuters published a similar story, but details were vague.

Both outlets linked this supposed breakthrough to the board's decision to fire Altman. Reuters reported that several OpenAI staffers sent the board a letter warning of a powerful artificial intelligence discovery that they said could threaten humanity. However, Reuters was unable to review a copy of the letter, and subsequent reporting hasn't connected Altman's firing to concerns over Q*.

The Information reported that earlier this year, OpenAI built systems that could solve basic math problems, a difficult task for existing AI models. Reuters described Q* as performing math on the level of grade-school students.

Instead of immediately leaping in with speculation, I decided to take a few days to do some reading. OpenAI hasn't published details on its supposed Q* breakthrough, but it has published two papers about its efforts to solve grade-school math problems. And a number of researchers outside of OpenAI, including at Google's DeepMind, have been doing important work in this area.

I'm skeptical that Q*, whatever it is, is the crucial breakthrough that will lead to artificial general intelligence. I certainly don't think it's a threat to humanity. But it might be an important step toward an AI with general reasoning abilities.

In this piece, I'll offer a guided tour of this important area of AI research and explain why step-by-step reasoning techniques designed for math problems could have much broader applications.

The power of reasoning step by step

Consider the following math problem:

John gave Susan five apples and then gave her six more. Susan then ate three apples and gave three to Charlie. She gave her remaining apples to Bob, who ate one. Bob then gave half his apples to Charlie. John gave seven apples to Charlie, who gave Susan two-thirds of his apples. Susan then gave four apples to Charlie. How many apples does Charlie have now?

Before you continue reading, see if you can solve the problem yourself. I'll wait.

Most of us memorized basic math facts like 5+6=11 in grade school. So if the problem just said, "John gave Susan five apples and then gave her six more," we'd be able to tell at a glance that Susan had 11 apples.

But for more complicated problems, most of us need to keep a running tally, either on paper or in our heads, as we work through them. So first we add up 5+6=11. Then we take 11-3=8. Then 8-3=5, and so forth. By thinking step by step, we'll eventually get to the correct answer: 8.
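The running tally described above can be written out as a short Python sketch. The helper names (`receive`, `eat`, `give`) are illustrative assumptions, and John's gifts are modeled as apples arriving from outside because the problem never says how many apples John has.

```python
# Running tally for the apple problem, one event at a time.
# Helper names are hypothetical; they are not taken from the article.
apples = {"Susan": 0, "Bob": 0, "Charlie": 0}

def receive(person, count):
    """Apples arriving from John, whose own supply is never stated."""
    apples[person] += count

def eat(person, count):
    """Apples that leave the tally because someone ate them."""
    apples[person] -= count

def give(giver, receiver, count):
    """Move apples from one tracked person to another."""
    apples[giver] -= count
    apples[receiver] += count

receive("Susan", 5)
receive("Susan", 6)                                    # Susan: 11
eat("Susan", 3)                                        # Susan: 8
give("Susan", "Charlie", 3)                            # Susan: 5, Charlie: 3
give("Susan", "Bob", apples["Susan"])                  # Bob: 5, Susan: 0
eat("Bob", 1)                                          # Bob: 4
give("Bob", "Charlie", apples["Bob"] // 2)             # Charlie: 5, Bob: 2
receive("Charlie", 7)                                  # Charlie: 12
give("Charlie", "Susan", apples["Charlie"] * 2 // 3)   # Charlie: 4, Susan: 8
give("Susan", "Charlie", 4)                            # Charlie: 8, Susan: 4

print(apples["Charlie"])  # 8
```

Each line changes only one or two numbers, which is what makes the tally easy to follow and easy to check.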

The same trick works for large language models. In a famous January 2022 paper, Google researchers pointed out that large language models produce better results if they are prompted to reason one step at a time. Here's a key graphic from their paper:

[Figure: standard prompting (left) versus step-by-step prompting (right)]

This paper was published before zero-shot prompting was common, so they prompted the model by giving an example solution. In the left-hand column, the model is prompted to jump straight to the final answer and gets it wrong. On the right, the model is prompted to reason one step at a time and gets the right answer. The Google researchers dubbed this technique chain-of-thought prompting; it is still widely used today.
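To make the difference between the two prompting styles concrete, here is a minimal Python sketch that builds both kinds of prompt as plain strings. The worked example about Maria's pencils and the `build_prompt` helper are invented for illustration; they are not the exemplar or code from the Google paper, and no model is actually called.

```python
# Hypothetical exemplar; NOT the one used in the Google paper.
EXEMPLAR_QUESTION = (
    "Maria had four pencils and bought nine more. She then lost two. "
    "How many pencils does Maria have?"
)
# Standard prompting: the example answer jumps straight to the result.
DIRECT_ANSWER = "The answer is 11."
# Chain-of-thought prompting: the example answer shows its work.
STEP_BY_STEP_ANSWER = (
    "Maria started with 4 pencils. 4 + 9 = 13. "
    "After losing 2, 13 - 2 = 11. The answer is 11."
)

def build_prompt(exemplar_answer: str, new_question: str) -> str:
    """Prefix a new question with one worked example (one-shot prompting)."""
    return (
        f"Q: {EXEMPLAR_QUESTION}\n"
        f"A: {exemplar_answer}\n\n"
        f"Q: {new_question}\n"
        "A:"
    )

question = "John gave Susan five apples and then gave her six more. Susan then ate three. How many apples does Susan have?"
standard_prompt = build_prompt(DIRECT_ANSWER, question)
cot_prompt = build_prompt(STEP_BY_STEP_ANSWER, question)

print(cot_prompt)
```

The only difference between the two prompts is the example answer. A model that continues the second prompt tends to imitate the step-by-step format before stating its final answer.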

If you read our July article explaining large language models, you might be able to guess why this happens.

To a large language model, numbers like "five" and "six" are tokens, no different from "the" or "cat." An LLM learns that 5+6=11 because this sequence of tokens (and variations like "five and six make eleven") appears thousands of times in its training data. But an LLM's training data probably doesn't include any examples of a long calculation like ((5+6-3-3-1)/2+3+7)/3+4=8. So if a language model is asked to do this calculation in a single step, it's more likely to get confused and produce the wrong answer.

Another way to think about it is that large language models don't have any external scratch space to store intermediate results like 5+6=11. Chain-of-thought reasoning enables an LLM to effectively use its own output as scratch space. This allows it to break a complicated problem down into bite-sized steps, each of which is likely to match examples in the model's training data.
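The scratch-space idea can be illustrated with a deliberately crude toy in Python. Everything here is an assumption made for illustration: a real LLM is not a lookup table, and the names `build_memorized`, `recall`, and `solve_step_by_step` are hypothetical. The toy "model" can only recall single-operation facts on small whole numbers, so it fails on the long expression in one shot but succeeds when it writes each intermediate result to a transcript and reads it back.

```python
def build_memorized(limit=20):
    """Toy stand-in for training data: one-operation facts on small numbers."""
    facts = {}
    for a in range(limit + 1):
        for b in range(1, limit + 1):
            facts[f"{a}+{b}"] = a + b
            if a >= b:
                facts[f"{a}-{b}"] = a - b
            if a % b == 0:  # only store divisions that come out exact
                facts[f"{a}/{b}"] = a // b
    return facts

MEMORIZED = build_memorized()

def recall(expression):
    """Return the memorized result, or None if the text was never 'seen'."""
    return MEMORIZED.get(expression)

def solve_step_by_step(start, steps):
    """Apply one step at a time, using the transcript as scratch space."""
    transcript = [str(start)]
    for step in steps:
        # Read the latest intermediate result back from the model's own output.
        current = transcript[-1].split("=")[-1]
        result = recall(current + step)
        if result is None:
            return transcript, None
        transcript.append(f"{current}{step}={result}")
    return transcript, int(transcript[-1].split("=")[-1])

# One shot: the long expression is not among the memorized facts.
print(recall("((5+6-3-3-1)/2+3+7)/3+4"))  # None

# Step by step: every individual step matches a memorized fact.
steps = ["+6", "-3", "-3", "-1", "/2", "+3", "+7", "/3", "+4"]
transcript, answer = solve_step_by_step(5, steps)
print(transcript)
print(answer)  # 8
```

The toy exaggerates the effect, since a real model can generalize beyond exact matches, but it shows why writing out 5+6=11 before continuing turns one unfamiliar problem into a series of familiar ones.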