- “Reasoning” ≠ autocomplete. Cutting‑edge systems increasingly combine pattern completion with planning, tool use, and world models—what Yann LeCun calls “Mode‑2: reasoning and planning using the world model.” OpenReview
- Frontier models add deliberate thinking. OpenAI says o3 is its “most powerful reasoning model,” while o1 was “designed to spend more time thinking before [it] respond[s].” OpenAI
- Test‑time compute is becoming a knob. Anthropic’s Claude now exposes “extended thinking mode” and even a user‑set “thinking budget.” Anthropic
- Formal math is a breakthrough frontier. DeepMind’s AlphaProof + AlphaGeometry 2 solved 4/6 IMO 2024 problems (28/42 points—silver‑medal level), with Fields Medalist Sir Tim Gowers calling one construction “very impressive.” Google DeepMind
- Programming competitions crossed a line. A special Gemini 2.5 variant won gold at an ICPC event; skeptics like Stuart Russell caution that “claims of epochal significance seem overblown.” The Guardian
- Benchmarks are shifting to harder, reasoning‑heavy tests like GPQA and MMLU‑Pro because older suites (e.g., MMLU) saturate. arXiv
- Reasoning recipes matter. Methods such as Chain‑of‑Thought, Self‑Consistency, Tree‑of‑Thoughts, and ReAct reliably lift performance on multi‑step tasks. arXiv
- But limits remain. Melanie Mitchell: “No current AI system is anywhere close to forming humanlike abstractions or analogies.” PubMed
- Commonsense is still the missing substrate. Yejin Choi calls it “the dark matter of intelligence.” Quanta Magazine
- Causality is the next hill. Judea Pearl argues that more data alone won’t get you there; we need causal models. bayes.cs.ucla.edu
- Some experts say LLMs still don’t “really” reason. Gary Marcus: “We found no evidence of formal reasoning in language models.” garymarcus.substack.com
1) What “reasoning” means in AI (today)
In 2025, “reasoning” spans at least four layers:
- Pattern completion (classic next‑token prediction)
- Step‑by‑step inference (explicit chains of thought; search over multiple candidate solutions)
- Tool‑use & environment interaction (retrieval, calculators, APIs, agents)
- Planning in a world model (simulate future states; choose actions to minimize cost)
LeCun’s research codifies the leap from reactive policies to “Mode‑2: reasoning and planning using the world model”—i.e., simulating outcomes before acting. OpenReview
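To make the idea concrete, here is a toy "simulate before acting" loop: a hand‑written world model, a cost function, and a random‑shooting planner that imagines candidate action sequences and executes only the best one. The dynamics, the cost, and the sampling scheme are illustrative assumptions, not LeCun's proposed architecture.

```python
# Toy "Mode-2" planning: simulate candidate action sequences in a world model,
# score each imagined future, and act on the cheapest one.
# Dynamics, cost, and sampling here are illustrative stand-ins.
import random

def world_model(state: float, action: float) -> float:
    """Predict the next state. Stand-in for a learned dynamics model."""
    return state + action

def cost(state: float, goal: float = 10.0) -> float:
    """Lower is better: distance from the goal state."""
    return abs(goal - state)

def plan(state: float, horizon: int = 5, n_candidates: int = 200) -> list[float]:
    """Random-shooting planner: imagine rollouts, keep the cheapest plan."""
    best_plan, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = [random.uniform(-1.0, 1.0) for _ in range(horizon)]
        s = state
        for a in actions:            # simulate the future before acting
            s = world_model(s, a)
        if cost(s) < best_cost:
            best_plan, best_cost = actions, cost(s)
    return best_plan

first_action = plan(state=0.0)[0]    # execute only the first step of the best plan
print(f"chosen first action: {first_action:.2f}")
```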
On the “thinking‑more, not just bigger” trend: OpenAI’s o‑series leaned into deliberation, stating that o1 was “designed to spend more time thinking before [it] respond[s],” while o3 “pushes the frontier across coding, math, science, [and] visual perception.” OpenAI
Anthropic made that knob explicit: users can toggle “extended thinking mode” and even set a “thinking budget.” Anthropic
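As a sketch of what that knob looks like in practice, the call below assumes the Anthropic Python SDK's Messages API with extended thinking enabled; the model id and parameter names mirror the public docs at the time of writing and may change.

```python
# Test-time compute as a user-set knob, assuming the Anthropic Python SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",                    # illustrative model id
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},   # the "thinking budget"
    messages=[{"role": "user",
               "content": "Prove that the square root of 2 is irrational."}],
)

# Response blocks separate the visible reasoning from the final answer.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print("[answer]", block.text)
```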
2) What just got better
Formal, verifiable reasoning.
DeepMind paired language models with symbolic proof systems. In July 2024, AlphaProof + AlphaGeometry 2 solved four of six International Mathematical Olympiad problems—28/42 points, equivalent to an IMO silver medal. Sir Tim Gowers: “The fact that the program can come up with a non‑obvious construction like this is very impressive.” Google DeepMind
Competitive programming & algorithmic search.
A bespoke Gemini 2.5 variant took gold at a programming contest; Google touted a “profound leap in abstract problem‑solving,” while Stuart Russell urged caution, calling “claims of epochal significance…overblown.” Quoc Le likened it to Deep Blue and AlphaGo moments. The Guardian
Agentic, tool‑using models.
Anthropic reports that longer, visible reasoning plus iterative tool use measurably lifts scores and enables open‑ended tasks. Their post introduces serial and parallel test‑time compute scaling (multiple independent “thoughts” voted or scored), with big gains on the GPQA science benchmark. Anthropic
3) The techniques powering the surge
- Chain‑of‑Thought (CoT): show worked steps; unlocks multi‑step arithmetic and logic. arXiv
- Self‑Consistency: sample many CoTs and vote—a simple but powerful accuracy boost. arXiv
- Tree‑of‑Thoughts (ToT): branch and backtrack through candidate reasoning paths (deliberate search). arXiv
- ReAct: interleave reasoning with actions, so the model can fetch evidence, then revise plans. arXiv
Together, these methods make test‑time compute a first‑class resource: more thinking steps (or more sampled thoughts) often means better answers—exactly what Anthropic’s thinking budgets operationalize. Anthropic
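A minimal Self‑Consistency sketch, assuming a hypothetical `sample_chain_of_thought` helper that wraps whatever model you call: sample several chains of thought at non‑zero temperature, extract each final answer, and keep the majority vote. Raising `n_samples` is exactly the "spend more test‑time compute" lever.

```python
# Self-Consistency: many sampled reasoning traces, one majority-voted answer.
from collections import Counter

def sample_chain_of_thought(question: str, temperature: float = 0.8) -> str:
    """Hypothetical stand-in: return a reasoning trace ending in 'Answer: <value>'."""
    raise NotImplementedError("call your model of choice here")

def extract_answer(completion: str) -> str:
    """Take whatever follows the last 'Answer:' marker."""
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistent_answer(question: str, n_samples: int = 10) -> str:
    """More samples = more test-time compute = usually higher accuracy."""
    answers = [extract_answer(sample_chain_of_thought(question))
               for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```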
4) Benchmarks and what they really measure
Because older suites (MMLU, GSM8K) started to saturate, researchers introduced harder, more “google‑proof” evaluations. GPQA uses domain‑expert science questions; MMLU‑Pro expands choices and emphasizes reasoning over recall. These aim to separate true reasoning from memorization and prompt quirks. arXiv
Even so, meta‑analyses warn leaderboards can mislead; performance may hinge on contamination, format sensitivity, or overfitting. (See the 2025 “Line Goes Up?” paper on benchmark limitations.) arXiv
5) Where reasoning still breaks
- Humanlike abstraction and analogy. Melanie Mitchell’s verdict remains sobering: “No current AI system is anywhere close” to humanlike abstraction/analogy. PubMed
- Commonsense. Yejin Choi calls commonsense “the dark matter of intelligence,” the invisible substrate that shapes interpretation and prediction. Quanta Magazine
- Causality. Judea Pearl argues that moving up the “ladder of causation” requires explicit causal models; “merely collecting big data would not have helped us go up the ladder.” bayes.cs.ucla.edu
- Are LLMs truly reasoning? Gary Marcus summarizes a critical stance: “We found no evidence of formal reasoning in language models… [their] behavior is better explained by sophisticated pattern matching.” garymarcus.substack.com
6) The symbolic–neural détente
The most compelling 2025 systems are hybrids: statistical learners for pattern discovery + symbolic/search components for precision and verifiability. DeepMind’s AlphaProof embodies this neuro‑symbolic blend (LM for proposing, symbolic prover for verifying), achieving formal, checkable proofs—a direction many believe will generalize beyond math. Google DeepMind
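A toy rendering of that propose‑then‑verify pattern (an illustration, not AlphaProof): a generator proposes candidate solutions, and a symbolic checker, here SymPy, keeps only the claims it can actually verify.

```python
# Neuro-symbolic pattern in miniature: propose with a learner, verify symbolically.
import sympy as sp

x = sp.symbols("x")
equation = sp.Eq(x**2 - 5*x + 6, 0)

# Pretend these candidates came from a language model; one is wrong on purpose.
proposed_roots = [2, 3, 4]

verified = [r for r in proposed_roots
            if sp.simplify(equation.lhs.subs(x, r) - equation.rhs) == 0]
print(verified)  # [2, 3] -- only the checkable claims survive
```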
7) State of the frontier models
- OpenAI o‑series. o1: “spend more time thinking.” o3: “our most powerful reasoning model,” reporting fewer major errors vs. o1 on hard tasks and new SOTA on several reasoning benchmarks. OpenAI
- Anthropic Claude 3.7. Extended thinking and visible thoughts (research preview) with safety trade‑offs; performance scales with “thinking tokens” and parallel sampling. Anthropic
- Google DeepMind Gemini 2.5. An ICPC‑level gold from a bespoke Gemini 2.5 variant, plus silver‑level IMO math via AlphaProof and AlphaGeometry 2’s formal methods, point to rapid progress—but compute, bespoke training, and task specificity complicate “general reasoning” claims. The Guardian
8) Practical guidance: using “reasoners” in the real world
- Prefer “verify‑then‑trust.” Ask models to explain, check, and re‑solve with Self‑Consistency or ensemble prompts; when possible, verify with tools (calculators, code, retrieval); see the sketch after this list. arXiv
- Escalate compute on hard cases. Allow more steps / samples (ToT, extended thinking) on high‑stakes queries; cap budgets elsewhere. arXiv
- Formalize critical logic. For domains like math, law, safety rules, prefer symbolic checks or formal verification where feasible. Google DeepMind
- Benchmark smartly. Track progress on GPQA / MMLU‑Pro and domain‑specific evals, but beware contamination and format artifacts. arXiv
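A minimal sketch of the verify‑then‑trust loop above, assuming a hypothetical `ask_model` helper: re‑compute the claimed arithmetic with a tool, and re‑sample with more effort only when the check fails, which is also a cheap form of "escalate compute on hard cases."

```python
# Verify-then-trust: accept an answer only after an independent tool check.
def ask_model(question: str, extra_effort: bool = False) -> str:
    """Hypothetical stand-in for your LLM call; extra_effort could raise the budget."""
    raise NotImplementedError("call your model of choice here")

def tool_check(expression: str, claimed: float) -> bool:
    """Re-compute the arithmetic with Python instead of trusting the model."""
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:
        return False                 # refuse anything that isn't plain arithmetic
    return abs(eval(expression) - claimed) < 1e-9

def answer_with_verification(expression: str, max_tries: int = 3) -> float:
    question = f"Compute {expression}. Reply with the number only."
    for attempt in range(max_tries):
        reply = ask_model(question, extra_effort=attempt > 0)  # escalate on retries
        try:
            value = float(reply.strip())
        except ValueError:
            continue
        if tool_check(expression, value):
            return value             # trusted: the tool agrees
    raise RuntimeError("no verified answer within budget")
```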
9) Expert voices — short quotes you can cite
- Yann LeCun (Meta): “Mode‑2: reasoning and planning using the world model.” OpenReview
- OpenAI (o3): “Our most powerful reasoning model… pushes the frontier across coding, math, science.” OpenAI
- OpenAI (o1): “Designed to spend more time thinking before [it] respond[s].” OpenAI
- Anthropic: “Users can toggle ‘extended thinking mode’… and set a ‘thinking budget’.” Anthropic
- Sir Tim Gowers (on DeepMind’s solution): “Very impressive, and well beyond what I thought was state of the art.” Google DeepMind
- Quoc Le (DeepMind): “For me it’s a moment… equivalent to Deep Blue [and] AlphaGo.” The Guardian
- Stuart Russell (Berkeley): “Claims of epochal significance seem overblown.” The Guardian
- Melanie Mitchell (SFI): “No current AI system is anywhere close to humanlike abstractions or analogies.” PubMed
- Yejin Choi (UW/AI2): “Common sense is the dark matter of intelligence.” Quanta Magazine
- Judea Pearl (UCLA): “Merely collecting big data would not have helped us go up the ladder.” bayes.cs.ucla.edu
- Gary Marcus: “We found no evidence of formal reasoning in language models.” garymarcus.substack.com
10) What to watch between now and 2026
- Reasoning as a service knob (thinking budgets, visible/hidden thoughts, proof‑carrying answers). Anthropic
- Formalization pipelines (natural‑language → formal logic → verified proofs) expanding beyond math to regulation and safety policies. Google DeepMind
- World‑model agents (long‑horizon planning, memory, tool use) getting more robust and sample‑efficient. OpenReview
- Next‑gen benchmarks combining hard science, multi‑step math, and tool‑use under strict anti‑contamination protocols. arXiv
Sources & further reading (selected)
- LeCun, “A Path Towards Autonomous Machine Intelligence” (world‑model reasoning). OpenReview
- OpenAI, Introducing o1 / Introducing o3 (deliberation; “most powerful reasoning model”). OpenAI
- Anthropic, “Claude’s extended thinking” (thinking mode & budget; scaling test‑time compute). Anthropic
- DeepMind, AlphaProof & AlphaGeometry 2—IMO silver (formal math, hybrid neuro‑symbolic). Google DeepMind
- The Guardian, Gemini 2.5 programming result (quotes from Quoc Le, Stuart Russell). The Guardian
- Benchmarks: GPQA; MMLU‑Pro; analysis of benchmark pitfalls. arXiv
- Methods: Chain‑of‑Thought, Self‑Consistency, Tree‑of‑Thoughts, ReAct. arXiv
- Perspectives: Mitchell on abstraction/analogy; Choi on commonsense; Pearl on causality; Marcus on limits of LLM reasoning. garymarcus.substack.com
Bottom line
AI systems are getting better at behaving like reasoners—especially when they’re allowed to think longer, branch their thoughts, use tools, or prove what they claim. But human‑level abstraction, commonsense, and causal understanding remain open problems. Expect more knobs for thinking, more hybrid (neuro‑symbolic) pipelines, and tougher, cleaner benchmarks to separate genuine reasoning from clever pattern matching.