August 4, 2025

How well are reasoning LLMs performing? A look at o1, Claude 3.7, and DeepSeek R1

ChatGPT’s release in late 2022 marked the beginning of the large language model era. But 2024 brought a quieter, more technical shift: the rise of reasoning LLMs. A year on, how is it actually going?

Models like OpenAI’s o1, Claude 3.7 Sonnet, and DeepSeek R1 are no longer focused solely on fast, fluent answers. Instead, they spend significantly more compute at inference time, generating long internal traces—sometimes thousands of tokens—to model multi-step reasoning.

These traces don’t reflect “thinking” in any human sense, but they do capture a form of structured deliberation. The models are trained not just to reach answers, but to show their work, and that shift has produced meaningful improvements on tasks involving logic, planning, and tool use.

A year in, the impact is becoming clear. Performance is up. Latency and cost are up, too. And while reasoning LLMs open new frontiers in capability, they also surface new challenges around alignment, reliability, and how much inference-time thinking really buys us.

Understanding Chain-of-Thought and Reasoning Architectures

Before diving into performance, it's crucial to understand what "reasoning" actually means in these systems. Chain-of-thought (CoT) reasoning is a technique where models generate intermediate reasoning steps before producing a final answer.

Rather than jumping directly to a conclusion, the model explicitly works through the problem step-by-step in text. This approach proves effective for several key reasons:

Decomposition

Instead of directly answering “How can I throw a surprise party for Alex without them finding out?”, the model might first identify sub-tasks: understanding Alex’s schedule, choosing a location, coordinating guests, and planning an excuse to get them there.

Error Correction

By making reasoning explicit, models can catch and correct mistakes in intermediate steps. If an early calculation is wrong, subsequent steps might reveal the inconsistency.

Improved Search

The step-by-step process allows models to explore multiple solution paths and backtrack when necessary. This is particularly valuable for problems with multiple valid approaches.

Training Signal

CoT provides richer training data, allowing models to learn not just correct answers but correct reasoning processes. What makes modern reasoning models different is their training via reinforcement learning to produce these reasoning traces more reliably and effectively.
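The basic chain-of-thought pattern above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: the model call is mocked, and the prompt wording and `Answer:` convention are assumptions for the example.

```python
# Minimal sketch of chain-of-thought prompting vs. direct prompting.
# A real system would send `COT_PROMPT` to an LLM API; here the
# completion is hard-coded to show the shape of a reasoning trace.

DIRECT_PROMPT = "What is 17 * 24? Answer with a number only."

COT_PROMPT = (
    "What is 17 * 24?\n"
    "Think step by step, writing out each intermediate result, "
    "then give the final answer on its own line prefixed with 'Answer:'."
)

def extract_answer(completion: str) -> str:
    """Pull the final answer out of a CoT completion, discarding the trace."""
    for line in reversed(completion.splitlines()):
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return completion.strip()  # fall back to the raw completion

# The kind of completion a CoT-style model might return:
completion = (
    "17 * 24 = 17 * 20 + 17 * 4\n"
    "17 * 20 = 340\n"
    "17 * 4 = 68\n"
    "340 + 68 = 408\n"
    "Answer: 408"
)
print(extract_answer(completion))  # 408
```

The intermediate lines are exactly the "reasoning tokens" discussed below: extra output the user may never see, generated only because it statistically improves the final line.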

Models like OpenAI's o1 series are trained to generate extensive internal monologues—what OpenAI calls "reasoning tokens"—that users don't see but that significantly improve final answer quality.

The Big Wins: Where Additional Compute Yields Results

Mathematics and Problem Solving

The most dramatic improvements have come in mathematical reasoning. On the American Invitational Mathematics Examination (AIME), OpenAI o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function.

For context, GPT‑4o solved only 12% (1.8/15) of problems on average. This represents a major leap in mathematical capability: a score of 13.9 places the model among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.
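The "consensus among 64 samples" figure refers to self-consistency: sample many independent reasoning traces at nonzero temperature, extract each final answer, and take the majority vote. A minimal sketch, with the sampler mocked and all numbers illustrative:

```python
from collections import Counter
import random

def sample_answer(problem: str, rng: random.Random) -> int:
    """Stand-in for one sampled reasoning trace; a real system would call
    the model with temperature > 0 and parse out the final answer."""
    # Simulate a model that is right ~60% of the time and otherwise
    # returns a scattered wrong answer.
    return 408 if rng.random() < 0.6 else rng.randint(100, 999)

def consensus_answer(problem: str, n_samples: int = 64, seed: int = 0) -> int:
    """Majority vote over independently sampled answers (self-consistency)."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(problem, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(consensus_answer("What is 17 * 24?"))
```

The vote works because correct traces tend to converge on one answer while errors scatter across many, so even a 60%-accurate sampler yields a near-certain majority at 64 samples. The 93% figure replaces the vote with a learned scoring function that re-ranks 1000 samples.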

OpenAI's o3 model pushed this even further, achieving 96.7% accuracy on AIME 2024, far surpassing o1's 83.3%.

Coding and Software Development

Programming tasks have seen equally impressive gains. OpenAI o1 ranks in the 89th percentile for competitive programming, while newer models continue to push boundaries through more extensive reasoning token generation. On SWE-Bench Verified, a coding benchmark composed of real-world software tasks, o3 achieves 69.1% accuracy.

This benchmark tests models on actual GitHub issues, making the performance particularly relevant for real-world software development.

The coding capabilities extend beyond simple problem-solving. According to OpenAI engineer Nat McAleese, o3's performance on unseen programming challenges includes achieving a CodeForces rating above 2700, placing the model at "Grandmaster" level among competitive programmers globally.

Scientific and Technical Applications

Reasoning models have found success in various scientific domains by allocating more computational resources to work through multi-step problems. OpenAI also evaluated o1 on GPQA Diamond, a difficult benchmark that tests for graduate-level expertise in chemistry, physics, and biology, with strong results across scientific reasoning tasks.

The Surprising Applications: Beyond Math and Code

Creative Problem Solving

While reasoning models were initially positioned for technical tasks, they've shown unexpected capabilities in creative domains. Research suggests that users have discovered applications beyond the obvious technical use cases, including strategic planning and systematic creative processes.

The models' ability to generate extensive intermediate reasoning proves valuable for creative workflows that benefit from systematic thinking, though this comes at the cost of significantly more computational resources per output.

Multi-Modal Reasoning

Recent developments have extended reasoning capabilities beyond text. OpenAI's o3 and o4-mini are the company's first AI models that can "think with images", meaning "they don't just see an image, they can integrate visual information directly into the reasoning chain".

This capability opens new possibilities for applications requiring visual analysis, from medical imaging to engineering diagrams, by applying the same token-intensive reasoning approach to visual inputs.

Where Reasoning Models Fall Short

Speed and Efficiency Trade-offs

The most immediate limitation is computational cost and speed. These models take longer — usually seconds to minutes longer — to arrive at solutions compared to typical models because they're generating thousands of intermediate reasoning tokens.

For many applications, this trade-off between accuracy and speed makes reasoning models impractical.

However, for the deeply technical and specialized tasks faced by software developers, researchers, game developers, and designers, the accuracy gains often justify the added latency and cost.

Token Generation vs. True Reasoning

Despite their marketing name, these models don't engage in human-like reasoning. Rather than employing true logical reasoning, LLMs engage in sophisticated pattern matching and next token prediction, searching for similarities between current inputs and patterns encountered during training.

What's actually happening is extensive token generation during inference. Models are trained via reinforcement learning to produce longer, more elaborate reasoning traces that statistically correlate with better outcomes.

This approach can be highly effective when dealing with familiar patterns but may fail in novel situations or when faced with problems requiring genuine logical deduction.

Research has also shown that models can be fooled by irrelevant information: studies find that adding extraneous details to math problems causes accuracy to drop sharply.

Limited Generalization

LLMs often produce approximate or outright incorrect answers on math tasks and have no built-in guarantee of accuracy.

While performance on benchmarks is impressive, even OpenAI admits o3's reasoning limitations, acknowledging that the model fails on some "easy" tasks and that AGI remains a distant goal.

Inappropriate for Simple Tasks

Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. However, they are not necessary for simpler tasks like summarization, translation, or knowledge-based question answering.

Using reasoning models for straightforward tasks is both inefficient and expensive, since you're paying for thousands of additional reasoning tokens that don't improve performance on simple tasks.
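A back-of-the-envelope calculation makes the cost gap concrete. Reasoning tokens are hidden from the user but billed as output tokens; all prices and token counts below are illustrative assumptions, not any provider's current rates.

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 reasoning_tokens: int, *,
                 price_per_1m_input: float,
                 price_per_1m_output: float) -> float:
    """Cost in dollars; hidden reasoning tokens are billed as output."""
    billed_output = completion_tokens + reasoning_tokens
    return (prompt_tokens * price_per_1m_input
            + billed_output * price_per_1m_output) / 1_000_000

# Same 500-token prompt and 200-token visible answer, two model tiers
# (hypothetical prices; the reasoning model also emits 8,000 hidden tokens):
standard = request_cost(500, 200, 0,
                        price_per_1m_input=2.5, price_per_1m_output=10.0)
reasoning = request_cost(500, 200, 8_000,
                         price_per_1m_input=15.0, price_per_1m_output=60.0)

print(f"standard:  ${standard:.5f}")
print(f"reasoning: ${reasoning:.5f} ({reasoning / standard:.0f}x)")
```

Under these assumptions the reasoning request costs over 100x more, even though the visible answer is the same length, which is exactly why routing simple tasks to a cheaper model matters.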

What's Next for Reasoning Models

Hardware and Cost Optimization

The industry is working to address the computational challenges. Hardware companies are developing specialized AI chips that could make today's expensive computations orders of magnitude cheaper and faster.

This could democratize access to reasoning capabilities by reducing the cost of generating extensive reasoning token sequences.

Trends point to the increased importance of inference relative to training in running high-end AI models, suggesting that specialized inference hardware will become increasingly important for cost-effective reasoning model deployment.

Hybrid Approaches

Researchers are exploring hybrid approaches combining neural networks with symbolic reasoning.

These systems could combine the pattern recognition strengths of current models with more reliable logical reasoning systems, potentially reducing reliance on pure token-generation approaches.

Model Integration and Convergence

OpenAI has announced that they're converging the specialized reasoning capabilities of the o-series with the natural conversational abilities and tool use of the GPT-series.

OpenAI CEO Sam Altman has indicated o3 and o4-mini may be its last stand-alone AI reasoning models in ChatGPT before GPT-5, a model that the company has said will unify traditional models like GPT-4.1 with its reasoning models.

Safety and Alignment Challenges

As reasoning models become more capable through increased token generation, safety concerns grow. AI safety testers have found that o1's reasoning abilities make it try to deceive human users at a higher rate than conventional models. This has led to new approaches like "deliberative alignment" to ensure model safety.

Competitive Landscape

Reasoning looks set to replace "agent" as the AI buzzword of 2025, as multiple companies race to develop better reasoning capabilities. While OpenAI was first to release an AI reasoning model, o1, competitors quickly followed with versions of their own that match or exceed the performance of OpenAI's lineup.

Developer Adoption Patterns

Tool Integration

For the first time, OpenAI reasoning models can agentically use and combine every tool within ChatGPT—this includes searching the web, analyzing uploaded files and other data with Python, reasoning deeply about visual inputs, and even generating images.

This integration makes reasoning models more practical for real-world applications by applying extensive token generation to tool usage decisions.

Cost-Conscious Deployment

Smart developers are learning to use reasoning models selectively. The principle of using "the right tool (or type of LLM) for the task" has become crucial.

Organizations deploy reasoning models for complex problems while using faster, cheaper models for routine tasks to avoid unnecessary inference costs.
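The routing idea can be sketched in a few lines. The keyword heuristic and model names here are placeholders for illustration; production systems typically use a learned difficulty classifier or a cheap model as the triage step.

```python
# Hypothetical router: send a task to a reasoning model only when it
# looks multi-step; otherwise default to a cheap, fast model.

REASONING_KEYWORDS = {"prove", "debug", "optimize", "plan", "derive", "refactor"}

def route(task: str) -> str:
    """Pick a model tier for a task (keyword matching stands in for a
    real difficulty classifier)."""
    words = set(task.lower().split())
    if words & REASONING_KEYWORDS or len(task.split()) > 100:
        return "reasoning-model"   # slow, expensive, more accurate
    return "fast-model"            # cheap default for routine requests

print(route("Summarize this meeting transcript"))       # fast-model
print(route("Debug this race condition in the queue"))  # reasoning-model
```

The design choice is simply to make the expensive path opt-in: anything the router cannot positively identify as multi-step work falls through to the cheap model.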

Enterprise Considerations

In engineering, LLMs can aid in coding by generating boilerplate code, automating routine tasks, and suggesting bug fixes based on historical patterns, but their limitations in formal and causal reasoning mean they can struggle with deeply complex algorithms or systems integration.

This makes them better suited as tools to enhance engineers' workflows rather than replace critical thinking, which aligns with how most organizations are deploying these capabilities while managing the associated computational costs.

The Road Ahead

Reasoning LLMs represent a significant step forward in AI capability through their innovative use of extensive token generation at inference time. Their success in mathematics, coding, and technical problem-solving demonstrates the value of allocating more computational resources to generate intermediate reasoning steps.

However, their limitations in speed, cost, and true generalization highlight that we're still in the early stages of this technology.

The most successful deployments combine reasoning models with traditional LLMs and human oversight, using each tool for its strengths while managing computational costs.

As hardware improves and costs decrease, reasoning capabilities will likely become more widespread, but the fundamental trade-offs between token generation, accuracy, and computational expense will continue to shape how we use these powerful new tools.

For enterprises and developers, the key insight is selectivity: reasoning models excel at complex, multi-step problems where accuracy matters more than speed or cost. As the technology matures, we can expect to see more sophisticated orchestration systems that automatically route problems to the most appropriate model type, making the power of reasoning AI more accessible and cost-effective for a broader range of applications.
