
Microsoft isn’t like OpenAI, Google, and Meta, especially when it comes to large language models. While the other tech giants launch multiple models, almost overwhelming users with choices, Microsoft releases only a few, and those models consistently make it big among developers around the world. In its latest release, Microsoft has introduced two reasoning models, Phi-4-Reasoning and Phi-4-Reasoning-Plus, both trained on the base Phi-4 model. The two Phi-4-Reasoning models compete with mighty models like o1, o3-mini, and DeepSeek R1. In this blog, we will dive into the technical details, architecture, training methods, and performance of the Phi-4-Reasoning models.

Let’s explore the Phi-4-Reasoning models.

Table of contents

  • What is Phi-4 Reasoning?
  • Phi 4 Reasoning Models
  • Key Features of Phi-4-Reasoning Models
    • Data-Centric Training
    • Supervised Fine-Tuning (SFT)
    • Reinforcement Learning
  • Architecture of Phi-4-Reasoning Models
  • Phi-4-Reasoning Models: Benchmark Performance 
  • How to Access Phi-4-Reasoning Models?
  • Phi-4-Reasoning: Hands-On Applications
    • Task 1: Logical Thinking
    • Task 2: Explain the Working of LLMs to an 8-Year-Old Kid
  • Phi-4 Reasoning vs o3-mini: Comparison
  • Applications of Phi-4-Reasoning Models
  • Conclusion

What is Phi-4 Reasoning?

Phi-4 is not new to the LLM world. This small but mighty language model broke the internet when it launched last year. Now, to meet the growing demand for reasoning models, Microsoft has released the Phi-4-Reasoning models. These are 14B-parameter models that excel at complex reasoning tasks involving mathematics, coding, and STEM questions. Unlike the general-purpose Phi-4 series, Phi-4-Reasoning is specifically optimized for long-chain reasoning: the ability to systematically break complex, multi-step problems into logical steps.

Also Read: Phi-4: Redefining Language Models with Synthetic Data

Phi 4 Reasoning Models

The two reasoning models released by Microsoft are:

  • Phi-4-Reasoning: A reasoning model trained using supervised fine-tuning (SFT) on high-quality datasets. It is preferred for tasks that require faster responses under tighter latency or compute constraints.
  • Phi-4-Reasoning-Plus: A variant further trained with reinforcement learning (RL) to improve accuracy, at the cost of generating roughly 50% more tokens than its counterpart. This increases latency, so it is recommended for high-accuracy tasks.

The two 14B models currently support only text input, and Microsoft has released them as open-weight so developers can freely test and fine-tune them based on their needs. Here are some key highlights of the models:

  • Developer: Microsoft Research
  • Model Variants: Phi-4-Reasoning, Phi-4-Reasoning-Plus
  • Base Architecture: Phi-4 (14B parameters), dense decoder-only Transformer
  • Training Method: Supervised fine-tuning on chain-of-thought data; the Plus variant adds reinforcement learning
  • Training Duration: 2.5 days on 32× H100-80G GPUs
  • Training Data: 16B tokens total (~8.3B unique), from synthetic prompts and filtered public-domain data
  • Training Period: January – April 2025
  • Data Cutoff: March 2025
  • Input Format: Text input, optimized for chat-style prompts
  • Context Length: 32,000 tokens
  • Output Format: A chain-of-thought reasoning block followed by a summarization block
  • Release Date: April 30, 2025

Key Features of Phi-4-Reasoning Models

For Phi-4-Reasoning, the team took several innovative steps involving data selection, training methodology, and performance evaluation. Some of the key things they did were:

Data-Centric Training

Data curation for training the Phi-4-Reasoning models relied not just on sheer quantity but placed equal emphasis on data quality. The team specifically chose data at the “edge” of the base model’s capabilities, which ensured that the training examples were solvable but not trivially so.

The main steps involved in building the dataset for the Phi-4-Reasoning models were:

  • Seed Database: The Microsoft team started with publicly available datasets like AIME and GPQA. These datasets contain algebra and geometry problems that require multi-step reasoning.
  • Synthetic Reasoning Chains: To get comprehensive and detailed step-by-step reasoned-out responses for the problems, the Microsoft team relied on OpenAI’s o3-mini model.

For example, for the question “What is the derivative of sin(x²)?”, o3-mini gave the following output:

Step 1: Apply the chain rule: d/dx sin(u) = cos(u) · du/dx.

Step 2: Let u = x² ⇒ du/dx = 2x.

Final Answer: cos(x²) · 2x.

These artificially or synthetically generated chains of well-reasoned responses gave a clear blueprint on how a model should structure its own reasoning responses.

  • Selecting “Teachable Moments”: The developer team deliberately chose prompts that challenged the base Phi-4 model while remaining solvable, i.e. problems on which Phi-4 initially showed around 50% accuracy. This approach ensured that the training process avoided “easy” data that merely reinforced existing patterns and focused instead on “structured reasoning”.

The team essentially wanted the Phi-4-Reasoning models to learn the way humans usually do: by practicing on problems just beyond their current ability.
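
To make this “teachable moments” idea concrete, here is a minimal sketch of what such a filter could look like. It is purely illustrative: the accuracy estimator, the 0.3–0.7 band, and the helper names are assumptions, not Microsoft’s actual pipeline.

# Illustrative sketch only: keep prompts that the base model solves sometimes,
# but not reliably (the "edge of capability", around 50% accuracy).
from typing import Callable, List

def select_teachable_prompts(
    prompts: List[str],
    accuracy_fn: Callable[[str], float],  # e.g. fraction of correct base-Phi-4 samples
    low: float = 0.3,
    high: float = 0.7,
) -> List[str]:
    """Return prompts whose estimated base-model accuracy falls near 50%."""
    return [p for p in prompts if low <= accuracy_fn(p) <= high]

# Dummy usage; a real accuracy_fn would sample the base model several times
# and grade the answers against reference solutions.
dummy_scores = {"easy": 0.95, "edge": 0.5, "too_hard": 0.05}
print(select_teachable_prompts(list(dummy_scores), dummy_scores.get))  # ['edge']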

Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT) is the process of improving a pre-trained language model by training it on carefully selected input–output pairs with high-quality responses. For the Phi-4-Reasoning models, this meant starting with the base Phi-4 model and then refining it using reasoning-focused tasks. Essentially, Phi-4-Reasoning was trained to learn and follow the step-by-step reasoning patterns seen in responses from o3-mini.

Training Details

  • Batch Size: Kept at 32. This relatively small batch size keeps each update focused on a limited set of examples.
  • Learning Rate: Set to 7e-5, a moderate rate that avoids overshooting the optimal weights during updates.
  • Optimizer: The standard AdamW optimizer was used, a deep learning optimizer that balances speed and stability.
  • Context Length: Extended to 32,768 tokens, double the 16K-token limit of the base Phi-4 model, allowing it to handle much longer reasoning chains (a minimal configuration sketch using these values follows this list).
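
As a rough illustration of how these hyperparameters map onto the Hugging Face stack, here is a minimal configuration sketch. It is not Microsoft’s training script: the dataset, trainer wiring, and distributed setup are omitted, and only the quoted numbers (batch size 32, learning rate 7e-5, AdamW, 32,768-token context) come from the report; everything else is an assumption.

# Minimal sketch: the reported SFT hyperparameters expressed as Hugging Face
# TrainingArguments. Anything not quoted above is an assumption.
from transformers import TrainingArguments

sft_args = TrainingArguments(
    output_dir="phi4-reasoning-sft",
    per_device_train_batch_size=1,   # 1 per GPU x 32 GPUs = global batch of 32 (assumed split)
    gradient_accumulation_steps=1,
    learning_rate=7e-5,              # reported learning rate
    optim="adamw_torch",             # AdamW optimizer
    bf16=True,                       # assumption: mixed precision on H100 GPUs
    num_train_epochs=1,              # assumption: the actual schedule is not stated here
    logging_steps=10,
)

# The 32,768-token context length is enforced during tokenization (or via the
# SFT trainer's maximum sequence length), not through TrainingArguments itself.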

Using SFT during early training taught the model to use the <think> and </think> tokens to separate its internal reasoning from the rest of the response, which made its decision-making process transparent. The model also showed steady improvements on the AIME benchmarks, proving that it was not just copying formats but building genuine reasoning logic.

Reinforcement Learning

Reinforcement learning teaches a model to do better using feedback on its generated outputs: the model gets a reward every time it answers correctly and a penalty each time it responds incorrectly. RL was used to further train the Phi-4-Reasoning-Plus model, refining its math-solving skills by evaluating responses for both accuracy and structure.

How does RL work?

  • Reward Design: The model received a reward of +1 for each correct response and -0.5 for each incorrect one. It was also penalized for repetitive filler such as “Let’s see.. Let’s see..”.
  • Algorithm: GRPO (Group Relative Policy Optimization), a variant of RL that balances exploration and exploitation, was used.
  • Results: Phi-4-Reasoning-Plus achieved 82.5% accuracy on AIME 2025, while Phi-4-Reasoning scored 71.4%. It also showed improved performance on Omni-MATH and TSP (Traveling Salesman Problem).

RL training allowed the model to refine its reasoning steps iteratively and helped reduce hallucinations in the generated outputs.
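
As a toy illustration of this reward scheme, the sketch below scores a response with +1/-0.5 and docks points for repeated filler. The is_correct flag, the regular expression, and the 0.25 repetition penalty are assumptions made for demonstration; the actual reward shaping Microsoft used is more involved.

# Toy reward function in the spirit of the scheme described above.
# The penalty value and the repetition check are illustrative assumptions.
import re

def reward(response: str, is_correct: bool) -> float:
    """+1 for a correct answer, -0.5 for an incorrect one, minus a small
    penalty if the response immediately repeats filler like "Let's see.."."""
    score = 1.0 if is_correct else -0.5
    if re.search(r"\b(let'?s see)\b[\s.,]*\1", response, flags=re.IGNORECASE):
        score -= 0.25  # assumed penalty for repetitive phrasing
    return score

print(reward("Let's see.. Let's see.. The answer is 42.", is_correct=True))  # 0.75
print(reward("The answer is 41.", is_correct=False))                         # -0.5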

Architecture of Phi-4-Reasoning Models

The core architecture of the Phi-4-Reasoning models is similar to the base Phi-4 model, but some key modifications were made to support reasoning tasks.

  1. Two placeholder tokens from Phi-4 were repurposed as reasoning delimiters. These tokens help the model differentiate between the raw input and its internal reasoning (a small parsing sketch using these delimiters follows this list).
    • <think>: marks the start of a reasoning block.
    • </think>: marks the end of a reasoning block.
  2. The Phi-4-Reasoning models received an extended context window of 32K tokens to accommodate the longer reasoning chains.
  3. The models use rotary position embeddings (RoPE) to better track token positions in long sequences, which helps them maintain coherence.
  4. The models are trained to work efficiently on consumer hardware including devices like mobiles, tablets and desktops.
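
Because each completion is a reasoning block followed by a summary, downstream code usually wants to separate the two. Here is a minimal parsing sketch, assuming the <think> / </think> delimiters described above; adjust the tags if the model card you are using specifies different markers.

# Split a completion into its reasoning block and final answer, assuming the
# reasoning is wrapped in <think> ... </think> tags.
def split_reasoning(output: str) -> tuple[str, str]:
    start, end = "<think>", "</think>"
    if start in output and end in output:
        reasoning = output.split(start, 1)[1].split(end, 1)[0].strip()
        answer = output.split(end, 1)[1].strip()
        return reasoning, answer
    return "", output.strip()  # fall back if the delimiters are missing

reasoning, answer = split_reasoning(
    "<think>Apply the chain rule: d/dx sin(x^2) = cos(x^2) * 2x.</think> The derivative is 2x*cos(x^2)."
)
print(answer)  # -> The derivative is 2x*cos(x^2).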

Phi-4-Reasoning Models: Benchmark Performance

Phi-4-Reasoning models were evaluated on various benchmarks to test their performance against different models on varying tasks.

[Benchmark comparison chart]

  • AIME 2025: A benchmark that tests advanced math reasoning at recent exam difficulty. Phi-4-Reasoning-Plus outperforms most of the top-performing models, like o1 and Claude 3.7 Sonnet, but is still behind o3-mini-high.
  • Omni-MATH: A benchmark that evaluates diverse math reasoning across topics and difficulty levels. Both Phi-4-Reasoning and Phi-4-Reasoning-Plus outperform almost all models, trailing only DeepSeek R1.
  • GPQA: A benchmark that tests graduate-level professional QA reasoning. The two Phi reasoning models lag behind giants like o1, o3-mini-high, and DeepSeek R1.
  • SAT: A benchmark that evaluates U.S. high-school-level academic reasoning (a math-verbal blend). Phi-4-Reasoning-Plus stands among the top three contenders, with Phi-4-Reasoning following close behind.
  • Maze: A benchmark that tests navigation and pathfinding reasoning. Here, the Phi-4-Reasoning models lag behind top-tier models like o1 and Claude 3.7 Sonnet.

On other benchmarks like Spatial Map, TSP, and BA-Calendar, both Phi-4-Reasoning models perform decently.

Also Read: How to Fine-Tune Phi-4 Locally?

How to Access Phi-4-Reasoning Models?

The two Phi-4-Reasoning models are available on Hugging Face:

  • Phi-4 Reasoning
  • Phi-4 Reasoning-Plus

Click on the links above to go to the Hugging Face page for each model. In the top-right corner of the model page, click on “Use this model”, select “Transformers”, and copy the following code:

# Use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": "Who are you?"},
]

pipe = pipeline("text-generation", model="microsoft/Phi-4-reasoning")
pipe(messages)

Since these are 14B-parameter models, they require around 40 GB of VRAM (GPU memory). You can run them on Colab Pro or RunPod; for this blog, we ran the model on RunPod using an A100 GPU.

Install Required Libraries

First, ensure you have the transformers library installed. You can install it using pip:

pip install transformers

Load the Model

Once the library is installed, you can load the Phi-4-Reasoning model in your notebook:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/Phi-4-reasoning", max_new_tokens=4096)

Make sure to set max_new_tokens=4096: the model generates its entire reasoning chain before the final answer, and a smaller token budget can cut its output off midway.
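
With the pipeline loaded, you can send a chat-style message and render the reply in a notebook, which is how the hands-on outputs below are displayed. The following is a minimal sketch; depending on your setup, you may also need to pass device_map="auto" and a half-precision torch_dtype when creating the pipeline so the 14B model fits on a single GPU. The indexing assumes the pipeline returns the conversation with the assistant reply appended, as in the calls shown later.

# Example call: send a chat-style message and render the assistant's reply.
from IPython.display import Markdown

messages = [
    {"role": "user", "content": "What is the derivative of sin(x^2)? Explain step by step."},
]

result = pipe(messages)  # uses the pipeline created above

# result[0]["generated_text"] holds the chat history; the last entry is the
# assistant's reply (reasoning block followed by the final answer).
Markdown(result[0]["generated_text"][-1]["content"])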

Phi-4-Reasoning: Hands-On Applications

We will now test the Phi-4-Reasoning model on two tasks involving logical thinking and reasoning. Let’s start.

Task 1: Logical Thinking

Input:

messages = [
    {"role": "user", "content": """A team is to be selected from among ten persons — A, B, C, D, E, F, G, H, I and J — subject to the following conditions.
Exactly two among E, J, I and C must be selected.
If F is selected, then J cannot be selected.
Exactly one among A and C must be selected.
Unless A is selected, E cannot be selected.
If and only if G is selected, D must not be selected.
If D is not selected, then H must be selected.
The size of a team is defined as the number of members in the team. In how many ways can the team of size 6 be selected, if it includes E? And what is the largest possible size of the team?"""},
]

Output:

Markdown(pipe(messages)[0]["generated_text"][1]["content"])

The model thinks thoroughly and does a great job of breaking the entire problem into small steps. The problem consists of two parts; within the given token window, the model answered the first part but could not generate an answer for the second. What was interesting was its approach: it started by understanding the question, mapped out all the possibilities, and then worked through each part, sometimes repeating logic it had already established.

Task 2: Explain the Working of LLMs to an 8-Year-Old Kid

Input:

messages = [
    {"role": "user", "content": """Explain How LLMs works by comparing their working to the photosynthesis process in a plant so that an 8 year old kid can actually understand"""},
]

Output:

Markdown(pipe(messages)[0]["generated_text"][1]["content"])

The model hallucinates a bit at the start of its response, but it eventually produces an answer with a good analogy between how LLMs work and the photosynthesis process. It keeps the language simple and adds a disclaimer at the end.

Phi-4 Reasoning vs o3-mini: Comparison

In the last section, we saw how the Phi-4-Reasoning model performs while dealing with complex problems. Now let’s compare its performance against OpenAI’s o3-mini. To do this, let’s test the output generated by the two models for the same task.

Phi-4-Reasoning

Input:

from IPython.display import Markdown

messages = [
    {"role": "user", "content": """Suppose players A and B are playing a game with fair coins. To begin the game A and B
both flip their coins simultaneously. If A and B both get heads, the game ends. If A and B both get tails, they both
flip again simultaneously. If one player gets heads and the other gets tails, the player who got heads flips again until he
gets tails, at which point the players flip again simultaneously. What is the expected number of flips until the game ends?"""},
]

Output = pipe(messages)

Output:

Markdown(Output[0]["generated_text"][1]["content"])

[Screenshot of the Phi-4-Reasoning output]

o3-mini

Input:

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

response = client.responses.create(
    model="o3-mini",
    input="""Suppose players A and B are playing a game with fair coins. To begin the game A and B
both flip their coins simultaneously. If A and B both get heads, the game ends. If A and B both get tails, they both
flip again simultaneously. If one player gets heads and the other gets tails, the player who got heads flips again until he
gets tails, at which point the players flip again simultaneously. What is the expected number of flips until the game ends?""",
)

Output:

print(response.output_text)

[Screenshot of the o3-mini output]

To check the detailed output you can refer to the following Github link.

Result Evaluation

Both models give accurate answers. Phi-4-Reasoning breaks the problem into many detailed steps and thinks through each one before reaching the final answer. o3-mini, on the other hand, combines its thinking and final response more smoothly, making the output clear and ready to use. Its answers are also more concise and direct.

Applications of Phi-4-Reasoning Models

The Phi-4-Reasoning models open up a world of possibilities. Developers can use them to build intelligent systems for different industries. Here are a few areas where the Phi-4-Reasoning models can truly excel:

  • Their strong performance in coding benchmarks (like LiveCodeBench) suggests applications in code generation, debugging, algorithm design, and automated software development.
  • Their ability to generate detailed reasoning chains makes them well-suited for answering complex questions that require multi-step inference and logical deduction.
  • The models’ abilities in planning tasks could be leveraged in logistics, resource management, game-playing, and autonomous systems requiring sequential decision-making.
  • The models can also contribute to designing systems in robotics, autonomous navigation, and tasks involving the interpretation and manipulation of spatial relationships.

Conclusion

The Phi-4-Reasoning models are open-weight and built to compete with leading reasoning models like DeepSeek R1 and OpenAI’s o3-mini. Since they are not instruction-tuned, their answers may not always follow a clear, structured format like some popular models, but this can improve over time or with custom fine-tuning. Microsoft’s new models are powerful reasoning tools with strong performance, and they’re only going to get better from here.
