Microsoft's 2.7-billion-parameter small model runs on mobile phones and outperforms larger models
Microsoft CEO Nadella announced at the Ignite conference last month that the small-scale Phi-2 model would be fully open-sourced, with significantly improved performance in common sense reasoning, language understanding, and logical reasoning.
Today, Microsoft announced more details of the Phi-2 model, along with the new prompting technique promptbase. With only 2.7 billion parameters, the model outperforms Llama2 7B, Llama2 13B, and Mistral 7B, and narrows the gap with (or even surpasses) Llama2 70B on most common sense reasoning, language understanding, mathematics, and coding tasks.
At the same time, the small Phi-2 can run on mobile devices such as laptops and mobile phones. Nadella said that Microsoft is very happy to share its best-in-class small language model (SLM) and SOTA prompting technique with researchers and developers.
In June this year, Microsoft published a paper titled "Textbooks Are All You Need", which used only 7B tokens of "textbook quality" data to train a model with 1.3B parameters, namely phi-1. Despite having a dataset and model size orders of magnitude smaller than those of its competitors, phi-1 achieves a first-attempt pass rate (pass@1) of 50.6% on HumanEval and 55.5% on MBPP. phi-1 showed that high-quality "small data" can also yield good model performance.
In September, Microsoft followed up with "Textbooks Are All You Need II: phi-1.5 Technical Report", which further explored the potential of high-quality "small data". The report proposes Phi-1.5, a 1.3-billion-parameter model suited to question answering, coding, and other scenarios.
Now Phi-2, with 2.7 billion parameters, once again delivers excellent reasoning and language understanding from a "small body", demonstrating SOTA performance among base language models with fewer than 13 billion parameters. Thanks to innovations in model scaling and training data curation, Phi-2 matches or exceeds models up to 25 times its size on complex benchmarks.
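For readers unfamiliar with the "first-time pass rate" (pass@1) reported above, the following Python sketch shows how such a rate is commonly estimated from sampled completions. It assumes the unbiased pass@k estimator widely used with HumanEval-style benchmarks; the function names and the per-problem counts are illustrative and are not taken from the phi-1 paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: n samples were generated, c passed the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over all problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical per-problem results: (samples generated, samples that passed)
example = [(10, 5), (10, 0), (10, 10), (10, 1)]
score = benchmark_pass_at_k(example, k=1)
```

For k = 1 the estimator reduces to the fraction of passing samples per problem, so a benchmark score of 50.6% can be read as "on average, about half of single attempts pass the tests".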
Microsoft says Phi-2 will be an ideal model for researchers exploring interpretability, working on safety improvements, or running fine-tuning experiments on a variety of tasks. Microsoft has made Phi-2 available in the Azure AI Studio model catalog to facilitate language model development.
Language models have grown to hundreds of billions of parameters, which has unlocked many new capabilities and redefined the landscape of natural language processing. But a question remains: can these new capabilities also be achieved in smaller models through training strategy choices (such as data selection)?
Microsoft's answer is the Phi series, which trains small language models to reach performance similar to that of large models. Phi-2 breaks with conventional language model scaling laws in two respects.
First, the quality of training data plays a crucial role in model performance. Microsoft takes this insight to the extreme by focusing on "textbook quality" data. The training data includes synthetic datasets created specifically to teach the model common sense reasoning and general knowledge, covering areas such as science, daily activities, and theory of mind. The training corpus is further expanded with carefully selected web data that is filtered for educational value and content quality.
Second, Microsoft uses innovative techniques to scale up: starting from Phi-1.5 with 1.3 billion parameters, its knowledge was embedded into Phi-2 with 2.7 billion parameters. This scaled knowledge transfer accelerates training convergence and significantly improves Phi-2's benchmark scores.
The following figure compares Phi-2 with Phi-1.5. All tasks are evaluated 0-shot, except BBH (3-shot CoT) and MMLU (5-shot).
Phi-2 is a Transformer-based model trained with a next-word prediction objective. It was trained on synthetic and web datasets, using 96 A100 GPUs, and training took 14 days.
Phi-2 is a base model: it has not been aligned through reinforcement learning from human feedback (RLHF), nor has it been instruction fine-tuned. Despite this, Phi-2 behaves better with respect to toxicity and bias than existing open-source models that have gone through alignment, as shown in Figure 3 below.
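To make the next-word prediction objective concrete, here is a minimal pure-Python sketch of the loss such a model is typically trained to minimize: the average cross-entropy between the model's scores at each position and the token that actually comes next. The function name, the toy three-token vocabulary, and the scores are illustrative assumptions; this is not Microsoft's training code.

```python
import math

def next_token_loss(logits, targets):
    """Mean cross-entropy of the true next tokens.

    logits:  list of per-position score lists (one score per vocabulary entry)
    targets: list of token ids; targets[i] is the token that follows position i
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]  # negative log-probability of the target
    return total / len(targets)

# Toy vocabulary of 3 tokens, 2 positions (hypothetical values)
uniform_loss = next_token_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [1, 2])
confident_loss = next_token_loss([[0.0, 10.0, 0.0], [0.0, 0.0, 10.0]], [1, 2])
```

A model that spreads its probability evenly over the vocabulary gets a loss of log(3) here, while a model that puts almost all its probability on the correct next token gets a loss close to zero.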
First, the study compared Phi-2 with common language models on academic benchmarks covering multiple categories, such as common sense reasoning, language understanding, mathematics, and coding.
Although Phi-2 has only 2.7 billion parameters, it surpasses the Mistral and Llama2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, Phi-2 performs better on multi-step reasoning tasks (i.e. coding and mathematics) than the Llama2-70B model, which is 25 times larger.
In addition, despite its smaller size, Phi-2's performance is comparable to that of the recently released Gemini Nano 2.
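The "25x" figure can be checked with simple arithmetic. The sketch below assumes the nominal parameter counts implied by the model names (for example, 7 billion for "7B"); the real counts differ slightly.

```python
# Nominal sizes in billions of parameters, read off the model names (an assumption)
PHI2_PARAMS_B = 2.7
others = {"Llama2-7B": 7, "Mistral-7B": 7, "Llama2-13B": 13, "Llama2-70B": 70}

# How many times larger each model is than Phi-2
ratios = {name: size / PHI2_PARAMS_B for name, size in others.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the size of Phi-2")
```

Under these nominal sizes, Llama2-70B comes out at roughly 26 times the size of Phi-2, consistent with the article's "25x".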
Since many public benchmarks may have leaked into training data, the research team believes that the best way to test a language model's performance is on concrete use cases. The study therefore evaluated Phi-2 on multiple Microsoft-internal proprietary datasets and tasks and again compared it with Mistral and Llama-2. On average, Phi-2 outperformed Mistral-7B, and Mistral-7B outperformed the Llama2 models (7B, 13B, 70B).
The research team also tested Phi-2 extensively on prompts commonly used by the research community, and Phi-2 behaved as expected. For example, for a prompt used to evaluate a model's ability to solve physics problems (recently used to evaluate the Gemini Ultra model), Phi-2 gave the following results: