


The highest AI score in history, Google’s new model has just passed the US Medical Licensing Examination Verification!
Moreover, it is directly comparable to the level of human doctors in tasks such as scientific knowledge, understanding, retrieval and reasoning abilities. In some clinical question and answer performances, it surpassed the original SOTA model by more than 17%.
As soon as this development came out, it instantly triggered heated discussions in the academic community. Many people in the industry sighed: Finally, it is here.
After seeing the comparison between Med-PaLM and human doctors, many netizens expressed that they are already looking forward to AI doctors taking up their posts.
Some people also ridiculed the accuracy of this timing, which coincided with everyone thinking that Google would "die" due to ChatGPT.
Let’s see what kind of research this is?
HIGHEST AI SCORES IN HISTORY
Due to the professional nature of medical care, today’s AI models are applied in this field without full use of language to a large extent. Although these models are useful, they have problems such as focusing on single-task systems (such as classification, regression, segmentation, etc.), lack of expressiveness and interactive capabilities.
The breakthrough of large models has brought new possibilities to AI medical care, but due to the particularity of this field, potential harms still need to be considered, such as providing false medical information.
Based on this background, the Google Research and DeepMind teams took medical Q&A as the research object and made the following contributions:
- Proposed a medical Q&A benchmark MultiMedQA, including medical examinations , medical research and consumer medicine issues;
- evaluated PaLM and the fine-tuned variant Flan-PaLM on MultiMedQA;
- proposed command prompt x adjustments to further integrate Flan-PaLM with medicine , resulting in Med-PaLM.
They believe that the task of "answering medical questions" is very challenging because to provide high-quality answers, AI needs to understand the medical background and recall appropriately of medical knowledge and make inferences about expert information.
Existing evaluation benchmarks are often limited to evaluating classification accuracy or natural language generation indicators, but cannot provide detailed analysis of actual clinical applications.
First, the team proposed a benchmark consisting of 7 medical question answering data sets.
Includes 6 existing datasets, which also include MedQA (USMLE, United States Medical Licensing Examination questions), and also introduces their own new dataset HealthSearchQA, which consists of searched health questions.
This includes medical examinations, medical research, and consumer medicine issues.
Then, the team used MultiMedQA to evaluate PaLM (540 billion parameters) and the variant Flan-PaLM with fine-tuned instructions. For example, by expanding the number of tasks, model size and the strategy of using thinking chain data.
FLAN is a fine-tuned language network proposed by Google Research last year. It fine-tunes the model to make it more suitable for general NLP tasks and uses instruction adjustments to train the model.
It was found that Flan-PaLM achieved optimal performance on several benchmarks, such as MedQA, MedMCQA, PubMedQA and MMLU. In particular, the MedQA (USMLE) data set outperformed the previous SOTA model by more than 17%.
In this study, three PaLM and Flan-PaLM model variants of different sizes were considered: 8 billion parameters, 62 billion parameters, and 540 billion parameters.
However, Flan-PaLM still has certain limitations and does not perform well in dealing with consumer medical issues.
In order to solve this problem and make Flan-PaLM more suitable for the medical field, they adjusted the instruction prompts, resulting in the Med-PaLM model.
△Example: How long does it take for neonatal jaundice to disappear?
The team first randomly selected some examples from the MultiMedQA free-answer data set (HealthSearchQA, MedicationQA, LiveQA).
Then have groups of 5 clinicians provide model answers. These clinicians are located in the United States and United Kingdom and have expertise in primary care, surgery, internal medicine, and pediatrics. Finally, 40 examples were left in HealthSearchQA, MedicationQA and LiveQA for instruction prompt tuning training.
Multiple tasks are close to the level of human doctors
In order to verify the final effect of Med-PaLM, the researchers extracted 140 samples from the MultiMedQA mentioned above consumer medical issues.
100 of them are from the HealthSearchQA data set, 20 are from the LiveQA data set, and 20 are from the MedicationQA data set.
It is worth mentioning that this does not include the issues originally used to adjust the instruction prompts to generate Med-PaLM.
They asked Flan-PaLM and Med-PaLM to generate answers to these 140 questions, and then invited a group of professional clinicians to answer them.
As an example, when asked "What does it mean to have severe ear pain?" Med-PaLM will not only list the diseases that the patient may be infected with, but also suggest if there are the following phenomena: You should go to the doctor.
Ear pain can be a sign of several underlying conditions, including: middle ear infection (otitis media), outer ear infection (ear infection), and earwax impaction. It can also be a sign of a more serious condition, such as a brain tumor or stroke.
If you have severe ear pain that lasts for more than a few days, or if you have other symptoms that accompany ear pain, such as dizziness, fever, facial weakness, or numbness, you should see a doctor for evaluation. A doctor can determine the cause of the pain and provide appropriate treatment.
In this way, the researchers anonymously gave these three sets of answers to nine clinicians from the United States, the United Kingdom, and India for evaluation.
The results show that in terms of scientific common sense, both Med-PaLM and human doctors achieved an accuracy of more than 92%, while the corresponding figure for Flan-PaLM was 61.9%.
In terms of understanding, retrieval and reasoning capabilities, in general, Med-PaLM has almost reached the level of human doctors, with little difference between the two, while Flan-PaLM also performs the same Bottom.
In terms of completeness of answers, although Flan-PaLM’s answer is considered to have missed 47.2% of important information, Med-PaLM’s answer has significantly improved, with only 15.1% of the answers were considered to be missing information, further shortening the distance with human doctors.
However, although there is less missing information, longer answers also mean an increased risk of introducing incorrect content. The proportion of incorrect content in Med-PaLM’s answers It reached 18.7%, the highest among the three.
Taking into account the possible harm of the answers, 29.7% of Flan-PaLM answers were considered to be potentially harmful; for Med-PaLM, this number dropped to 5.9%. Human doctors were the lowest at 5.7%.
In addition to this, Med-PaLM outperformed human doctors on bias in medical demographics, with the only instances of bias in Med-PaLM’s answers. There was 0.8%, compared to 1.4% for human doctors and 7.9% for Flan-PaLM.
Finally, the researchers also invited five non-professional users to evaluate the practicality of these three sets of answers. Only 60.6% of Flan-PaLM's answers were considered helpful, the number increased to 80.3% for Med-PaLM, and the highest was 91.1% for human doctors.
Summarizing all the above evaluations, it can be seen that the adjustment of instruction prompts has a significant effect on improving performance. Among 140 consumer medical problems, Med-PaLM’s performance almost caught up with to the level of human doctors.
The team behind
The research team of this paper comes from Google and DeepMind.
After Google Health was exposed to large-scale layoffs and reorganization last year, this can be said to be their major launch in the medical field.
Even Jeff Dean, the head of Google AI, came out to stand and expressed his strong recommendation!
Some people in the industry also praised after reading:
Clinical knowledge is a complex field, and there is often no obvious correct answer. And there needs to be a conversation with the patient.
This time Google DeepMind’s new model is a perfect application of LLM.
#It is worth mentioning that another team just passed the USMLE some time ago.
Counting further ahead, a wave of large models such as PubMed GPT, DRAGON, and Meta’s Galactica emerged this year, repeatedly setting new records in professional exams.
#Medical AI is so prosperous that it’s hard to imagine that it was a bad news last year. At that time, Google’s innovative business related to medical AI had never started.
In June last year, it was exposed by the American media BI that it was in crisis and had to undergo large-scale layoffs and reorganization. When the Google Health department was first established in November 2018, it was very prosperous.
It’s not just Google. The medical AI business of other well-known technology companies has also experienced restructuring and acquisitions.
After reading the large medical model released by Google DeepMind, are you optimistic about the development of medical AI?
Paper address: https://arxiv.org/abs/2212.13138
Reference link: https://twitter.com/vivnat/status/1607609299894947841
The above is the detailed content of The highest AI score in history! Google's large model sets a new record for U.S. medical license test questions, and the level of scientific knowledge is comparable to that of human doctors. For more information, please follow other related articles on the PHP Chinese website!

Harness the Power of On-Device AI: Building a Personal Chatbot CLI In the recent past, the concept of a personal AI assistant seemed like science fiction. Imagine Alex, a tech enthusiast, dreaming of a smart, local AI companion—one that doesn't rely

Their inaugural launch of AI4MH took place on April 15, 2025, and luminary Dr. Tom Insel, M.D., famed psychiatrist and neuroscientist, served as the kick-off speaker. Dr. Insel is renowned for his outstanding work in mental health research and techno

"We want to ensure that the WNBA remains a space where everyone, players, fans and corporate partners, feel safe, valued and empowered," Engelbert stated, addressing what has become one of women's sports' most damaging challenges. The anno

Introduction Python excels as a programming language, particularly in data science and generative AI. Efficient data manipulation (storage, management, and access) is crucial when dealing with large datasets. We've previously covered numbers and st

Before diving in, an important caveat: AI performance is non-deterministic and highly use-case specific. In simpler terms, Your Mileage May Vary. Don't take this (or any other) article as the final word—instead, test these models on your own scenario

Building a Standout AI/ML Portfolio: A Guide for Beginners and Professionals Creating a compelling portfolio is crucial for securing roles in artificial intelligence (AI) and machine learning (ML). This guide provides advice for building a portfolio

The result? Burnout, inefficiency, and a widening gap between detection and action. None of this should come as a shock to anyone who works in cybersecurity. The promise of agentic AI has emerged as a potential turning point, though. This new class

Immediate Impact versus Long-Term Partnership? Two weeks ago OpenAI stepped forward with a powerful short-term offer, granting U.S. and Canadian college students free access to ChatGPT Plus through the end of May 2025. This tool includes GPT‑4o, an a


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Safe Exam Browser
Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

WebStorm Mac version
Useful JavaScript development tools

SAP NetWeaver Server Adapter for Eclipse
Integrate Eclipse with SAP NetWeaver application server.

MinGW - Minimalist GNU for Windows
This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

Atom editor mac version download
The most popular open source editor