The highest AI score in history! Google's large model sets a new record for U.S. medical license test questions, and the level of scientific knowledge is comparable to that of human doctors

PHPz

Apr 18, 2023 pm 04:49 PM

AI, language model

With the highest AI score to date, Google's new model has just passed a test on US Medical Licensing Examination questions!

Moreover, in scientific knowledge, comprehension, retrieval and reasoning, it is directly comparable to human doctors. On some clinical question-answering tasks, it surpassed the previous SOTA model by more than 17%.

As soon as the work was released, it sparked heated discussion in the academic community. Many people in the industry sighed: it is finally here.

After seeing the comparison between Med-PaLM and human doctors, many netizens said they are already looking forward to AI doctors taking up their posts.

Some people also joked about the timing, which came just as everyone thought Google would "die" because of ChatGPT.

Let's take a look at what kind of research this is.

Highest AI score in history

Because of the specialized nature of medicine, most AI models applied in this field today make little use of language. Although these models are useful, they tend to be single-task systems (classification, regression, segmentation, and so on) and lack expressiveness and interactive capabilities.

The breakthrough of large models has brought new possibilities to medical AI, but given the nature of this field, potential harms still need to be considered, such as generating false medical information.

Against this background, the Google Research and DeepMind teams took medical question answering as their research subject and made the following contributions:

proposed MultiMedQA, a medical question-answering benchmark covering medical exams, medical research and consumer medical questions;
evaluated PaLM and its instruction-tuned variant Flan-PaLM on MultiMedQA;
proposed instruction prompt tuning to further align Flan-PaLM with the medical domain, resulting in Med-PaLM.

They believe that the task of "answering medical questions" is very challenging: to provide high-quality answers, an AI needs to understand the medical context, recall the appropriate medical knowledge, and reason with expert information.

Existing evaluation benchmarks are often limited to classification accuracy or natural language generation metrics, and cannot provide the detailed analysis that real clinical applications require.

First, the team proposed a benchmark consisting of 7 medical question-answering datasets.

It includes 6 existing datasets, among them MedQA (USMLE, United States Medical Licensing Examination questions), and introduces the team's own new dataset, HealthSearchQA, which consists of commonly searched health questions.

Together, these datasets cover medical exams, medical research, and consumer medical questions.
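As a rough illustration of how the benchmark is organised, the sketch below lists the seven datasets named in this article in a small Python structure. The class `MedicalQADataset`, the helper `summarize`, and the `category` and `answer_format` labels for MedMCQA, PubMedQA and MMLU are assumptions made for illustration, not details stated in the article.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MedicalQADataset:
    name: str
    category: str          # "exam", "research" or "consumer"
    answer_format: str     # "multiple_choice" or "free_response"
    is_new: bool = False   # True only for the dataset introduced by the team


# Dataset names come from the article. The category/format labels for
# MedMCQA, PubMedQA and MMLU are illustrative assumptions.
MULTIMEDQA = [
    MedicalQADataset("MedQA (USMLE)", "exam", "multiple_choice"),
    MedicalQADataset("MedMCQA", "exam", "multiple_choice"),
    MedicalQADataset("PubMedQA", "research", "multiple_choice"),
    MedicalQADataset("MMLU", "exam", "multiple_choice"),
    MedicalQADataset("LiveQA", "consumer", "free_response"),
    MedicalQADataset("MedicationQA", "consumer", "free_response"),
    MedicalQADataset("HealthSearchQA", "consumer", "free_response", is_new=True),
]


def summarize(datasets):
    """Return (total, count per category, names of newly introduced datasets)."""
    by_category = Counter(d.category for d in datasets)
    new_names = [d.name for d in datasets if d.is_new]
    return len(datasets), dict(by_category), new_names


total, by_category, new_names = summarize(MULTIMEDQA)
print(total, by_category, new_names)
```

The counts match the description: seven datasets in total, six existing ones plus the newly introduced HealthSearchQA.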

Then, the team used MultiMedQA to evaluate PaLM (540 billion parameters) and its instruction-tuned variant Flan-PaLM. Flan-PaLM is obtained through instruction fine-tuning, for example by scaling up the number of tasks and the model size and by using chain-of-thought data.

FLAN (Finetuned Language Net) is an approach proposed by Google Research last year. It fine-tunes a model on tasks phrased as instructions (instruction tuning), making it better suited to general NLP tasks.

It was found that Flan-PaLM achieved state-of-the-art performance on several benchmarks, such as MedQA, MedMCQA, PubMedQA and MMLU. In particular, on the MedQA (USMLE) dataset it outperformed the previous SOTA model by more than 17%.

In this study, three PaLM and Flan-PaLM model variants of different sizes were considered: 8 billion parameters, 62 billion parameters, and 540 billion parameters.

However, Flan-PaLM still has certain limitations and does not perform well on consumer medical questions.

To solve this problem and make Flan-PaLM better suited to the medical field, they applied instruction prompt tuning, resulting in the Med-PaLM model.

△Example: How long does it take for neonatal jaundice to disappear?

The team first randomly selected some examples from the MultiMedQA free-response datasets (HealthSearchQA, MedicationQA, LiveQA).

A panel of 5 clinicians was then asked to provide exemplar answers. These clinicians are located in the United States and the United Kingdom and have expertise in primary care, surgery, internal medicine, and pediatrics. In the end, 40 examples from HealthSearchQA, MedicationQA and LiveQA were retained for instruction prompt tuning.

Multiple tasks are close to the level of human doctors

To verify the final performance of Med-PaLM, the researchers drew 140 consumer medical questions from the MultiMedQA benchmark described above.

100 of them are from the HealthSearchQA data set, 20 are from the LiveQA data set, and 20 are from the MedicationQA data set.

It is worth mentioning that these do not include the questions originally used for the instruction prompt tuning that produced Med-PaLM.
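The sampling described above can be sketched as follows. This is a minimal illustration only: the question pools, their sizes, the per-dataset split of the tuning exemplars and the helper name `draw_eval_set` are all hypothetical. Only the per-dataset counts (100, 20, 20) and the rule that tuning exemplars are excluded come from the article.

```python
import random

# Per-dataset evaluation sample sizes reported in the article.
EVAL_COUNTS = {"HealthSearchQA": 100, "LiveQA": 20, "MedicationQA": 20}


def draw_eval_set(pools, tuning_ids, counts, seed=0):
    """Sample evaluation questions per dataset, skipping tuning exemplars.

    pools:      dataset name -> list of question IDs (hypothetical data)
    tuning_ids: dataset name -> set of IDs already used for prompt tuning
    counts:     dataset name -> number of questions to draw
    """
    rng = random.Random(seed)
    eval_set = {}
    for name, k in counts.items():
        used = tuning_ids.get(name, set())
        candidates = [q for q in pools[name] if q not in used]
        eval_set[name] = rng.sample(candidates, k)
    return eval_set


# Hypothetical pools of question IDs; the real dataset sizes differ.
pools = {name: list(range(500)) for name in EVAL_COUNTS}

# Hypothetical tuning exemplars: the article gives only the total (40),
# not how they are split across the three datasets.
tuning_ids = {
    "HealthSearchQA": set(range(20)),
    "LiveQA": set(range(10)),
    "MedicationQA": set(range(10)),
}

eval_set = draw_eval_set(pools, tuning_ids, EVAL_COUNTS)
total = sum(len(ids) for ids in eval_set.values())
print(total)
```

Filtering out the tuning exemplars before sampling keeps the evaluation questions disjoint from the examples the model was tuned on.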

They had Flan-PaLM and Med-PaLM each generate answers to these 140 questions, and also invited a group of professional clinicians to answer them.

As an example, when asked "What does it mean to have severe ear pain?", Med-PaLM not only lists the conditions the patient may have, but also advises seeing a doctor if certain symptoms are present.

Ear pain can be a sign of several underlying conditions, including: middle ear infection (otitis media), outer ear infection (otitis externa), and earwax impaction. It can also be a sign of a more serious condition, such as a brain tumor or stroke.

If you have severe ear pain that lasts for more than a few days, or if you have other symptoms that accompany ear pain, such as dizziness, fever, facial weakness, or numbness, you should see a doctor for evaluation. A doctor can determine the cause of the pain and provide appropriate treatment.

The researchers then gave these three sets of answers, anonymized, to nine clinicians from the United States, the United Kingdom, and India for evaluation.

The results show that, in terms of agreement with scientific consensus, both Med-PaLM and human doctors scored above 92%, while the corresponding figure for Flan-PaLM was 61.9%.

In terms of comprehension, retrieval and reasoning, Med-PaLM has on the whole almost reached the level of human doctors, with little difference between the two, while Flan-PaLM again ranked at the bottom.

In terms of completeness, 47.2% of Flan-PaLM's answers were judged to omit important information, whereas Med-PaLM improved significantly, with only 15.1% of its answers judged to be missing information, further narrowing the gap with human doctors.

However, although less information is missing, longer answers also carry a higher risk of introducing incorrect content. The proportion of Med-PaLM's answers containing incorrect content reached 18.7%, the highest among the three.

In terms of possible harm, 29.7% of Flan-PaLM's answers were considered potentially harmful; for Med-PaLM, this number dropped to 5.9%. Human doctors were the lowest at 5.7%.

In addition, Med-PaLM outperformed human doctors on bias across medical demographics: only 0.8% of Med-PaLM's answers showed bias, compared to 1.4% for human doctors and 7.9% for Flan-PaLM.

Finally, the researchers also invited five non-expert users to evaluate the helpfulness of these three sets of answers. Only 60.6% of Flan-PaLM's answers were considered helpful; the figure rose to 80.3% for Med-PaLM, and was highest for human doctors at 91.1%.

Taken together, these evaluations show that instruction prompt tuning significantly improves performance. On the 140 consumer medical questions, Med-PaLM's performance almost caught up with that of human doctors.
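To make the comparison concrete, the sketch below collects the figures quoted above for the three axes where the article gives a number for all three answer sources, and computes how far each model is from the human doctors. The dictionary layout, the axis keys and the helper name `gap_to_clinicians` are illustrative choices, not part of the study.

```python
# Percentages quoted in the article (share of answers rated as such).
RESULTS = {
    "potentially_harmful": {"Flan-PaLM": 29.7, "Med-PaLM": 5.9, "Clinicians": 5.7},
    "shows_bias": {"Flan-PaLM": 7.9, "Med-PaLM": 0.8, "Clinicians": 1.4},
    "helpful": {"Flan-PaLM": 60.6, "Med-PaLM": 80.3, "Clinicians": 91.1},
}


def gap_to_clinicians(results, model):
    """Return, per axis, the model's score minus the clinicians' score
    in percentage points, rounded to one decimal place."""
    return {
        axis: round(scores[model] - scores["Clinicians"], 1)
        for axis, scores in results.items()
    }


med_palm_gap = gap_to_clinicians(RESULTS, "Med-PaLM")
flan_palm_gap = gap_to_clinicians(RESULTS, "Flan-PaLM")

for axis in RESULTS:
    print(f"{axis}: Med-PaLM {med_palm_gap[axis]:+.1f}, Flan-PaLM {flan_palm_gap[axis]:+.1f}")
```

For the harm and bias axes a smaller value is better, while for helpfulness a larger value is better, so the sign of each gap has to be read per axis.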

The team behind

The research team of this paper comes from Google and DeepMind.

After Google Health was reported to be undergoing large-scale layoffs and reorganization last year, this can be seen as its major release in the medical field.

Even Jeff Dean, the head of Google AI, came out in support and strongly recommended it!

Some people in the industry also praised the work after reading it:

Clinical knowledge is a complex field: there is often no obvious correct answer, and it requires a conversation with the patient.

This new model from Google and DeepMind is a perfect application of LLMs.

It is worth mentioning that another team also passed the USMLE not long ago.

Looking further back, a wave of large models such as PubMed GPT, DRAGON, and Meta's Galactica has emerged this year, repeatedly setting new records on professional exams.

Medical AI is now so vibrant that it is hard to believe the news was so bleak last year, when Google's innovative medical AI business never really got off the ground.

In June last year, the American outlet BI reported that Google Health was in crisis and facing large-scale layoffs and reorganization. Yet when the Google Health department was first established in November 2018, it had been thriving.

It is not just Google. The medical AI businesses of other well-known technology companies have also gone through restructuring and acquisitions.

After reading about the large medical model released by Google and DeepMind, are you optimistic about the development of medical AI?

Paper address: https://arxiv.org/abs/2212.13138

Reference link: https://twitter.com/vivnat/status/1607609299894947841

Statement

This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to have it removed.
