Specific pre-trained models for the biomedical NLP domain: PubMedBERT

The rapid development of large language models this year means that models like BERT are now often called "small" models. Yet in Kaggle's LLM Science Exam competition, a team using DeBERTa finished fourth, an excellent result. So for a specific domain or task, a large language model is not necessarily the best solution, and small models still have their place. Today we introduce PubMedBERT, from a paper published by Microsoft Research in an ACM journal in 2022; the model pre-trains BERT from scratch on domain-specific corpora.

Here are the main takeaways from the paper:

For domains with large amounts of unlabeled text, such as biomedicine, pre-training a language model from scratch is more effective than continued pre-training of a general-domain language model. To evaluate such domain-specific pre-training, the authors propose the Biomedical Language Understanding and Reasoning Benchmark (BLURB).

PubMedBERT

1. Domain-specific Pretraining

Research shows that domain-specific pretraining from scratch greatly outperforms continued pretraining of a general-domain language model, demonstrating that the prevailing assumption behind mixed-domain pretraining does not always hold.
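Because the model is pre-trained from scratch, the vocabulary itself is also built from in-domain text rather than inherited from general-domain BERT. Below is a minimal sketch of that step using the Hugging Face tokenizers library; the corpus file name and vocabulary size are placeholders, not the paper's actual settings.

```python
# Sketch: build an in-domain WordPiece vocabulary from scratch, as PubMedBERT
# does with PubMed text instead of reusing the general-domain BERT vocabulary.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)

# Train a fresh vocabulary on domain text only; frequent biomedical terms then
# become single tokens instead of being split into many generic sub-pieces.
tokenizer.train(
    files=["pubmed_abstracts.txt"],  # placeholder corpus file, not the real 21GB corpus
    vocab_size=30522,                # placeholder, same size as the original BERT vocab
    min_frequency=2,
)

tokenizer.save_model(".", "pubmedbert-vocab")
```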

2. Model

The model is standard BERT. For the masked language model (MLM) objective, whole word masking (WWM) is used, so all sub-word pieces of a word are masked together.
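A minimal sketch of what whole-word masking looks like in practice, using DataCollatorForWholeWordMask from Hugging Face transformers. The bert-base-uncased tokenizer here is only a stand-in; PubMedBERT's own in-domain tokenizer would be used in real pretraining.

```python
# Sketch: whole-word masking (WWM) for MLM pretraining.
from transformers import BertTokenizerFast, DataCollatorForWholeWordMask

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

text = "Lymphoma was treated with rituximab."
encoding = tokenizer(text)

# The collator masks all sub-word pieces of a chosen word together,
# e.g. "ritux" and "##imab" are masked as one unit, not independently.
batch = collator([encoding["input_ids"]])
print(batch["input_ids"])
print(batch["labels"])
```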

3. BLURB data set

According to the authors, BLUE [45] was the first attempt to create an NLP benchmark for the biomedical field, but its coverage is limited. For biomedical applications based on PubMed, the authors therefore propose the Biomedical Language Understanding and Reasoning Benchmark (BLURB).

PubMedBERT uses a larger domain-specific corpus (21GB).
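The released checkpoint can be loaded directly and fine-tuned on a BLURB-style task. The sketch below assumes the model id published on the Hugging Face Hub (it may have since been renamed) and uses an arbitrary binary classification head as a placeholder.

```python
# Sketch: load the released PubMedBERT checkpoint for a downstream
# sentence-classification task (the classification head is untrained here).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # Hub id at time of writing

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer(
    "EGFR mutations predict response to gefitinib in non-small-cell lung cancer.",
    return_tensors="pt",
)
outputs = model(**inputs)    # logits are random until the head is fine-tuned
print(outputs.logits.shape)  # torch.Size([1, 2])
```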

4. Results

On most biomedical natural language processing (NLP) tasks, PubMedBERT consistently outperforms all other BERT models, often by a clear margin.
