Specific pre-trained models for the biomedical NLP domain: PubMedBERT
The rapid development of large language models this year has led to models like BERT now being called "small" models. Yet in Kaggle's LLM Science Exam competition, a team using DeBERTa took fourth place, an excellent result. So for a specific domain or need, a large language model is not necessarily the best solution, and small models still have their place. The model introduced here is PubMedBERT, presented in a paper published by Microsoft Research at ACM in 2022. It pre-trains BERT from scratch on domain-specific corpora.
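For orientation before the takeaways, here is a minimal sketch of loading a PubMedBERT checkpoint with the Hugging Face transformers library and probing it with a fill-mask query. The model id and the example sentence are assumptions rather than details from the paper, so verify the exact checkpoint name on the Hub before use.

```python
# Minimal sketch: load an assumed PubMedBERT checkpoint and run fill-mask
# on a biomedical sentence to inspect its domain-specific vocabulary.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed Hub id; hosted checkpoint names can change, so verify it first.
model_id = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in fill("The patient was treated with [MASK] for hypertension."):
    print(pred["token_str"], round(pred["score"], 3))
```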
Here are the main takeaways from the paper:
For specific domains with large amounts of unlabeled text, such as biomedicine, pretraining a language model from scratch is more effective than continual pretraining of a general-domain language model. To support domain-specific pretraining research, the authors propose the Biomedical Language Understanding and Reasoning Benchmark (BLURB).
The study shows that domain-specific pretraining from scratch substantially outperforms continual pretraining of general-domain language models, demonstrating that the prevailing assumption in favor of mixed-domain pretraining does not always hold.
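To make that distinction concrete, the sketch below contrasts the two setups under simple assumptions; it is not the paper's training code. From-scratch pretraining learns an in-domain WordPiece vocabulary and randomly initializes BERT, while continual (mixed-domain) pretraining starts from a general-domain checkpoint and keeps its original vocabulary. The toy sentences stand in for the actual PubMed corpus.

```python
# Illustrative sketch (not the paper's training code) contrasting the two
# pretraining setups the paper compares.
from tokenizers import BertWordPieceTokenizer
from transformers import BertConfig, BertForMaskedLM

# Toy in-domain sentences standing in for the full PubMed corpus.
domain_corpus = [
    "the patient presented with acute lymphoblastic leukemia",
    "immunohistochemistry confirmed the diagnosis of lymphoma",
]

# 1) From-scratch domain-specific pretraining: learn an in-domain WordPiece
#    vocabulary, then randomly initialize BERT with that vocabulary size.
domain_tokenizer = BertWordPieceTokenizer(lowercase=True)
domain_tokenizer.train_from_iterator(domain_corpus, vocab_size=1000)
scratch_model = BertForMaskedLM(BertConfig(vocab_size=domain_tokenizer.get_vocab_size()))

# 2) Continual (mixed-domain) pretraining: start from a general-domain
#    checkpoint, keep its original vocabulary, and keep training on
#    biomedical text with the same MLM objective.
continual_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
```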
When training BERT with the masked language modeling (MLM) objective, whole word masking (WWM) requires that all subword pieces of a word be masked together, rather than masking individual subword tokens.
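The short sketch below illustrates whole word masking with transformers' `DataCollatorForWholeWordMask`: when any subword piece of a word is chosen for masking, every `##`-prefixed piece of that word is masked with it. The tokenizer and example sentence are illustrative choices, not taken from the paper.

```python
# Sketch of whole word masking with a BERT-style "##" subword vocabulary.
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative tokenizer
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.3)

text = "lymphoma was diagnosed by immunohistochemistry"
input_ids = tokenizer(text)["input_ids"]

# Words split into several "##" pieces are masked or kept as a whole,
# never partially masked.
batch = collator([{"input_ids": input_ids}])
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))
```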
According to the authors, BLUE [45] was the first attempt to create an NLP benchmark in the biomedical field, but its coverage is limited. For PubMed-based biomedical applications, the authors therefore propose the Biomedical Language Understanding and Reasoning Benchmark (BLURB).
PubMedBERT uses a larger domain-specific corpus (21GB).
On most biomedical natural language processing (NLP) tasks, PubMedBERT consistently outperforms all other BERT models, often by a clear margin.
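As a closing illustration, here is a hypothetical fine-tuning sketch for a biomedical sentence classification task in the style of the BLURB evaluations. The checkpoint id, toy texts, and labels are all stand-ins, not an actual BLURB data loader.

```python
# Hypothetical fine-tuning step: PubMedBERT as a sentence classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy labeled sentences standing in for a real biomedical dataset.
texts = [
    "aspirin reduces the risk of myocardial infarction",
    "the protein sequence was deposited in genbank",
]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss over two classes
outputs.loss.backward()
optimizer.step()
```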