


The largest protein language model to date has been released!
A year ago, DeepMind’s open-sourced AlphaFold2 landed in Nature and Science, taking the biology and AI research communities by storm.
A year later, Meta countered with ESMFold, which is an order of magnitude faster.
It is not just fast: the model also packs 15 billion parameters.
LeCun tweeted to praise this as a great new achievement by the Meta-FAIR protein team.
Co-author Zeming Lin revealed that the 3-billion-parameter model was trained on 256 GPUs for 3 weeks, while ESMFold took 10 days on 128 GPUs. As for the 15-billion-parameter version, it is still unclear.
He also said that the code will definitely be open sourced later, so stay tuned!
Big and fast!
Today, our protagonist is ESMFold, a model that directly predicts high-accuracy, end-to-end, atomic-level structure from individual protein sequences.
Paper address: https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1
The benefits of 15 billion parameters
Needless to say, today’s large models can be trained to predict the three-dimensional structure of proteins with atomic-level accuracy.
In terms of accuracy, ESMFold is similar to AlphaFold2 and RoseTTAFold.
However, ESMFold’s inference speed is an order of magnitude faster than AlphaFold2!
Talking about orders of magnitude can make the three-way speed comparison hard to grasp; the picture below makes it clear at a glance.
What’s the difference?
Although AlphaFold2 and RoseTTAFold have achieved breakthrough success on the problem of atomic-resolution structure prediction, they rely on multiple sequence alignments (MSAs) and templates of similar protein structures for optimal performance.
In contrast, by leveraging the internal representation of the language model, ESMFold can generate corresponding structure predictions using only one sequence as input, thus greatly speeding up structure prediction.
The researchers found that ESMFold’s predictions for low-perplexity sequences were comparable to those of current state-of-the-art models.
Moreover, structure prediction accuracy is closely tied to the perplexity of the language model. That is to say, when the language model understands a sequence better, it can predict the structure better.
Currently, there are billions of protein sequences of unknown structure and function, many of which are derived from metagenomic sequencing.
Using ESMFold, researchers can fold a random sample of 1 million metagenomic sequences in just 6 hours.
A large proportion of these predictions have high confidence yet are unlike any known structure, with no matching records in existing databases.
Researchers believe that ESMFold can help understand protein structures that are beyond current understanding.
Additionally, because ESMFold’s predictions are an order of magnitude faster than those of existing models, researchers can use it to help close the gap between the rapidly growing protein sequence databases and the slower-growing databases of protein structure and function.
15 billion parameter protein language model
Next let’s talk about Meta’s new ESMFold in detail.
ESM-2 is a Transformer-based language model that uses an attention mechanism to learn the interaction patterns between pairs of amino acids in the input sequence.
Compared with the previous generation model ESM-1b, Meta has improved the model structure and training parameters, and added computing resources and data. At the same time, the addition of relative position embedding enables the model to be generalized to sequences of any length.
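As a rough illustration of the mechanism described above (not Meta’s code), the sketch below implements single-head self-attention over amino-acid tokens with a clipped relative-position bias. All dimensions, weights, and the toy vocabulary are invented for illustration; the real ESM-2 uses many layers, heads, and learned parameters.

```python
# Minimal sketch: self-attention over a protein sequence with a
# relative-position bias. Because the bias depends only on the offset
# i - j, the same parameters apply to sequences of any length.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
D_MODEL = 8                            # toy embedding size
MAX_REL = 4                            # clip relative offsets to [-4, 4]

rng = np.random.default_rng(0)
embed = rng.normal(size=(len(AMINO_ACIDS), D_MODEL))   # token embeddings
Wq, Wk, Wv = (rng.normal(size=(D_MODEL, D_MODEL)) for _ in range(3))
rel_bias = rng.normal(size=(2 * MAX_REL + 1,))         # one bias per offset

def attend(sequence: str) -> np.ndarray:
    """Return contextual residue representations of shape (L, D_MODEL)."""
    ids = [AMINO_ACIDS.index(a) for a in sequence]
    x = embed[ids]                                     # (L, D)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D_MODEL)                # pairwise interactions
    L = len(ids)
    offsets = np.clip(np.subtract.outer(range(L), range(L)), -MAX_REL, MAX_REL)
    scores = scores + rel_bias[offsets + MAX_REL]      # relative-position bias
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over positions
    return weights @ v

out = attend("MKTAYIAK")
print(out.shape)  # (8, 8)
```

Each row of the attention matrix says how strongly one residue attends to every other residue; it is these learned interaction patterns that the structure prediction head later exploits.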
From the results, the ESM-2 model with 150 million parameters performed better than the ESM-1b model with 650 million parameters.
In addition, ESM-2 surpasses other protein language models on structure prediction benchmarks, an improvement consistent with established patterns in large-scale language modeling.
As ESM-2 scales up, large gains in language-modeling accuracy can be observed.
End-to-end single sequence structure prediction
A key difference between ESMFold and AlphaFold2 is that ESMFold uses the language model’s representations, which removes the need for explicit homologous sequences (in the form of an MSA) as input.
ESMFold simplifies the Evoformer of AlphaFold2, replacing the computationally expensive network modules that process the MSA with a Transformer module that processes a single sequence. This simplification means that ESMFold is significantly faster than MSA-based models.
The output of the folding trunk is then processed by a structure module, which is responsible for producing the final atomic-level structure and prediction confidences.
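The data flow just described can be sketched as below. Every module here is a stub with random weights (the real ESM-2, folding trunk, and structure module are large trained networks), so only the shapes and the sequence → language model → folding trunk → structure module pipeline reflect the text:

```python
# Schematic of the ESMFold pipeline -- NOT the real implementation.
# Internals are stand-ins; only the data flow follows the description.
import numpy as np

rng = np.random.default_rng(1)

def language_model(seq: str, d: int = 16) -> np.ndarray:
    """Stand-in for ESM-2: per-residue representations, no MSA needed."""
    return rng.normal(size=(len(seq), d))

def folding_trunk(feats: np.ndarray) -> np.ndarray:
    """Stand-in for the simplified Evoformer: sequence features only."""
    w = rng.normal(size=(feats.shape[1], feats.shape[1]))
    return np.tanh(feats @ w)

def structure_module(trunk_out: np.ndarray):
    """Emit backbone coordinates (L, 3) and a per-residue confidence."""
    coords = trunk_out @ rng.normal(size=(trunk_out.shape[1], 3))
    confidence = 1.0 / (1.0 + np.exp(-trunk_out.mean(axis=1)))  # in (0, 1)
    return coords, confidence

seq = "MKTAYIAKQR"
coords, conf = structure_module(folding_trunk(language_model(seq)))
print(coords.shape, conf.shape)  # (10, 3) (10,)
```

Because nothing in this path searches a sequence database, inference cost scales only with the length of the single input sequence, which is where the order-of-magnitude speedup comes from.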
Researchers compared ESMFold with AlphaFold2 and RoseTTAFold on the CAMEO (April 2022 to June 2022) and CASP14 (May 2020) test sets.
When only a single sequence is given as input, ESMFold performs much better than AlphaFold2.
When using the complete pipeline, AlphaFold2 achieved 88.3 and 84.7 on CAMEO and CASP14 respectively. ESMFold achieves accuracy comparable to RoseTTAFold on CAMEO, with an average TM-score of 82.0.
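For reference, the TM-score used in these comparisons is the standard template-modeling score of Zhang and Skolnick. The helper below is a minimal sketch that assumes a fixed residue-to-residue alignment and takes pre-computed Cα distances in ångströms (real evaluations also search over superpositions):

```python
# Standard TM-score definition: mean of 1 / (1 + (d_i / d0)^2) over
# aligned residues, normalized by the target length, with the
# length-dependent scale d0 = 1.24 * (L - 15)^(1/3) - 1.8.
def tm_score(distances, l_target):
    """TM-score in (0, 1]; > 0.5 roughly indicates the same fold."""
    assert l_target > 21, "d0 formula applies to chains longer than 21 residues"
    d0 = 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target

# A perfect superposition of a 100-residue target scores 1.0.
print(tm_score([0.0] * 100, 100))  # 1.0
```

Note that the 88.3 / 84.7 / 82.0 figures quoted above appear to be TM-scores multiplied by 100.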
Conclusion
The researchers found that a language model trained with an unsupervised objective on a large, evolutionarily diverse database of protein sequences can predict protein structures at atomic-level resolution.
By scaling the language model to 15 billion parameters, the effect of scale on protein structure learning can be studied systematically.
We saw that structure prediction accuracy improves nonlinearly as a function of model size, and observed a strong connection between how well a language model understands a sequence and the accuracy of its structure predictions.
The models of the ESM-2 series are the largest protein language models trained to date, with only an order of magnitude fewer parameters than the largest recently developed text models.
Moreover, ESM-2 is a major improvement over its predecessor: even at 150 million parameters, it captures structure more accurately than the 650-million-parameter ESM-1 generation model.
Researchers said that the biggest driver of ESMFold’s performance is the language model: because there is a strong link between language-model perplexity and structure-prediction accuracy, when ESM-2 understands a protein sequence better, ESMFold achieves predictions comparable to current state-of-the-art models.
ESMFold delivers accurate atomic-resolution structure prediction, with inference an order of magnitude faster than AlphaFold2.
In practice, the speed advantage is even greater, because ESMFold does not need to search for evolutionarily related sequences to construct an MSA.
Although faster methods can reduce the search time, it remains substantial no matter how much it is cut.
The benefit of greatly shortened inference time is self-evident: the speedup makes it possible to map the structural space of large metagenomic sequence databases.
Beyond structure-based tools for identifying distant homology and conservation, rapid and accurate structure prediction with ESMFold can play an important role in the structural and functional analysis of large collections of new sequences.
Obtaining millions of predicted structures in limited time will yield new insight into the breadth and diversity of natural proteins and enable the discovery of entirely new protein structures and functions.
Introduction to the author
The co-author of this article is Zeming Lin from Meta AI.
According to his personal homepage, Zeming is a PhD student at New York University and a visiting research engineer at Meta AI, mainly responsible for back-end infrastructure.
He studied at the University of Virginia for both his bachelor's and master's degrees, where he and Yanjun Qi did research on machine learning applications, especially in protein structure prediction.
His areas of interest are deep learning, structure prediction, and bioinformatics.

