Simulating 500 million years of evolutionary information, it is the first large-scale biological model to simultaneously infer protein sequence, structure and function.

王林 | Original | 2024-06-26 20:40:11

Editor | Radish Skin

Over more than three billion years of natural evolution, the forms of today's proteins took shape through a long process of natural selection. Evolution is like a parallel experiment conducted on geological time scales, sifting proteins by sequence, structure, and function through random mutation and selection.

Here, researchers at EvolutionaryScale show that language models trained on tokens generated by evolution can serve as evolutionary simulators, generating functional proteins that are far from known protein sequences.

The researchers present ESM3, a frontier multimodal generative language model that can reason over the sequence, structure, and function of proteins. ESM3 can follow complex prompts that combine its modalities and is highly responsive to biological alignment.

The researchers used ESM3 to generate fluorescent proteins. One of the brightest has a sequence far from that of any known fluorescent protein (58% sequence identity).

The preprint of this research, "Simulating 500 million years of evolution with a language model", will be posted on the bioRxiv preprint platform in the near future.

How did natural evolution carve out the current diversity of proteins in nature over more than three billion years?

This process involves countless random mutations and natural selection events. Each step is a strict test of a protein's sequence, structure, and biological function, and only the proteins best adapted to their changing environment are retained.

Existing protein sequences therefore essentially encode the effects of biological variables accumulated over billions of years of evolution.

The EvolutionaryScale team has proposed an innovative approach that can simulate this grand evolutionary process using a multimodal generative language model called ESM3.

Video link: https://www.php.cn/link/4b816bc18d998441c4cbc6058277c844
Video: ESM3 Overview. (Source: Company official website)

ESM3 can not only understand and generate protein sequences, but also comprehensively consider the structure and function of proteins, becoming a powerful evolutionary simulation tool. This model is designed with a unique geometric attention mechanism that can efficiently process the three-dimensional structural information of proteins, which is crucial for understanding and predicting protein behavior.

Illustration: ESM3 can infer protein sequence, structure and function simultaneously. (Source: paper)

Language models operate on discrete units, or tokens. To create a model that can reason about the three fundamental properties of a protein (sequence, structure, and function), the researchers had to convert three-dimensional structure and function into discrete alphabets and devise a way to write each three-dimensional structure as a sequence of letters.

This enables ESM3 to be trained at scale, unlocking emergent generative capabilities. ESM3's vocabulary integrates sequence, structure, and function into the same language model.
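As a rough illustration of what writing a structure as a sequence of letters could look like, the sketch below assigns each residue's feature vector to the nearest entry in a small codebook, which yields one discrete token per residue. The codebook, the two-dimensional features, and the function name `quantize` are all hypothetical and chosen for readability; the actual ESM3 structure tokenizer is a learned model described in the paper, not this lookup.

```python
import math

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for vec in features:
        distances = [math.dist(vec, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

# Hypothetical 2-D codebook with three "structure letters".
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical per-residue features (for example, a summary of local backbone geometry).
residue_features = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.05, 0.1)]

structure_tokens = quantize(residue_features, CODEBOOK)
print(structure_tokens)
```

Once every residue has a structure token, the structure can be handled by a language model in the same way as the amino acid sequence.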

Illustration: ESM3 designed a scaffold for the PETase active site through multimodal prompts combining sequence, structure, and function. (Source: paper)

The training objective of ESM3 is simple. For each protein, its sequence, structure, and function are extracted, tokenized, and partially masked. ESM3 is then tasked with predicting the masked positions, using a masked language modeling objective inspired by natural language processing models.
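The masking step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the ESM3 training code: the track names, the example token values, the fixed 30% mask rate, and the helper `mask_track` are all hypothetical.

```python
import random

MASK = "<mask>"

def mask_track(tokens, mask_rate, rng):
    """Return (masked_tokens, targets); targets maps each masked position to its original token."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

rng = random.Random(0)  # fixed seed so the example is repeatable

# One toy protein with three tokenized tracks (all values are made up).
protein = {
    "sequence": list("MSKGEELFTG"),
    "structure": [12, 7, 7, 3, 3, 3, 9, 9, 1, 4],
    "function": ["none"] * 4 + ["fluorescence"] * 6,
}

masked_input, targets = {}, {}
for track, toks in protein.items():
    masked_input[track], targets[track] = mask_track(toks, 0.3, rng)

for track in protein:
    print(track, masked_input[track])
```

A model trained on such examples sees `masked_input` and is scored on how well it recovers the entries in `targets`, which forces it to use whatever remains visible in the other tracks.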

To accomplish this task, ESM3 must learn to deeply understand the connections between sequence, structure, and function in evolutionary scale data. ESM3 learns to simulate evolution when scaling to billions of proteins and billions of parameters.

ESM3 can generate functional proteins whose sequences differ from those of known proteins. The model is notable for its ability to understand and follow complex multimodal prompts while remaining highly responsive to biological alignment.

Being highly responsive to biological alignment means that the model can accurately identify and follow patterns related to biological evolution and function. Through this alignment, it can better capture how proteins evolve according to their biological roles and environmental demands, and so reflect nature's biological logic and evolutionary constraints more faithfully when designing new proteins.

ESM3 can generate new proteins in response to prompts. Its multimodal reasoning capabilities allow scientists to generate new proteins with an unprecedented degree of control. For example, the model can be prompted with a combination of structure, sequence, and function to propose potential scaffolds for the active site of PETase, an enzyme that degrades polyethylene terephthalate (PET) and a target of protein engineers seeking to break down plastic waste.

Solving harder generation problems

Illustration: The ESM3 model is evaluated on the task of generating proteins that satisfy atomic coordination prompts. (Source: paper)

ESM3's ability to solve challenging protein design tasks emerges as model scale increases. One such task is atomic coordination: designing a protein from a prompt that specifies the positions of atoms in amino acids that are distant in sequence but close in structure.

This measures a model's ability to achieve atomic-level accuracy in structure generation, which is critical for designing functional proteins. ESM3 solves progressively harder generation problems of this kind as its scale increases.
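To make "distant in sequence but close in structure" concrete, the sketch below scans a toy backbone for residue pairs that are at least 10 positions apart in the chain yet within 8 Å of each other in space. The thresholds, the hairpin-shaped coordinates, and the function name `coordinating_pairs` are illustrative assumptions and do not reproduce the evaluation criteria used in the paper.

```python
import math

def coordinating_pairs(coords, min_seq_sep=10, max_dist=8.0):
    """Return (i, j, distance) for residues far apart in sequence but close in space."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + min_seq_sep, len(coords)):
            d = math.dist(coords[i], coords[j])
            if d <= max_dist:
                pairs.append((i, j, d))
    return pairs

# Toy hairpin: six residues run along the x axis, then six run back 5 angstroms away.
SPACING = 3.8  # approximate distance between consecutive alpha carbons, in angstroms
coords = [(SPACING * i, 0.0, 0.0) for i in range(6)]
coords += [(SPACING * (11 - i), 5.0, 0.0) for i in range(6, 12)]

pairs = coordinating_pairs(coords)
for i, j, d in pairs:
    print(f"residues {i} and {j}: {d:.2f} angstroms apart")
```

A coordination prompt fixes contacts like these in advance, and a generated protein satisfies the prompt only if its predicted structure brings the specified residues together.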

ESM3 is further improved through feedback, using an alignment method similar to the reinforcement learning from human feedback (RLHF) applied to LLMs. Instead of receiving feedback from humans, ESM3 can improve itself by providing feedback on the quality of its own generations. Feedback from wet-lab experiments or existing experimental data can also be used to align ESM3's generations with biology.
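One common way to turn such feedback into training signal in RLHF-style methods is to rank candidate generations by a score and pair better ones with worse ones. The sketch below shows only that pairing step, and it is an assumption about the general technique rather than a description of the ESM3 alignment procedure; `toy_score` is a deliberately meaningless stand-in for a model-derived quality metric or a wet-lab measurement.

```python
def build_preference_pairs(candidates, score_fn):
    """Rank candidates by score and pair the top half with the bottom half as (chosen, rejected)."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    half = len(ranked) // 2
    chosen = ranked[:half]
    rejected = ranked[len(ranked) - half:]
    return list(zip(chosen, rejected))

def toy_score(sequence):
    """Placeholder quality score: fraction of glycine residues (no biological meaning)."""
    return sequence.count("G") / len(sequence)

# Hypothetical generations sampled from a model.
candidates = ["AAAA", "AGAG", "GGGG", "GAAA"]

pairs = build_preference_pairs(candidates, toy_score)
for chosen, rejected in pairs:
    print(f"prefer {chosen} over {rejected}")
```

Pairs of this form can then feed a preference-based fine-tuning objective, with the score supplied by the model itself or by experimental data.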

Spanning 500 million years of natural evolutionary distance

The researchers used ESM3 to design a new fluorescent protein, esmGFP, which shares only 58% sequence identity with the most similar known fluorescent protein, a degree of divergence that was extremely rare in previous artificial designs.

By prompting ESM3 with the sequence and structural features needed to form and catalyze the chromophore reaction of a fluorescent protein, and refining the designs over a series of iterations, the researchers arrived at esmGFP, which fluoresces brightly.

Illustration: esmGFP compared with known fluorescent proteins. (Source: paper)

This protein is not only significantly different from known proteins in sequence, but also exhibits similar fluorescence intensity to common fluorescent proteins in experiments. This equates to a natural evolutionary distance spanning more than 500 million years.
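The 58% figure is a measure of how many aligned positions two sequences share. The sketch below computes percent identity for two sequences that have already been aligned, skipping positions where either has a gap. The example sequences are made up, and the choice of denominator (positions where both sequences have a residue) is one of several conventions, so it may not match the exact definition used in the paper.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free aligned positions at which the two sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Hypothetical pre-aligned fragments; "-" marks an alignment gap.
identity = percent_identity("MSKGEELFTG", "MAKG-ELFSG")
print(f"{identity:.1f}% identity")
```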

EvolutionaryScale is a public benefit company. Its mission is to develop artificial intelligence to understand biology for the benefit of human health and society, through collaboration with the scientific community and open, safe, and responsible research. Since its inception, the ESM project has been committed to open science through code and model releases, and the team intends to continue doing so.

The company was founded in July 2023, has completed a US$142 million seed round, and has established partnerships with Amazon and NVIDIA.

ESM related code: https://github.com/evolutionaryscale/esm
Paper link: https://evolutionaryscale-public.s3.us-east-2.amazonaws.com/research/esm3.pdf
Related reports:
https://www.evolutionaryscale.ai/blog/esm3-release
https://twitter.com/ylecun/status/1805634811773571496
https://twitter.com/ylecun/status/1805581310548697360
https://x.com/ebetica/status/1805599844246884677
https://www.businesswire.com/news/home/20240625717839/en/
