
Meta lets a 15 billion parameter language model learn to design "new" proteins from scratch! LeCun: Amazing results

王林 · 2023-04-13

AI has made new progress in biomedicine once again. Yes, this time it's about proteins.

The difference is that in the past AI merely discovered protein structures, whereas this time it has begun to design and generate them on its own. If it was a "predictor" before, it is fair to say it has now evolved into a "creator".

The research comes from the protein team at FAIR, Meta's AI research institute. Yann LeCun, who has been the company's chief AI scientist for many years, promptly shared the team's results and spoke highly of them.


The two papers, posted on bioRxiv, are Meta's "amazing" achievements in protein design and generation. The system uses a simulated annealing algorithm to find an amino acid sequence that folds into a desired shape or satisfies constraints such as symmetry.
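
Simulated annealing itself is a classic black-box optimizer. Purely as an illustration, a minimal annealing loop over amino acid sequences might look like the sketch below, where `energy` is a hypothetical stand-in for the real objective (which would score how well the predicted structure of a sequence satisfies the design constraints):

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def energy(seq: str) -> float:
    """Hypothetical stand-in for the real objective, which would score how
    well the predicted structure of `seq` satisfies the design constraints."""
    return -sum(seq.count(a) ** 0.5 for a in set(seq))  # toy diversity score

def anneal(length: int = 100, steps: int = 10_000, t0: float = 1.0) -> str:
    seq = "".join(random.choice(AMINO_ACIDS) for _ in range(length))
    e = energy(seq)
    for step in range(steps):
        t = t0 * (1 - step / steps)  # linear cooling schedule
        # Propose a single-residue mutation.
        i = random.randrange(length)
        cand = seq[:i] + random.choice(AMINO_ACIDS) + seq[i + 1:]
        e_cand = energy(cand)
        # Metropolis criterion: always accept improvements, sometimes accept
        # worse moves early on (high temperature) to escape local minima.
        if e_cand < e or random.random() < math.exp((e - e_cand) / max(t, 1e-9)):
            seq, e = cand, e_cand
    return seq

print(anneal(length=30, steps=2_000))
```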


ESM2, a model for predicting structure at atomic resolution

You guessed it: the foundation of both papers is ESM2, the large language model for protein prediction and discovery that Meta introduced not long ago.

ESM2 is a large model with 15 billion parameters. As the model scales from 8 million to 15 billion parameters, information emerging in its internal representations enables three-dimensional structure prediction at atomic resolution.


By using a large language model to learn evolutionary patterns, accurate structure predictions can be generated end to end directly from protein sequences, while remaining up to 60 times faster than current state-of-the-art methods.
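
Meta has open-sourced ESM2, so readers can inspect these representations themselves. A minimal sketch, assuming the `fair-esm` package (`pip install fair-esm`) and using a small checkpoint rather than the 15-billion-parameter model:

```python
import torch
import esm

# Load a small ESM2 checkpoint (650M parameters) for speed.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=True)

embeddings = out["representations"][33]  # per-residue representations
contacts = out["contacts"]               # predicted residue-residue contacts
print(embeddings.shape, contacts.shape)
```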

In fact, armed with this new structure prediction capability, Meta used a cluster of roughly 2,000 GPUs to predict the structures of the more than 600 million metagenomic proteins in its atlas in just two weeks.

Alex Rives of Meta AI, corresponding author of both papers, said that the versatility demonstrated by the ESM2 language model not only extends beyond the scope of natural proteins, but also enables the programmable generation of complex and modular protein structures.

A "specialized programming language" for protein design

As the saying goes: to do a good job, one must first sharpen one's tools.

To make protein design and generation more efficient, the researchers also developed a high-level programming language tailored to proteins.

Paper address: https://www.biorxiv.org/content/10.1101/2022.12.21.521526v1

Alex Rives, one of the leads of the research and corresponding author of the paper "A high-level programming language for generative protein design", said on social media that this result makes it possible to program the generation of structures for large, complex, and modular proteins and assemblies.

Brian Hie, one of the paper's authors and a researcher at Stanford University, also explained the main ideas and results of the work on Twitter.

Overall, the paper describes how generative machine learning enables the modular design of complex proteins, controlled by a high-level programming language for protein design.


He stated that the main idea of the paper is not to work with building blocks of sequences or structures, but to place modularity at a higher level of abstraction and let black-box optimization generate the specific designs. Atomic-level structure is predicted at every step of the optimization.


Compared with previous protein design methods, this new approach lets designers specify arbitrary, non-differentiable constraints, ranging from atomic-level coordinates to abstract design goals for proteins, such as symmetry.

Modularity of constraints is essential for programmability. For example, the figure below shows the same constraint applied hierarchically at two levels of a symmetry program.

These constraints are also easy to recombine. For example, constraints on atomic coordinates can be combined with constraints on symmetry, or two different levels of symmetry can be combined to program an asymmetric composite structure, as in the sketch below.
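
To make "recombinable constraints" concrete, here is a schematic sketch; every name in it is hypothetical rather than the paper's actual language. Constraints are plain functions that score a candidate structure, so combining them is just summation, and symmetry terms can be stacked hierarchically:

```python
from typing import Callable, List, Sequence, Tuple

Coords = List[Tuple[float, float, float]]   # one (x, y, z) per residue
Constraint = Callable[[Coords], float]      # lower score = better

def coordinate_constraint(targets: dict) -> Constraint:
    """Penalize deviation of selected residues from target coordinates."""
    def score(xyz: Coords) -> float:
        return sum(
            sum((a - b) ** 2 for a, b in zip(xyz[i], t))
            for i, t in targets.items()
        )
    return score

def symmetry_constraint(n_fold: int) -> Constraint:
    """Toy C_n symmetry term: compare each 1/n chunk of the chain to the
    first chunk (a real term would compare after superposition)."""
    def score(xyz: Coords) -> float:
        k = len(xyz) // n_fold
        ref = xyz[:k]
        return sum(
            sum((a - b) ** 2 for a, b in zip(xyz[j * k + i], ref[i]))
            for j in range(1, n_fold) for i in range(k)
        )
    return score

def combine(constraints: Sequence[Constraint]) -> Constraint:
    """Constraints are modular: recombination is just summation."""
    return lambda xyz: sum(c(xyz) for c in constraints)

# Compose atomic-coordinate and two-level symmetry constraints freely.
program = combine([
    coordinate_constraint({0: (0.0, 0.0, 0.0)}),
    symmetry_constraint(2),   # inner symmetry
    symmetry_constraint(4),   # outer symmetry, applied hierarchically
])
```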


Brian Hie believes this result is a step toward more controllable, regular, and expressive protein design. He also thanked Meta AI and the other collaborators for their joint efforts.

Making protein design "like constructing a building"

In the paper, the researchers argue that protein design would benefit from a basic set of abstractions, like those used in engineering buildings, machines, circuits, and computer software, which provide regularity, simplicity, and programmability.

But unlike these artificial creations, proteins cannot be broken down into easily recombined parts, because the local structure of a sequence is entangled with its overall environment. Classic ab initio protein design attempts to identify a set of basic structural building blocks and then assemble them into higher-order structures.

Similarly, traditional protein engineering often recombines fragments or domains of native protein sequences into hybrid chimeras. However, existing approaches have not been able to achieve the high combinatorial complexity required for true programmability.

This paper demonstrates that modern generative models achieve the classic goals of modularity and programmability at a new level of combinatorial complexity. By placing modularity and programmability at a higher level of abstraction, generative models bridge the gap between human intuition and the generation of specific sequences and structures.

In this scheme, the protein designer only needs to recombine high-level instructions; the task of producing a protein that satisfies those instructions falls to the generative model.

The researchers propose a programming language for generative protein design that allows designers to specify intuitive, modular, and hierarchical programs. High-level programs can be translated into low-level sequences and structures by a generative model. The approach leverages advances in protein language models, which learn structural information and design principles of proteins.


The concrete implementation in this study is based on an energy-based generative model, as outlined in the paper's overview figure.

First, a protein designer specifies a high-level program consisting of a set of hierarchically organized constraints (Figure A).

The program is then compiled into an energy function that evaluates compatibility with the constraints, which can be arbitrary and non-differentiable (Figure B).

Structural constraints are applied by incorporating atomic-level structure predictions (enabled by the language model) into the energy function. This approach can generate a wide range of complex designs (Figure C).
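
Read together, the three panels suggest the following shape for the system; the sketch below is one hedged interpretation, with `fold` as a stub for the language-model-based structure predictor, not the paper's actual implementation:

```python
from typing import Callable, List, Sequence, Tuple

Coords = List[Tuple[float, float, float]]

def fold(seq: str) -> Coords:
    """Stub for an atomic-level structure predictor driven by the language
    model; here it just returns placeholder backbone coordinates."""
    return [(float(i), 0.0, 0.0) for i in range(len(seq))]

def compile_program(
    constraints: Sequence[Callable[[Coords], float]],
    weights: Sequence[float],
) -> Callable[[str], float]:
    """Figure B: compile a set of hierarchical constraints into one energy
    function. The constraints may be arbitrary and non-differentiable,
    because the energy is only ever evaluated, never differentiated."""
    def energy(seq: str) -> float:
        xyz = fold(seq)  # Figure C: structure prediction inside the energy
        return sum(w * c(xyz) for c, w in zip(constraints, weights))
    return energy

# Figure A: the "program" is just a weighted list of constraints.
energy = compile_program(
    constraints=[lambda xyz: max(p[0] for p in xyz)],  # toy compactness term
    weights=[1.0],
)
print(energy("MKTAYIAKQR"))  # lower energy = more compatible design
```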

Generating protein sequences from scratch

In the paper "Language models generalize beyond natural proteins", author Tom Sercu of the Meta AI team said the work accomplished two main tasks.


Paper address: https://www.biorxiv.org/content/10.1101/2022.12.21.521521v1

The first is fixed-backbone design: producing a sequence for a given backbone structure. Using the language model, successful designs were obtained for 19 out of 20 targets, whereas sequence design without the language model succeeded on only 1 out of 20.


The second task is unconstrained generation. The research team proposed a new method for sampling (sequence, structure) pairs from an energy landscape defined by the language model.
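
As a schematic of what such a sampler could look like, consider a fixed-temperature Metropolis walk that draws (sequence, structure) pairs from the Boltzmann distribution defined by an energy; the proposal, folding step, and energy below are placeholders, not the team's method:

```python
import math
import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def fold(seq):
    """Placeholder structure predictor."""
    return [(float(i), 0.0, 0.0) for i in range(len(seq))]

def energy(seq, xyz):
    """Placeholder joint energy E(sequence, structure); in the paper the
    energy landscape is defined by the language model."""
    return -float(len(set(seq)))

def sample_pairs(n_samples=5, length=30, steps=500, temp=1.0):
    seq = "".join(random.choice(AA) for _ in range(length))
    xyz = fold(seq)
    e = energy(seq, xyz)
    samples = []
    for step in range(1, steps + 1):
        i = random.randrange(length)
        cand = seq[:i] + random.choice(AA) + seq[i + 1:]
        cand_xyz = fold(cand)
        e_cand = energy(cand, cand_xyz)
        # Metropolis rule at fixed temperature: unlike annealing, which
        # cools toward a single optimum, this samples the Boltzmann
        # distribution exp(-E/T) over (sequence, structure) pairs.
        if random.random() < min(1.0, math.exp((e - e_cand) / temp)):
            seq, xyz, e = cand, cand_xyz, e_cand
        if step % (steps // n_samples) == 0:
            samples.append((seq, xyz))
    return samples

pairs = sample_pairs()
print(len(pairs), "sampled (sequence, structure) pairs")
```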


Sampling across different topologies further increased the experimental success rate (71 out of 129, or 55%).

To demonstrate that the generated proteins go beyond the limits of natural proteins, the research team searched the language-model-generated sequences against a sequence database covering all known natural proteins.


The results showed no matches: the generated sequences, and their predicted structures, differ from those of natural proteins.

Proteins can be designed using the ESM2 protein language model alone, Sercu said. The research team experimentally tested 228 proteins, with a success rate of 67%!


Sercu believes that protein language models trained only on sequences can learn deep patterns connecting sequence to structure, and can be used to design proteins de novo, beyond the design space explored by nature.

Exploring the deep grammar of protein generation

In the paper, the Meta researchers state that although the language model is trained only on sequences, it can still learn the deep grammatical structure of proteins and design beyond the limits of natural proteins.

If the square in Figure A represents the space of all protein sequences, natural protein sequences are the gray region, covering only a small part of it. To generalize beyond natural sequences, the language model needs access to the underlying design patterns.


The research team set out to do two things: first, design protein backbones de novo; second, given a backbone, generate protein sequences for it from scratch.

The team trained ESM2 with a masked language modeling objective on millions of diverse natural proteins produced over the course of evolution.
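
The masked-language-modeling objective itself is standard BERT-style training. A minimal PyTorch sketch of one training step, with a toy model and vocabulary standing in for ESM2's actual architecture and alphabet:

```python
import torch
import torch.nn as nn

VOCAB = 20 + 1                     # 20 amino acids plus a [MASK] token
MASK_ID = VOCAB - 1
MASK_PROB = 0.15                   # standard BERT-style masking rate

model = nn.TransformerEncoder(     # toy stand-in for ESM2
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
embed = nn.Embedding(VOCAB, 64)
head = nn.Linear(64, VOCAB)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 20, (8, 100))         # a batch of protein sequences
mask = torch.rand(tokens.shape) < MASK_PROB
corrupted = tokens.masked_fill(mask, MASK_ID)   # hide the masked residues

logits = head(model(embed(corrupted)))
# The model is trained to recover the original residues at masked positions.
loss = loss_fn(logits[mask], tokens[mask])
loss.backward()
```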


Once the language model is trained, information about protein tertiary structure can be identified in its internal attention states. The researchers then converted the attention between pairs of positions in a protein sequence into a distribution over inter-residue distances via a linear projection.
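
A minimal sketch of that readout, with illustrative sizes rather than ESM2's actual ones: stack the per-head attention maps, symmetrize them (contacts are symmetric), and apply a learned linear projection per residue pair:

```python
import torch
import torch.nn as nn

layers, heads, L = 4, 8, 100                   # illustrative sizes, not ESM2's
attn = torch.rand(layers, heads, L, L)         # attention maps from the LM

feats = attn.reshape(layers * heads, L, L)
feats = 0.5 * (feats + feats.transpose(1, 2))  # symmetrize the maps
feats = feats.permute(1, 2, 0)                 # (L, L, layers*heads) per pair

# Linear projection from attention features to a distribution over
# inter-residue distance bins (a single bin would give contact probability).
n_bins = 16
proj = nn.Linear(layers * heads, n_bins)
dist_logits = proj(feats)                      # (L, L, n_bins)
dist_probs = dist_logits.softmax(dim=-1)
print(dist_probs.shape)
```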


The researchers say that the language model's ability to predict protein structure points to a deeper structure underlying natural protein sequences, and to the possibility of a deep grammar that the model can learn.

The results suggest that the vast number of protein sequences produced by evolution encode biological structure and function, revealing the design of proteins, and that this design can be recovered by machine learning models trained on protein sequences.


Protein structures successfully predicted by the language model in six experiments

The existence of a deep grammar shared across proteins would explain two seemingly contradictory sets of findings: that understanding of native proteins depends on the training data, and that language models can nonetheless predict and explore beyond known native protein families.

If the scaling laws of protein language models continue to hold, the generative capabilities of AI language models can be expected to keep improving.

Because a basic grammar of protein structure exists, the research team says, models will learn even rare protein structures, expanding their predictive power and the space they can explore.

One year ago, DeepMind open-sourced AlphaFold2, which appeared in Nature and Science and took the biology and AI research communities by storm.

A year later, AI prediction models have sprung up one after another, steadily filling gaps in the field of protein structure.

If humans have given life to artificial intelligence, could artificial intelligence in turn supply the last piece of the puzzle in the mystery of life?

