Multi-purpose RNA analysis: Baidu team's Transformer-based RNA language model is published in a Nature sub-journal

WBOY | 2024-06-10 22:21:12

Editor | Luoboxin

Pre-trained language models have shown promise in analyzing nucleotide sequences, but building a multi-purpose model that performs well across different tasks with a single set of pre-trained weights remains challenging.

The Baidu Big Data Lab (BDL) and Shanghai Jiao Tong University teams developed RNAErnie, an RNA-centered pre-training model based on the Transformer architecture.

The researchers evaluated the model on seven datasets and five tasks, demonstrating RNAErnie’s superiority in both supervised and unsupervised learning.

RNAErnie surpasses the baselines with up to 1.8% higher accuracy in classification, 2.2% higher accuracy in interaction prediction, and a 3.3% higher F1 score in structure prediction, demonstrating its robustness and adaptability.

The study is titled "Multi-purpose RNA language modeling with motif-aware pretraining and type-guided fine-tuning" and was published on May 13, 2024 in "Nature Machine Intelligence".

RNA plays a key role in the central dogma of molecular biology, carrying the genetic information in DNA through to proteins.

RNA molecules play a vital role in a variety of cellular processes including gene expression, regulation and catalysis. Given the importance of RNA in biological systems, there is a growing need for efficient and accurate analysis methods for RNA sequences.

Traditional RNA sequence analysis relies on experimental techniques such as RNA sequencing and microarrays, but these methods are often costly, time-consuming, and require large amounts of RNA input.

To address these challenges, the Baidu BDL and Shanghai Jiao Tong University teams developed a pre-trained RNA language model: RNAErnie.

RNAErnie

The model is built on the Enhanced Representation through Knowledge Integration (ERNIE) framework and consists of multi-layer, multi-head Transformer blocks, each with a hidden state dimension of 768. Pre-training uses an extensive corpus of approximately 23 million RNA sequences carefully selected from RNAcentral.

The proposed motif-aware pre-training strategy combines base-level masking, sub-sequence-level masking, and motif-level random masking. This captures knowledge at the sub-sequence and motif levels and enriches the representation of RNA sequences.
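The three masking levels can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `[MASK]` token string, the masking rate, the span length, and the motif list are all assumptions chosen for illustration.

```python
import random

MASK = "[MASK]"  # assumed mask token string


def base_level_mask(tokens, rate, rng):
    """Mask each base independently with probability `rate`."""
    return [MASK if rng.random() < rate else t for t in tokens]


def subsequence_level_mask(tokens, span_len, rng):
    """Mask one randomly placed contiguous span of `span_len` bases."""
    if len(tokens) <= span_len:
        return [MASK] * len(tokens)
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + [MASK] * span_len + tokens[start + span_len:]


def motif_level_mask(tokens, motifs, rng):
    """Mask every occurrence of one randomly chosen motif found in the sequence."""
    seq = "".join(tokens)  # tokens are single bases, so indices line up
    present = [m for m in motifs if m in seq]
    if not present:
        return list(tokens)
    motif = rng.choice(present)
    out = list(tokens)
    start = seq.find(motif)
    while start != -1:
        out[start:start + len(motif)] = [MASK] * len(motif)
        start = seq.find(motif, start + 1)
    return out


rng = random.Random(0)
tokens = list("AUGGCUACGUAGC")
print(base_level_mask(tokens, 0.15, rng))
print(subsequence_level_mask(tokens, 4, rng))
print(motif_level_mask(tokens, ["GCUAC"], rng))  # "GCUAC" is a made-up motif
```

In a real pre-training pipeline the motif list would come from a curated motif resource rather than a hard-coded string, and the model would be trained to recover the masked bases.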

Additionally, RNAErnie tokenizes coarse-grained RNA types as special vocabulary entries and appends the corresponding type token to the end of each RNA sequence during pre-training. This gives the model the potential to discern features unique to each RNA type, which facilitates domain adaptation to various downstream tasks.
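A small Python sketch shows what appending a type token might look like. The vocabulary layout, the bracketed tag format, and the three RNA types listed are hypothetical; the article does not specify them.

```python
BASES = ["A", "U", "C", "G", "N"]
RNA_TYPES = ["miRNA", "tRNA", "rRNA"]  # hypothetical subset of coarse-grained types

# Assumed vocabulary: special tokens, then bases, then one tag per RNA type.
VOCAB = {
    tok: idx
    for idx, tok in enumerate(
        ["[PAD]", "[MASK]"] + BASES + [f"[{t}]" for t in RNA_TYPES]
    )
}


def encode_with_type(sequence, rna_type):
    """Split a sequence into base tokens, append its type tag, and map to ids."""
    tokens = list(sequence) + [f"[{rna_type}]"]
    return tokens, [VOCAB[t] for t in tokens]


tokens, ids = encode_with_type("AUG", "tRNA")
print(tokens)  # ['A', 'U', 'G', '[tRNA]']
print(ids)     # [2, 3, 5, 8]
```

Because the type tag is an ordinary vocabulary entry, it can be masked and predicted like any other token, which is one plausible way for the model to learn type-specific features.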

Illustration: Model overview. (Source: paper)

Specifically, the RNAErnie model consists of 12 Transformer layers. In the motif-aware pre-training stage, RNAErnie is trained on a dataset of approximately 23 million sequences extracted from the RNAcentral database, using self-supervised learning with motif-aware multi-level random masking.

Illustration: Motif-aware pre-training and type-guided fine-tuning strategy. (Source: paper)

In the type-guided fine-tuning stage, RNAErnie first uses the output embeddings to predict the most likely coarse-grained RNA types, and then uses the predicted types as auxiliary information to fine-tune the model through task-specific heads.

This approach enables the model to adapt to various RNA types and enhances its utility in a wide range of RNA analysis tasks.
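The two-step idea (predict likely types, then feed them back as auxiliary information) can be sketched as follows. This is only an illustration under stated assumptions: the type scores are given as a plain dictionary rather than computed from embeddings, the auxiliary information is represented by appending a bracketed type tag as in pre-training, and the function names and the value of `k` are hypothetical.

```python
def top_k_types(type_scores, k):
    """Return the k RNA type names with the highest predicted scores."""
    ranked = sorted(type_scores, key=type_scores.get, reverse=True)
    return ranked[:k]


def type_guided_inputs(sequence, type_scores, k=2):
    """Build one candidate input per predicted type by appending that type's tag."""
    return [list(sequence) + [f"[{t}]"] for t in top_k_types(type_scores, k)]


# Hypothetical scores that a type-prediction step might produce.
scores = {"miRNA": 0.1, "tRNA": 0.7, "rRNA": 0.2}
for candidate in type_guided_inputs("AUG", scores, k=2):
    print(candidate)
# ['A', 'U', 'G', '[tRNA]']
# ['A', 'U', 'G', '[rRNA]']
```

Each candidate input would then be passed to a task-specific head, and the per-type outputs could be combined, for example by averaging.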

More specifically, to handle the distribution shift between the pre-training dataset and the target domain, RNAErnie uses domain adaptation to combine the pre-trained backbone with downstream modules in three neural architectures: Frozen Backbone with Trainable Heads (FBTH), Trainable Backbone with Trainable Heads (TBTH), and Stacking for Type-Guided Fine-Tuning (STACK).

In this way, the proposed method can either optimize the backbone and the task-specific heads end-to-end, or fine-tune the task-specific heads on embeddings extracted from the frozen backbone, depending on the downstream application.
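The difference between a frozen and a trainable backbone can be shown with a deliberately tiny Python toy. It is not the RNAErnie code: the "backbone" and "head" are single scalar weights, the data are made up, and STACK is not covered. The only point is which parameters receive gradient updates.

```python
def train(backbone_w, head_w, data, freeze_backbone, lr=0.01, epochs=500):
    """Fit y ~ head_w * (backbone_w * x) by gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in data:
            feat = backbone_w * x                 # "embedding" from the backbone
            err = head_w * feat - y               # prediction error
            grad_head = 2 * err * feat            # d(err^2) / d(head_w)
            grad_backbone = 2 * err * head_w * x  # d(err^2) / d(backbone_w)
            head_w -= lr * grad_head
            if not freeze_backbone:               # TBTH also updates the backbone
                backbone_w -= lr * grad_backbone
    return backbone_w, head_w


data = [(1.0, 3.0), (2.0, 6.0)]  # toy target: y = 3x

# FBTH-style: the backbone weight stays at its "pre-trained" value.
fb_backbone, fb_head = train(1.0, 0.5, data, freeze_backbone=True)

# TBTH-style: backbone and head are optimized together.
tb_backbone, tb_head = train(1.0, 0.5, data, freeze_backbone=False)

print(fb_backbone, fb_head)
print(tb_backbone, tb_head)
```

In both cases the product of the two weights approaches 3, but only the trainable-backbone run moves the backbone away from its initial value. Freezing is cheaper and preserves the pre-trained representation, while end-to-end training can adapt it to the target domain.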

Performance Evaluation

Illustration: RNAErnie captures multi-level ontology patterns. (Source: paper)

The researchers evaluated the method on seven RNA sequence datasets covering more than 17,000 major RNA motifs, 20 RNA types, and 50,000 RNA sequences. The results showed that RNAErnie outperformed existing state-of-the-art methods.

Illustration: RNAErnie performance on the RNA secondary structure prediction task using the ArchiveII600 and TS0 datasets. (Source: paper)

Evaluation against 30 mainstream RNA sequencing technologies demonstrates RNAErnie's generalization and robustness. The team used accuracy, precision, recall, F1 score, MCC, and AUC as evaluation metrics to ensure a fair comparison of RNA sequence analysis methods.
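For reference, most of the metrics listed above follow directly from the confusion-matrix counts. The sketch below covers the binary case only and leaves out AUC, which needs ranked prediction scores rather than hard labels; the function name and the example labels are made up.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1, and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75, 'mcc': 0.5}
```

MCC is useful alongside accuracy because it stays informative when the classes are imbalanced.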

Currently, few studies apply Transformer architectures enhanced with external knowledge to RNA sequence data analysis. The RNAErnie framework, built from scratch, integrates RNA sequence embedding with self-supervised learning strategies to bring superior performance, interpretability, and generalization potential to downstream RNA tasks.

Additionally, RNAErnie can be adapted to other tasks by modifying the outputs and supervision signals. RNAErnie is publicly available and is an efficient tool for type-guided RNA analysis and advanced applications.

Limitations

Although the RNAErnie model is innovative in RNA sequence analysis, it still faces some challenges.

First, the model is limited by the length of the RNA sequences it can analyze: sequences longer than 512 nucleotides are discarded, potentially overlooking important structural and functional information. Chunking methods developed to handle longer sequences may cause further loss of information about long-range interactions.
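The trade-off between discarding and chunking long sequences can be made concrete with a short Python sketch. The 512 limit comes from the article; the function names and the optional `overlap` parameter are assumptions, and the article does not describe the exact chunking scheme.

```python
MAX_LEN = 512  # length limit stated in the article


def filter_by_length(sequences, max_len=MAX_LEN):
    """Keep only sequences that fit within the length limit."""
    return [s for s in sequences if len(s) <= max_len]


def chunk_sequence(sequence, max_len=MAX_LEN, overlap=0):
    """Split a sequence into windows of at most `max_len`, sharing `overlap` bases."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    stop = max(len(sequence) - overlap, 1)  # always emit at least one window
    return [sequence[i:i + max_len] for i in range(0, stop, step)]


long_seq = "A" * 1000
print(len(filter_by_length(["A" * 10, long_seq])))   # 1: the long sequence is dropped
print([len(c) for c in chunk_sequence(long_seq)])    # [512, 488]
```

With chunking, a base at position 5 and a base at position 900 end up in different windows, so any pairing between them is invisible to a model that processes each window independently. This is the long-range information loss described above.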

Second, the scope of this study is narrow: it covers only the RNA domain and does not extend to tasks such as RNA-protein prediction or binding site identification. Additionally, the model has difficulty accounting for RNA's three-dimensional structural motifs, such as loops and junctions, which are critical to understanding RNA function.

More importantly, existing post-hoc architecture designs also have potential limitations.

Conclusion

Nonetheless, RNAErnie has great potential to advance RNA analysis. The model demonstrates its versatility and effectiveness as a general solution in different downstream tasks.

In addition, the innovative strategies adopted by RNAErnie are expected to enhance the performance of other pre-trained models in RNA analysis. These findings make RNAErnie a valuable asset, providing researchers with a powerful tool to unravel the complexities of RNA-related research.

Paper link: https://www.nature.com/articles/s42256-024-00836-4
