We introduce MolE, a transformer-based model for molecular graph learning. MolE works directly with molecular graphs by providing atom identifiers as input tokens and graph connectivity as relative positional information: atom identifiers are computed by hashing several atomic properties into a single integer, and connectivity is encoded as a topological distance matrix.
MolE is a transformer model designed specifically for molecular graphs. It works directly with graphs by providing atom identifiers as input tokens and graph connectivity as relative position information. Atom identifiers are calculated by hashing different atomic properties into a single integer. In particular, this hash contains the following information:
- number of neighboring heavy atoms,
- number of neighboring hydrogen atoms,
- valence minus the number of attached hydrogens,
- atomic charge,
- atomic mass,
- attached bond types,
- and ring membership.
Atom identifiers (also known as atom environments of radius 0) were computed using the Morgan algorithm as implemented in RDKit.
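As an illustration, the snippet below extracts radius-0 Morgan environments with RDKit and uses them as integer atom identifiers. The SMILES string is only an example, and the exact atomic invariants and hashing used by MolE may differ from RDKit's defaults, so this is a sketch rather than the model's actual tokenizer.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Example molecule (aspirin); any SMILES would do.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Radius-0 Morgan environments: one hashed integer per atom.
info = {}
AllChem.GetMorganFingerprint(mol, 0, bitInfo=info)

# info maps hashed identifier -> tuple of (atom_index, radius) pairs;
# invert it so each atom index gets its radius-0 identifier (token).
atom_id = {}
for identifier, environments in info.items():
    for atom_idx, radius in environments:
        if radius == 0:
            atom_id[atom_idx] = identifier

tokens = [atom_id[i] for i in range(mol.GetNumAtoms())]
print(tokens)  # one integer identifier per atom
```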
In addition to tokens, MolE also takes graph connectivity information as input, which is an important inductive bias since it encodes the relative position of atoms in the molecular graph. In this case, the graph connectivity is given as a topological distance matrix \(d\), where \(d_{ij}\) corresponds to the length of the shortest path over bonds separating atom \(i\) from atom \(j\).
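Such a topological distance matrix can be obtained directly from RDKit, as sketched below; this illustrates the input described above, not MolE's preprocessing code.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # same example molecule

# d[i][j] = number of bonds on the shortest path between atoms i and j.
d = Chem.GetDistanceMatrix(mol)
print(d.shape)  # (num_atoms, num_atoms)
print(d[0])     # topological distances from atom 0 to every other atom
```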
MolE uses a Transformer as its base architecture, an architecture that has also been applied to graphs previously. The performance of transformers can be attributed in large part to the extensive use of the self-attention mechanism. In standard transformers, the input tokens are embedded into queries, keys and values \(Q, K, V \in \mathbb{R}^{N\times d}\), which are used to compute self-attention as:

$$H_{0} = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$

where \(H_{0} \in \mathbb{R}^{N\times d}\) are the output hidden vectors after self-attention, and \(d\) is the dimension of the hidden space.
In order to explicitly carry positional information through each layer of the transformer, MolE uses the disentangled self-attention from DeBERTa:

$$A_{ij} = Q_i^{c}\,{K_j^{c}}^{\top} + Q_i^{c}\,{K_{i,j}^{p}}^{\top} + K_j^{c}\,{Q_{i,j}^{p}}^{\top}, \qquad H_{0} = \mathrm{softmax}\!\left(\frac{A}{\sqrt{3d}}\right)V^{c}$$
where \(Q^{c}, K^{c}, V^{c} \in \mathbb{R}^{N\times d}\) are context queries, keys and values that contain token information (used in standard self-attention), and \(Q_{i,j}^{p}, K_{i,j}^{p} \in \mathbb{R}^{N\times d}\) are the position queries and keys that encode the relative position of the \(i\)-th atom with respect to the \(j\)-th atom. The use of disentangled attention makes MolE invariant with respect to the order of the input atoms.
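The following is a minimal NumPy sketch of disentangled self-attention in the spirit of DeBERTa, with relative positions looked up from the topological distance matrix. The projection setup, the clipping of distances, and the \(1/\sqrt{3d}\) scaling are illustrative assumptions, not MolE's exact implementation.

```python
import numpy as np

def disentangled_attention(X, d_top, Wq_c, Wk_c, Wv_c, pos_emb_q, pos_emb_k):
    """Sketch of DeBERTa-style disentangled self-attention.

    X          : (N, d) atom token embeddings
    d_top      : (N, N) topological distance matrix (shortest path in bonds)
    Wq_c, Wk_c, Wv_c     : (d, d) projections for context queries/keys/values
    pos_emb_q, pos_emb_k : (max_dist + 1, d) relative-position embeddings,
                           indexed by the (clipped) topological distance
    """
    N, d = X.shape
    Qc, Kc, Vc = X @ Wq_c, X @ Wk_c, X @ Wv_c

    # Relative-position queries/keys looked up from the distance matrix.
    dist = np.clip(d_top.astype(int), 0, pos_emb_q.shape[0] - 1)
    Qp = pos_emb_q[dist]  # (N, N, d)
    Kp = pos_emb_k[dist]  # (N, N, d)

    # Content-to-content, content-to-position and position-to-content terms.
    c2c = Qc @ Kc.T                        # (N, N)
    c2p = np.einsum("id,ijd->ij", Qc, Kp)  # (N, N)
    p2c = np.einsum("jd,ijd->ij", Kc, Qp)  # (N, N)

    A = (c2c + c2p + p2c) / np.sqrt(3 * d)
    A = A - A.max(axis=1, keepdims=True)   # numerical stability
    W = np.exp(A)
    W = W / W.sum(axis=1, keepdims=True)   # row-wise softmax
    return W @ Vc                          # (N, d) output hidden vectors
```

In this sketch, `X` would hold the embedded radius-0 atom tokens and `d_top` the matrix returned by `Chem.GetDistanceMatrix`; because the positional terms depend only on pairwise topological distances, permuting the input atoms simply permutes the output rows, which reflects the order-invariance mentioned above.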
As mentioned earlier, self-supervised pretraining can effectively transfer information from large unlabeled datasets to smaller labeled datasets. Here we present a two-step pretraining strategy. The first step is a self-supervised approach to learn chemical structure representations. For this we use a BERT-like approach in which each atom is randomly selected with a probability of 15%; of the selected tokens, 80% are replaced by a mask token, 10% are replaced by a random token from the vocabulary, and 10% are left unchanged. Unlike BERT, the prediction task is not to recover the identity of the masked token but to predict the corresponding atom environment (or functional atom environment) of radius 2, i.e., all atoms that are separated from the masked atom by two or fewer bonds. It is important to keep in mind that we used different tokenization strategies for inputs (radius 0) and labels (radius 2) and that input tokens do not contain overlapping data from neighboring atoms, to avoid information leakage. This incentivizes the model to aggregate information from neighboring atoms while learning local molecular features. MolE learns via a classification task where each atom environment of radius 2 has a predefined label, in contrast to the Context Prediction approach, where the task is to match the embedding of atom environments of radius 4 to the embedding of context atoms (i.e., surrounding atoms beyond radius 4) via negative sampling.

The second step uses graph-level supervised pretraining with a large labeled dataset. As proposed by Hu et al., combining node- and graph-level pretraining helps to learn local and global features that improve the final prediction performance. More details regarding the pretraining steps can be found in the Methods section.
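Below is a minimal sketch of the BERT-like token corruption described above (15% selection, then 80/10/10 replacement). The function name, the vocabulary handling, and the omission of the radius-2 environment labels are simplifications for illustration, not MolE's actual pretraining code.

```python
import random

def corrupt_atom_tokens(tokens, vocab, mask_token, p_select=0.15, seed=0):
    """Select each atom token with probability p_select; of the selected
    positions, replace 80% with the mask token, 10% with a random vocabulary
    token, and leave 10% unchanged. The radius-2 atom-environment labels used
    as prediction targets for the selected positions are not shown here."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    selected = []  # positions whose radius-2 environment label is predicted
    for i in range(len(tokens)):
        if rng.random() < p_select:
            selected.append(i)
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token (10% of selections)
    return corrupted, selected

# Illustrative call with made-up integer tokens and vocabulary:
masked, targets = corrupt_atom_tokens([11, 42, 42, 7, 99], vocab=[7, 11, 42, 99], mask_token=-1)
```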
MolE was pretrained using an ultra-large database of ~842 million molecules from ZINC and ExCAPE-DB, employing a self-supervised scheme (with an auxiliary loss) followed by supervised pretraining with ~456K molecules (see the Methods section for more details). We assess the quality of the molecular embedding by finetuning MolE on a set of downstream tasks. In this case, we use the set of 22 ADMET tasks included in the Therapeutics Data Commons (TDC) benchmark. This benchmark is composed of 9 regression and 13 binary classification tasks on datasets that range from hundreds of compounds (e.g., DILI with 475 compounds) to thousands of compounds (such as the CYP inhibition tasks with ~13,000 compounds).
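For reference, here is a sketch of how one of the TDC ADMET tasks can be loaded with the PyTDC package for finetuning experiments; the benchmark name and seed shown are illustrative, and the finetuning of MolE itself is omitted.

```python
# Hedged sketch: loading one TDC ADMET task (requires `pip install PyTDC`).
from tdc.benchmark_group import admet_group

group = admet_group(path="data/")    # downloads the ADMET benchmark datasets
benchmark = group.get("Caco2_Wang")  # one of the 22 ADMET tasks

name = benchmark["name"]
train_val, test = benchmark["train_val"], benchmark["test"]

# Official train/validation split for a given seed.
train, valid = group.get_train_valid_split(benchmark=name, seed=42)

print(name, len(train), len(valid), len(test))
```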