Overcoming the three major challenges of graph foundation models for the first time! HKU open-sources OpenGraph: zero-shot learning adapts to a variety of downstream tasks
Graph learning has been widely applied in many fields, including recommendation systems, social network analysis, citation networks, and transportation networks. It can efficiently mine and learn from complex relational data, providing a powerful tool for data scientists and engineers. Through graph learning algorithms, we can better understand the associations between data points and discover the laws and patterns hidden behind the data, which in practice helps us build more accurate and reliable models.
Graph Neural Networks (GNNs) use iterative message passing to capture the complex higher-order relationships in graph-structured data, and have achieved remarkable results in a variety of graph learning applications.
Usually, such end-to-end graph neural networks require a large amount of high-quality annotated data to train well.
In recent years, some work has proposed a pre-training and fine-tuning paradigm for graph models: the model is first pre-trained on unlabeled graph data with various self-supervised learning tasks, and then fine-tuned on a small amount of labeled data to counter the lack of supervision signals. The self-supervised tasks include contrastive learning, masked reconstruction, and local-global mutual information maximization.
Although such pre-training methods are successful to a certain extent, their generalization ability is limited, especially when there is a distribution shift between the pre-training data and the downstream tasks.
In recommendation systems, for example, pre-trained models are trained on early data, but user preferences and product popularity change over time, so the model must be continuously updated to adapt to new information.
To address this challenge, recent research has proposed prompt tuning methods for graph models, so that pre-trained models can adapt more effectively to different downstream tasks and data.
Although the above research has improved the generalization performance of graph neural network models, these models all rest on one assumption: the training data and test data share the same node set and feature space.
This greatly limits the application scope of pre-trained graph models. Therefore, this article explores how to further improve the generalization ability of graph models.
We expect OpenGraph to capture common topological patterns and make zero-shot predictions on test data, that is, to efficiently extract features and accurately predict on test graph data purely through forward propagation.
The model is trained on completely different graph data, and no elements of the test graph, including its nodes, edges, and feature vectors, are seen during the training phase.
To achieve this goal, this article needs to solve the following three challenges:
C1. Token mismatch across datasets. A significant difficulty in zero-shot graph prediction is that different graph datasets usually have completely different graph token sets: node sets from different graphs rarely overlap, and different datasets often use entirely different node features. This prevents a model whose parameters are bound to the graph tokens of one specific dataset from making predictions across datasets.
C2. Efficient modeling of node dependencies. In graph learning there are often intricate dependencies between nodes, and the model needs to consider both a node's local neighborhood and its global context. When building a general graph model, an important task is to model the relationships between nodes efficiently, which improves both effectiveness and scalability on large graphs.
C3. Data scarcity. Due to privacy protection, data collection costs, and other reasons, data scarcity is widespread in many downstream fields of graph learning, so a general graph model can easily lack understanding of certain downstream domains and end up with sub-optimal training results.
To address the above challenges, researchers from the University of Hong Kong proposed OpenGraph, a model that excels at zero-shot learning and can identify topological structure patterns that transfer across different downstream domains.
Paper link: https://arxiv.org/pdf/2403.01121.pdf
Source code link: https://github.com/HKUDS/OpenGraph
To address challenge C1, OpenGraph builds a graph tokenizer with a topology-aware projection scheme that produces unified graph tokens.
To cope with challenge C2, a scalable graph Transformer is designed, equipped with an efficient anchor-sampling-based self-attention mechanism and token sequence sampling for more efficient training.
To address challenge C3, we leverage large language models for data augmentation to enrich pre-training, using a prompt tree algorithm and Gibbs sampling to simulate graph-structured data with real-world relationship patterns. Extensive testing on multiple graph datasets shows OpenGraph's superior generalization capabilities in a variety of settings.
The overall structure of the model is shown in the figure below and can be divided into three parts: 1) unified graph tokenizer, 2) scalable graph Transformer, and 3) knowledge distillation from large language models.
Unified graph tokenizer

By adopting a unified node representation space and a flexible sequence data structure, we hope to perform standardized and efficient tokenization for different graph data.
To achieve this goal, our tokenizer uses smoothed topological information and a mapping function from node space to latent representation space.
High-order smoothed adjacency matrix
During graph tokenization, high-order powers of the adjacency matrix are used as one of the inputs. This not only captures high-order connectivity in the graph structure, but also alleviates the connection sparsity of the original adjacency matrix. Laplacian normalization is applied during the computation, and adjacency matrix powers of all orders up to the chosen maximum are combined.

Topology-aware mapping of arbitrary graphs
The adjacency matrices of different datasets differ hugely in dimension, which prevents us from feeding the adjacency matrix directly into a neural network with a fixed input dimension. Our solution is to first project the adjacency matrix into a sequence of node representations and then process it with a variable-length sequence model. To reduce information loss in this mapping, we propose a topology-aware mapping method. First, the range of our topology-aware mapping is a high-dimensional latent representation space; previous work has pointed out that even random mappings can produce satisfactory representations when the latent dimension is large enough. To further preserve graph structure information and reduce the impact of randomness, we construct the mapping function with fast singular value decomposition (SVD). In our experiments, two rounds of fast SVD retain topological information effectively, and the resulting computational overhead is negligible compared with the other modules.
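As a rough illustration, the tokenizer can be sketched as follows. This is a minimal reconstruction of the idea rather than the released implementation: the function names, the use of scipy's truncated SVD in place of the paper's fast eigen-decomposition, and the default dimensions are all assumptions.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

def smoothed_adjacency(adj: sp.csr_matrix, order: int = 2) -> sp.csr_matrix:
    """Laplacian-normalized adjacency accumulated over powers 1..order,
    capturing high-order connectivity and easing sparsity."""
    deg = np.asarray(adj.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(np.power(np.maximum(deg, 1e-12), -0.5))
    a_norm = (d_inv_sqrt @ adj @ d_inv_sqrt).tocsr()   # D^{-1/2} A D^{-1/2}
    smoothed, power = a_norm.copy(), a_norm
    for _ in range(order - 1):
        power = (power @ a_norm).tocsr()
        smoothed = smoothed + power
    return smoothed

def topology_aware_tokens(adj: sp.csr_matrix, dim: int = 256) -> np.ndarray:
    """Project nodes of an arbitrarily sized graph into a shared latent space
    via truncated SVD of the smoothed adjacency; returns one token per node."""
    mat = smoothed_adjacency(adj).astype(np.float64)
    k = min(dim, min(mat.shape) - 1)                   # rank of the truncated SVD
    u, s, _ = svds(mat, k=k)
    return u * np.sqrt(s)                              # topology-aware node tokens

# Tokens from two unrelated graphs now live in the same latent space,
# so a single graph Transformer can consume either of them:
# tokens_a = topology_aware_tokens(adj_graph_a)        # shape (n_a, 256)
# tokens_b = topology_aware_tokens(adj_graph_b)        # shape (n_b, 256)
```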
Scalable graph Transformer

Through this parameter-free graph tokenization process, OpenGraph assigns unified, topology-aware token representations to graph data with very different characteristics. The next task is to model the complex dependencies between nodes with a trainable neural network. OpenGraph adopts the Transformer architecture to take advantage of its strength in modeling complex relationships, and introduces the following two sampling techniques to balance efficiency and performance.

Token sequence sampling
Since the graph token sequences generally contain a large number of tokens with high-dimensional latent representations, OpenGraph's graph Transformer samples the input token sequence and only learns the pairwise relationships between tokens in the current training batch. This reduces the number of relationship pairs that need to be modeled from the square of the number of nodes to the square of the training batch size, greatly cutting the time and space overhead of the graph Transformer during training, and it also lets the model focus more on the current training batch. Although the input is sampled, the initial graph token representations already encode the topological relationships between nodes, so the sampled token sequence can still reflect, to a certain extent, the characteristics of all nodes in the graph.

Anchor sampling in self-attention
Although token sequence sampling reduces the complexity from quadratic in the number of nodes to quadratic in the batch size, the quadratic term still strongly limits the batch size, preventing training with larger batches and thus affecting overall training time and stability. To alleviate this, the Transformer part of OpenGraph gives up modeling pairwise relationships among all tokens and instead samples a set of anchor points, splitting relationship learning among all nodes into two stages of relationship learning between nodes and anchors.

Knowledge distillation of large language model

Due to data privacy and other constraints, obtaining data from various domains to train a general graph model is challenging. Impressed by the knowledge and understanding capabilities demonstrated by large language models (LLMs), we leverage their power to generate various graph-structured data for training the general graph model. The data augmentation mechanism we design lets the LLM-generated graph data better approximate the characteristics of real-world graphs, improving the relevance and usefulness of the augmented data.

LLM-based node generation
When generating a graph, our first step is to create the node set for a specific application scenario. Each node carries a text-based feature description that facilitates the subsequent edge generation process. This task becomes particularly challenging in real-world scenarios because of the sheer size of the node set; on an e-commerce platform, for example, the graph may contain billions of products. Efficiently getting an LLM to generate a large number of nodes therefore becomes a significant challenge. To address this, we adopt a strategy of repeatedly dividing general nodes into finer-grained subcategories. For example, when generating product nodes in an e-commerce scenario, we first prompt the LLM with a query like "List the subcategories of all products on e-commerce platforms such as Taobao", and the LLM answers with subcategories such as "Clothes", "Home Kitchenware", and "Electronic Products". We then ask the LLM to further refine each subcategory, continuing this iterative splitting until we obtain nodes that resemble real-world instances, such as a product node labeled "clothing", "women's clothing", "sweater", "pocketed sweater", and "white pocketed sweater".

Prompt tree algorithm
The process of splitting nodes into subcategories and generating fine-grained entities follows a tree structure: the initial general nodes (e.g., "product", "deep learning paper") serve as roots, and fine-grained entities serve as leaf nodes. We adopt a tree-based prompting strategy to traverse and generate these nodes.

Edge generation based on LLM and Gibbs sampling
To generate edges, we run a Gibbs sampling algorithm over the node set generated above. The algorithm starts from a random sample and iterates; at each step, based on the current sample, it draws a new sample obtained by changing one data dimension. The key is to estimate the conditional probability of a given dimension changing, conditioned on the current sample. We propose to let the LLM perform this probability estimation based on the text features obtained during node generation. Since the space of possible edge sets is huge, to avoid the enormous overhead of having the LLM explore it directly, we first use the LLM to embed the node set and then compute relationships between nodes with a simple similarity operator over the representation vectors.
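A highly simplified sketch of this edge-generation loop might look like the following. It only illustrates the idea: the cosine-similarity proxy for the LLM's conditional probability, the logistic squashing, and the step count are assumptions, and `node_emb` stands for text embeddings of the generated node descriptions obtained from some LLM embedding service.

```python
import numpy as np

def edge_probability(node_emb: np.ndarray, i: int, j: int) -> float:
    """Proxy for the LLM-estimated conditional probability of edge (i, j):
    cosine similarity of the two nodes' text embeddings, squashed to (0, 1)."""
    a, b = node_emb[i], node_emb[j]
    sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return 1.0 / (1.0 + np.exp(-sim))                  # logistic squashing

def gibbs_edge_sampling(node_emb: np.ndarray, n_steps: int = 10_000,
                        seed: int = 0) -> set:
    """Gibbs-style sweep over candidate edges: start from an empty edge set
    and, at each step, resample the presence of one randomly chosen edge
    according to its estimated probability (treated as the conditional)."""
    rng = np.random.default_rng(seed)
    n = node_emb.shape[0]
    edges = set()
    for _ in range(n_steps):
        i, j = rng.choice(n, size=2, replace=False)
        key = (int(min(i, j)), int(max(i, j)))
        if rng.random() < edge_probability(node_emb, *key):
            edges.add(key)                             # keep / add the edge
        else:
            edges.discard(key)                         # drop it if present
    return edges
```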
Within this edge generation framework, we also adopt the following three techniques for adjustment.

Dynamic probability normalization
Because the similarity of LLM representations may differ greatly from the [0, 1] range, we use a dynamic probability normalization method to obtain probability values better suited to sampling. During sampling, this method maintains the most recent T' similarity estimates, computes their mean and standard deviation, and maps the current similarity estimate into the range of one standard deviation above and below the mean, yielding a probability estimate approximately in [0, 1].

Introducing node locality
The LLM-based edge generation method can effectively determine potential connections from the semantic similarity of nodes, but it tends to create too many connections among all semantically related nodes, ignoring the important notion of locality in real-world graphs, where a node typically interacts with only a limited subset of related nodes. To model this property, we introduce locality into edge generation: each node is randomly assigned a locality index, and the interaction probability between two nodes is attenuated according to the absolute difference of their locality indices; the larger the difference, the stronger the attenuation.

Injecting graph topology patterns
To make the generated graph data conform better to topological structure patterns, we run the graph generation process once more with revised node representations. These representations are obtained on the initially generated graph using a simple graph convolutional network, so they better match the distribution characteristics of graph-structured data and avoid the distribution shift between graph space and text space. Finally, based on the revised node representations, we sample the graph again to obtain the final graph-structured data.
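The first two adjustments are simple enough to sketch directly. The snippet below reflects one reading of the description; the window size T', the decay coefficient, and the clipping to [0, 1] are assumed values rather than the paper's exact settings.

```python
from collections import deque
import numpy as np

class DynamicProbNormalizer:
    """Keeps the most recent T' raw similarity estimates and rescales the
    current one into roughly [0, 1] using the running mean +/- one std."""
    def __init__(self, window: int = 1000):
        self.recent = deque(maxlen=window)

    def __call__(self, sim: float) -> float:
        self.recent.append(sim)
        mu = float(np.mean(self.recent))
        sigma = float(np.std(self.recent)) + 1e-12
        # mean - sigma maps to 0, mean + sigma maps to 1, rest is clipped
        return float(np.clip((sim - mu) / (2.0 * sigma) + 0.5, 0.0, 1.0))

def locality_decay(prob: float, loc_i: int, loc_j: int, alpha: float = 0.1) -> float:
    """Attenuate an edge probability by the gap between the two nodes'
    randomly assigned locality indices; larger gaps decay more strongly."""
    return prob * float(np.exp(-alpha * abs(loc_i - loc_j)))
```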
Experimental verification

In the experiments, OpenGraph is trained only on datasets generated with the LLM, while the test datasets are real datasets from various application scenarios, covering two types of tasks: node classification and link prediction. The specific experimental settings are as follows.

Zero-shot settings
To verify OpenGraph's zero-shot prediction ability, OpenGraph is trained on the generated training datasets and then evaluated on completely different real test datasets. The training and test datasets have no overlap in nodes, edges, features, or labels.

Few-shot settings
Since most existing methods cannot perform effective zero-shot prediction, we evaluate them under few-shot prediction: baseline methods may be pre-trained on the pre-training data and then trained, fine-tuned, or prompt-tuned on k-shot samples.

Overall effect comparison

The results on the 2 tasks and 8 test datasets show the following. 1) Across datasets, OpenGraph's zero-shot prediction holds a clear advantage over existing methods. 2) Under cross-dataset transfer, existing pre-training methods sometimes perform even worse than their base models trained from scratch on only a few samples, which reflects how difficult it is for graph models to generalize across datasets.

Study on graph tokenizer

Next we explore how the design of the graph tokenizer affects performance. First, we vary the smoothing order of the adjacency matrix and test its impact: performance degrades severely at order 0, indicating the importance of high-order smoothing. Second, we replace the topology-aware mapping function with simpler alternatives, including learnable one-hot id representations shared across datasets, random mapping, and learnable representations based on node degree. The results show that all three alternatives are weaker: cross-dataset learnable id representations perform worst, the degree-based representations commonly used in existing work also degrade considerably, and the non-learnable random mapping performs best among the alternatives but still falls far short of our topology-aware mapping.

Study on pre-training datasets

To verify the effectiveness of the LLM-based knowledge distillation method, we train OpenGraph on different pre-training datasets and test it on different test sets. The pre-training datasets compared include versions of our generation method with individual techniques removed, two real datasets unrelated to the test data (Yelp2018 and Gowalla), and the ML-10M dataset, which is related to the test data. The results show: 1) Overall, our generated dataset produces good results on all test data. 2) Each of the three generation techniques tested brings a significant improvement. 3) Training on unrelated real datasets (Yelp, Gowalla) can even hurt performance, possibly due to distribution differences between different real datasets. 4) ML-10M achieves the best results on both ML-1M and ML-10M, showing that training on a similar dataset yields better results.

Research on sampling techniques in the Transformer

This experiment ablates the token sequence sampling (Seq) and anchor sampling (Anc) in our graph Transformer module. The results show that both sampling methods reduce the space and time overhead of the model during training and testing. In terms of accuracy, token sequence sampling has a positive effect, while the results on the ddi dataset show that anchor sampling has a negative effect on model performance.

Conclusion

The main focus of this study is to develop a highly adaptable framework that can accurately capture and understand the complex topological patterns of diverse graph structures. By leveraging the potential of the proposed model, we aim to significantly improve the model's generalization ability in zero-shot graph learning tasks across various downstream applications. To further improve the efficiency and robustness of OpenGraph, the model is built on a scalable graph Transformer architecture and an LLM-based data augmentation mechanism. Extensive experiments on multiple benchmark datasets verify the model's outstanding generalization capabilities.
This study makes a preliminary exploration in the direction of graph foundation models. In future work, we plan to empower the framework to automatically discover noisy connections and influential structures using counterfactual learning, while learning structural patterns that are general and transferable across a variety of graphs.