
Harbin Institute of Technology proposes an innovative iterative reasoning framework DPE-MNER: giving full play to the potential of multi-modal representation

WBOY | Original: 2024-07-03 10:44:16 | 544 views

The AIxiv column is a column where this site publishes academic and technical content. In the past few years, the AIxiv column of this site has received more than 2,000 reports, covering top laboratories from major universities and companies around the world, effectively promoting academic exchanges and dissemination. If you have excellent work that you want to share, please feel free to contribute or contact us for reporting. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

The authors of this article are from the Social Computing and Information Retrieval Research Center of Harbin Institute of Technology: Zheng Zihao, Zhang Zihan, Wang Zexin, Fu Ruiji, Liu Ming, Wang Zhongyuan, and Qin Bing.

Multimodal representation

Multimodal named entity recognition (MNER), a basic and key task in building multimodal knowledge graphs, requires integrating information from multiple modalities to accurately extract named entities from text. Although previous research has explored ways of integrating multi-modal representations at different levels, these methods still fall short in fusing the representations to provide rich contextual information and thereby improve MNER performance.

In this paper, the research team proposes DPE-MNER, an iterative reasoning framework that follows a "decompose, prioritize, eliminate" strategy and dynamically integrates diverse multi-modal representations. The framework decomposes the fusion of multimodal representations into hierarchical, interconnected fusion layers, which simplifies processing. When integrating multimodal information, the team emphasizes progressive transitions from "simple to complex" and from "macro to micro." In addition, by explicitly modeling cross-modal correlations, the framework excludes irrelevant information that may mislead MNER predictions. Extensive experiments on two public datasets show that the method improves the accuracy and efficiency of multi-modal named entity recognition. The paper is one of ten best-paper candidates among the 1,558 papers accepted to LREC-COLING 2024.


Paper link: https://www.php.cn/link/4b4984066015df12cfc4e8f6d60b7147

Motivation

Figure: An example of multi-modal named entity recognition. The research team shows a variety of multi-modal representations that may be useful for named entity recognition decisions. Humans typically process this information mentally, in an iterative manner.

To address this problem, the research team drew inspiration from the field of Complex Problem Solving (Sternberg and Frensch, 1992), which studies the methods and strategies that humans and computers use to solve problems involving multiple variables, uncertainty, and high complexity. First, when faced with complex problems, humans generally adopt an iterative approach. As the figure illustrates, people in fact work through an MNER example iteratively. Second, humans use specific strategies to simplify such problems: decomposing them, prioritizing, and eliminating irrelevant factors.

The research team argues that MNER is well suited to being treated as an iterative process of integrating multi-modal information, guided by these strategies. Compared with single-step methods, multi-step methods can exploit diverse multi-modal representations more fully while iteratively refining the named entity recognition (NER) results.

In addition, these three strategies are very suitable for the integration of multiple representations in multi-modal NER:

The decomposition strategy suggests splitting the fusion of multi-modal representations into smaller, more tractable units, which makes it possible to explore multimodal interactions at different levels of granularity.
The prioritization strategy recommends integrating multi-modal information in "easy to difficult" and "coarse to fine" order. This progressive integration refines MNER predictions step by step and lets the model gradually shift attention from simple but coarse information to complex but precise details.
The irrelevance elimination strategy suggests explicitly screening out irrelevant information in the different multi-modal representations, since such information can hurt MNER performance.

Method

The research team designed an iterative multi-modal entity extraction framework that dynamically fuses multiple multi-modal features, which includes an iterative process and a prediction network.


Iterative modeling of MNER

Following prior work that uses diffusion models to cast object detection, visual grounding, and text entity extraction as iterative denoising processes (references [4] to [6]), the research team also uses a diffusion model to treat multi-modal entity extraction as an iterative process. The model first randomly initializes a set of entity intervals (spans). A prediction network then encodes the multi-modal features and denoises these intervals step by step until the correct entity intervals in the text are obtained.
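The loop described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the prediction network (it looks up the nearest interval in a fixed candidate list rather than predicting from fused multi-modal features), and the linear `trust` schedule replaces the actual diffusion sampling rule.

```python
import random


def toy_denoiser(span, candidates):
    """Hypothetical stand-in for the prediction network.

    Returns the candidate interval closest to the noisy one. A real model
    would predict this from the fused multi-modal features.
    """
    return min(candidates, key=lambda c: abs(c[0] - span[0]) + abs(c[1] - span[1]))


def iterative_denoise(num_tokens, candidates, num_spans=4, steps=5, seed=0):
    rng = random.Random(seed)
    # Start from random (start, end) intervals, i.e. pure noise.
    spans = []
    for _ in range(num_spans):
        a = rng.uniform(0, num_tokens - 1)
        b = rng.uniform(0, num_tokens - 1)
        spans.append((min(a, b), max(a, b)))
    for step in range(1, steps + 1):
        # Assumed schedule: trust the prediction more as the steps progress.
        trust = step / steps
        refined = []
        for s in spans:
            p = toy_denoiser(s, candidates)
            refined.append((s[0] + trust * (p[0] - s[0]),
                            s[1] + trust * (p[1] - s[1])))
        spans = refined
    # Round to token indices and drop duplicate intervals.
    return sorted({(round(s), round(e)) for s, e in spans})


if __name__ == "__main__":
    # Two made-up entity intervals in an 8-token sentence.
    print(iterative_denoise(8, [(0, 1), (5, 6)]))
```

Because the last step fully trusts the prediction, every returned interval is one of the candidates. In the framework described here, that role is played by the trained prediction network.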

Prediction Network

As shown in the figure, the research team obtains text representations at three granularities, and image representations at two granularities and two difficulty levels (aligned representations are treated as easy, unaligned ones as difficult). The team's prediction network, AMRN, consists of an encoding network (DMMF) and a decoding network (MER). Its design follows the three strategies described earlier.

As shown in the figure, the encoding network is a hierarchical fusion network that decomposes the fusion of multiple multi-modal features into a hierarchy. Working bottom-up, it first integrates image features of the same granularity but different difficulty into the text features $x_i$ of each granularity, then integrates the image features $Y$ of different granularities into the text features of each granularity, and finally combines the resulting features across granularities. In this way the image features $Y$ and the text features $X$ are fused into the final multi-modal representation. This representation is fed to the decoding network, which outputs new intervals and the entity type of each interval.

Bottom-level fusion. This level integrates image features of one granularity into text features of one granularity. The diffusion process provides a scheduler that reflects the state of the current iteration, which is the key to introducing prioritization. Guided by this scheduler, image features of different difficulty are fused together, and the correlation (rel) between the text features and the fused image features is computed and used to eliminate irrelevant information. Finally, based on this correlation, a bottleneck transformer fuses the text and image features to obtain a multi-modal image-text representation at that granularity.
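A minimal sketch of this bottom-level step, assuming plain Python lists as feature vectors. Everything here is illustrative rather than taken from the paper: the linear schedule in `scheduler_weight` is an assumed form, `relevance` uses a rescaled cosine similarity in place of the learned correlation, and a gated addition stands in for the bottleneck transformer.

```python
import math


def scheduler_weight(step, total_steps):
    """Assumed schedule: the weight on easy (aligned) features decays from 1 to 0."""
    if total_steps <= 1:
        return 1.0
    return 1.0 - step / (total_steps - 1)


def mix_by_difficulty(easy, hard, step, total_steps):
    """Blend easy and hard image features according to the current step."""
    w = scheduler_weight(step, total_steps)
    return [w * e + (1.0 - w) * h for e, h in zip(easy, hard)]


def relevance(text_vec, image_vec):
    """Cosine similarity rescaled to [0, 1]; a stand-in for the learned correlation."""
    dot = sum(t * v for t, v in zip(text_vec, image_vec))
    norm = math.sqrt(sum(t * t for t in text_vec)) * math.sqrt(sum(v * v for v in image_vec))
    return 0.0 if norm == 0 else 0.5 * (1.0 + dot / norm)


def fuse(text_vec, easy, hard, step, total_steps):
    """Gated addition (not a bottleneck transformer): low relevance suppresses the image."""
    image_vec = mix_by_difficulty(easy, hard, step, total_steps)
    rel = relevance(text_vec, image_vec)
    return [t + rel * v for t, v in zip(text_vec, image_vec)]


if __name__ == "__main__":
    text = [1.0, 0.0]
    easy, hard = [1.0, 0.0], [-1.0, 0.0]
    for step in range(3):
        print(step, fuse(text, easy, hard, step, 3))
```

Early steps lean on the easy features, while later steps lean on the hard ones. When the mixed image feature points away from the text feature, the relevance gate falls to zero and the image contributes nothing, which is the intent of the elimination strategy.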

Mid-level fusion. This level fuses image features of different granularities into text features of one granularity. A scheduler is again used to fuse the image features of different granularities dynamically, yielding a multi-modal text representation at that granularity.

Top-level fusion. This level fuses the multi-modal text representations of different granularities into the interval representation, yielding an overall multi-modal text representation that is fed to the decoding network for prediction.

Experimental results

Main experiment

The author team compared DPE-MNER with several typical MNER methods. The results show that it achieves the best performance on two commonly used datasets.

Ablation experiment

The researchers removed the prioritization, hierarchical, and elimination designs in turn to observe the effect on model performance. The results show that removing any one of them degrades performance.

Comparison with static feature fusion methods

They also compared the framework with typical static multi-modal fusion methods, such as max pooling, average pooling, MLP-based fusion, and MoE-based fusion. The results show that the proposed dynamic fusion framework achieves the best performance.
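To make the contrast concrete, the sketch below shows what max pooling and average pooling over a set of feature vectors compute, next to a weighted fusion whose weights can change from one iteration to the next. It is an illustration only: the function names and the toy vectors are assumptions, the MLP-based and MoE-based baselines are omitted, and in the actual framework the weights would come from the scheduler rather than being passed in by hand.

```python
def max_pool(features):
    """Static fusion: element-wise maximum over all feature vectors."""
    return [max(col) for col in zip(*features)]


def avg_pool(features):
    """Static fusion: element-wise mean over all feature vectors."""
    return [sum(col) / len(features) for col in zip(*features)]


def dynamic_fuse(features, weights):
    """Weighted mean; the weights may differ at every iteration step."""
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, col)) / total
            for col in zip(*features)]


if __name__ == "__main__":
    feats = [[1.0, 4.0], [3.0, 2.0]]           # two made-up feature vectors
    print(max_pool(feats))                      # same result at every step
    print(avg_pool(feats))                      # same result at every step
    print(dynamic_fuse(feats, [1.0, 0.0]))      # e.g. an early step
    print(dynamic_fuse(feats, [0.0, 1.0]))      # e.g. a late step
```

The static variants return the same vector no matter how far the iteration has progressed, whereas the weighted variant can emphasize different representations at different steps.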

Typical sample analysis

The research team selected two representative samples to illustrate the iterative process. In the first iteration step, the entity types of "time square" and "cubes" were predicted incorrectly. In later steps, guided by important visual cues in the picture, the predictions were corrected to the right entity types.

Conclusion

This paper aims to make full use of the various multi-modal representations available in multi-modal named entity recognition (MNER) in order to improve recognition results. To this end, the authors propose DPE-MNER, an iterative reasoning framework. DPE-MNER simplifies the integration of these rich and diverse multi-modal representations by decomposing the MNER task into multiple stages. During the iterative process, the representations are fused dynamically according to the strategy of "decomposition, prioritization, and elimination." Experiments on two public datasets, together with ablation studies and comparisons against static fusion methods, demonstrate the effectiveness of the DPE-MNER framework.

References:

[1] Knowledge Graphs Meet Multi-Modal Learning: Comprehensive Survey, arXiv

[2] Decompose, Prioritize, and Eliminate: Dynamically Integrating Diverse Representations for Multi-modal Named Entity Recognition, Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), 2024

[3] Complex Problem Solving: Principles and Mechanisms, American Journal of Psychology, 1992

[4] DiffusionNER: Boundary Diffusion for Named Entity Recognition, ACL 2023

[5] DiffusionDet: Diffusion Model for Object Detection, ICCV 2023

[6] Language-Guided Diffusion Model for Visual Grounding, arXiv, 2023
