Figure 4: Examples of simulated data in four scenarios.
Working Mode. The working mode affects the agent's performance, especially in complex GUI environments. The level of environmental awareness is the bottleneck: it determines whether the agent can capture the effective actions and sets the upper limit of action prediction. The research team implemented three working modes with different levels of environmental awareness, namely implicit perception, partial perception, and optimal perception. (1) Implicit perception places the requirement on the agent directly: the input is only the instruction and the screen, with no assistance in environmental perception (Direct prompt). (2) Partial perception prompts the agent to analyze the environment first, in a chain-of-thought style: the agent first reads the screenshot to extract the possible actions, and then predicts the next action based on the goal (CoT prompt). (3) Optimal perception directly provides the screen's action space to the agent (w/ Action annotation). Essentially, the different working modes introduce two changes: information about potential actions is exposed to the agent, and information is merged from the visual channel into the text channel.
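To make the three modes concrete, the following is a minimal sketch of how such prompts could be assembled. The function names, message format, and template wording are illustrative assumptions, not the exact prompts used in the paper.

# Minimal sketch of the three working modes as prompt-construction strategies.
# Function names, the message format, and the template wording are assumptions
# for illustration; they are not the paper's exact prompts.

def implicit_prompt(goal, screenshot):
    # Direct prompt: only the instruction and the raw screenshot are given.
    return [{"type": "text", "text": f"Goal: {goal}\nPredict the next GUI action."},
            {"type": "image", "image": screenshot}]

def cot_prompt(goal, screenshot):
    # CoT prompt: the agent first enumerates possible actions, then decides.
    text = (f"Goal: {goal}\n"
            "Step 1: List the actions available on this screen.\n"
            "Step 2: Based on the goal, choose the next action.")
    return [{"type": "text", "text": text},
            {"type": "image", "image": screenshot}]

def annotated_prompt(goal, screenshot, action_space):
    # Action-annotation prompt: the screen's action space is provided as text.
    actions = "\n".join(f"- {a}" for a in action_space)
    text = f"Goal: {goal}\nAvailable actions:\n{actions}\nChoose the next action."
    return [{"type": "text", "text": text},
            {"type": "image", "image": screenshot}]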
Experiment and Analysis. The research team conducted experiments with 10 well-known multimodal large models on the 1,189 simulated samples constructed above. For systematic analysis, two types of models were selected as GUI agents: (1) general models, including powerful API-based black-box models (GPT-4v, GPT-4o, GLM-4v, Qwen-VL-plus, Claude-Sonnet-3.5) and open-source models (Qwen-VL-chat, MiniCPM-Llama3-v2.5, LLaVa-v1.6-34B); (2) GUI expert models that have been pre-trained or instruction-fine-tuned for GUIs, including CogAgent-chat and SeeClick. The research team used three action-level metrics, corresponding respectively to how often the model's predicted action matches the gold (best) action, a distracted action, or an invalid action. The team summarized the experimental findings as answers to three questions:
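As a rough illustration of how these three metrics could be computed, here is a minimal sketch; the class and field names are assumptions, and counting every prediction that is neither the gold action nor a distractor as invalid is a simplification rather than the paper's exact protocol.

# Rough sketch of the three action-level metrics described above.
# Names are illustrative; treating every prediction that matches neither the
# gold action nor a distractor as "invalid" is a simplifying assumption.
from dataclasses import dataclass, field

@dataclass
class Sample:
    predicted: str          # action predicted by the agent
    gold: str               # annotated best action for the user's goal
    distractors: set = field(default_factory=set)  # actions that follow the distraction

def score(samples):
    n = len(samples)
    gold = sum(s.predicted == s.gold for s in samples)
    distracted = sum(s.predicted in s.distractors for s in samples)
    invalid = n - gold - distracted
    return {"gold_acc": gold / n,
            "distract_rate": distracted / n,
            "invalid_rate": invalid / n}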
Will a multimodal environment interfere with the goals of the GUI agent? In risky environments, multimodal agents are susceptible to distraction, which can cause them to abandon the goal and act unfaithfully. In each of the four scenarios, the models produced behavior that deviated from the original goal, which reduced action accuracy. The strong API models (9.09% for GPT-4o) and the expert models (6.84% for SeeClick) are more faithful than the general open-source models.
What is the relationship between fidelity and helpfulness? There are two situations. First, powerful models can provide correct actions while remaining faithful (GPT-4o, GPT-4v, and Claude): they exhibit low rates of distracted actions, together with relatively high rates of correct actions and low rates of invalid actions. However, stronger perception with weaker fidelity leads to greater susceptibility to distraction and reduced helpfulness; for example, GLM-4v exhibits a higher distraction rate and much lower helpfulness than the open-source models. Therefore, fidelity and helpfulness are not mutually exclusive but can be improved simultaneously, and to match the capability of a powerful model, enhancing fidelity is all the more important.
Can assisted multimodal environmental awareness help mitigate infidelity? Through the different working modes, visual information is merged into the text channel to enhance environmental awareness. However, the results show that GUI-aware text enhancement can actually increase distraction, and the increase in distracted actions can even outweigh its benefits. CoT mode acts as self-guided text enhancement that significantly reduces the perceptual burden, but it also increases distraction. Therefore, even when perception, the performance bottleneck, is enhanced, the vulnerability in fidelity persists and even becomes riskier. Information fusion across the textual and visual modalities, such as OCR, must therefore be handled more carefully.
Figure 5: Environmental distraction test results.

In the comparison of models, the research team found that the API-based models outperform the open-source models in both fidelity and helpfulness. Pre-training for GUIs can greatly improve the fidelity and helpfulness of expert agents, but it may introduce shortcuts that lead to failure. In the comparison of working modes, the team further noted that even with "perfect" perception (action annotation), the agent is still susceptible to distraction. CoT prompting is not a complete defense, but its self-guided, step-by-step process shows potential for mitigation.

Finally, building on the above findings, the research team considered an extreme case with an adversarial role and demonstrated a feasible active attack, called environment injection. Consider an attack scenario in which the attacker needs to change the GUI environment to mislead the model. The attacker can eavesdrop on the user's messages to obtain the goal, and can compromise related data to change the environmental information; for example, the attacker can intercept packets from the host and change the content of a website. The setting of environment injection differs from the previous one: the earlier sections examined the common problem of imperfect, noisy, or flawed environments, whereas here the attacker deliberately induces distraction by crafting unusual or malicious content. The research team verified this on the pop-up scenario and proposed a simple and effective method that rewrites the two buttons (a minimal sketch of this rewriting appears at the end of this article). (1) The button that accepts the pop-up is rewritten to be ambiguous, so that it looks reasonable for both the distractor and the real goal; it describes an operation common to both purposes. Although the content of the pop-up provides context and indicates the button's true function, models often ignore the meaning of that context. (2) The button that rejects the pop-up is rewritten as an emotional expression. Such guiding emotion can influence or even manipulate user decisions; the phenomenon is common when uninstalling a program, for example "Brutal Leave". These rewrites reduce the fidelity of GLM-4v and GPT-4o and significantly increase the distraction score compared with the baseline. GLM-4v is more susceptible to emotional expressions, while GPT-4o is more susceptible to the ambiguous acceptance misguidance.

Figure 6: Experimental results of malicious environment injection.

Summary. This article studies the fidelity of multimodal GUI agents and reveals the influence of environmental distraction. The research team proposed a new research question, the environmental distraction of agents, and a new research scenario in which both the user and the agent are benign and the environment is not malicious but contains content that can distract attention. The team simulated distractions in four scenarios, implemented three working modes with different perception levels, and evaluated a wide range of general models and GUI expert models. Experimental results show that susceptibility to distraction significantly reduces both fidelity and helpfulness, and that protection cannot be achieved through enhanced perception alone. In addition, the research team proposed an attack method called environment injection, which exploits infidelity for malicious purposes by changing the distracting content to include ambiguous or emotionally misleading wording. More importantly, this paper calls for greater attention to the fidelity of multimodal agents.
The research team recommends that future work include pre-training for fidelity, considering correlations between environmental context and user instructions, predicting possible consequences of performing actions, and introducing human-computer interaction when necessary.
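As referenced above, the following is a minimal sketch of the pop-up button rewriting used in environment injection, assuming the attacker can edit the button labels before the screen is rendered; the field names and replacement strings are illustrative, not the paper's exact wording.

# Minimal sketch of the environment-injection rewrite for the pop-up scenario.
# It assumes the attacker can edit the button labels before the screen is shown;
# the field names and replacement strings are illustrative, not the paper's wording.

def inject_popup(popup):
    attacked = dict(popup)
    # (1) Accept button -> ambiguous label that looks reasonable for both the
    #     distractor goal and the user's real goal; only the pop-up context,
    #     which models tend to ignore, reveals its true function.
    attacked["accept_button"] = "Continue"
    # (2) Reject button -> emotionally loaded label that discourages clicking,
    #     similar to the "Brutal Leave" pattern seen when uninstalling programs.
    attacked["reject_button"] = "Cruelly give up"
    return attacked

popup = {"context": "Subscribe to our premium newsletter?",
         "accept_button": "Subscribe",
         "reject_button": "No thanks"}
print(inject_popup(popup))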