A systematic review of deep reinforcement learning pre-training: online and offline research in one article

In recent years, reinforcement learning (RL) has developed rapidly, driven by deep learning. Breakthroughs in fields ranging from games to robotics have stimulated interest in designing complex, large-scale RL algorithms and systems. However, existing RL research generally has agents learn from scratch when they face a new task, which makes it difficult to use previously acquired knowledge to assist decision-making and results in high computational overhead.

In supervised learning, the pre-training paradigm has been established as an effective way to obtain transferable prior knowledge: by pre-training on large-scale datasets, a network can quickly adapt to different downstream tasks. Similar ideas have been tried in RL, especially in recent research on "generalist" agents [1, 2], which raises the question of whether a universal pre-trained model like GPT-3 [3] could also emerge in the RL field.

However, applying pre-training in RL faces many challenges, such as the significant differences between upstream and downstream tasks, how to obtain and use pre-training data efficiently, and how to transfer prior knowledge effectively. These issues hinder the successful application of the pre-training paradigm in RL. At the same time, previous studies differ greatly in the experimental settings and methods they consider, which makes it difficult for researchers to design appropriate pre-training models for real-world scenarios.

To sort out how pre-training has developed in RL and where it may go next, researchers from Shanghai Jiao Tong University and Tencent wrote a review that discusses existing RL pre-training methods, grouped by setting, along with the problems that remain to be solved.

Paper address: https://arxiv.org/pdf/2211.03959.pdf

RL Pre-training Introduction

Reinforcement learning (RL) provides a general mathematical form for sequential decision-making. Using RL algorithms and deep neural networks, agents that learn in a data-driven manner to optimize a specified reward function have achieved superhuman performance in a variety of applications across different fields. However, although RL has proven effective at solving specified tasks, sample efficiency and generalization ability remain two major obstacles to applying RL in the real world. A standard paradigm in RL research is for an agent to learn from experience collected by itself or by others, optimizing a neural network from random initialization for a single task. For humans, in contrast, prior knowledge of the world greatly aids decision-making. If a task is related to previously seen tasks, humans tend to reuse what they have already learned and adapt quickly to the new task without learning from scratch. Compared with humans, RL agents therefore suffer from low data efficiency and are prone to overfitting.
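
As a hedged illustration of the "general mathematical form" mentioned above, the usual formulation is a Markov decision process with a discounted-return objective. The notation below (state space, action space, transition function P, reward r, discount factor γ, policy parameters θ) follows common textbook convention and is an assumption here, not notation quoted from the review.

```latex
% Conventional MDP notation (assumed; not taken from the source article)
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma), \qquad
J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right], \qquad
\theta^{*} = \arg\max_{\theta} J(\pi_\theta)
```

In this notation, learning from scratch corresponds to starting θ from a random initialization for every new task, whereas pre-training aims to start from parameters or priors that already encode useful knowledge.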

Recent advances in other areas of machine learning, by contrast, actively advocate leveraging prior knowledge built through large-scale pre-training. By training at scale on a wide range of data, large foundation models can be adapted quickly to a variety of downstream tasks. This pretraining-finetuning paradigm has proven effective in fields such as computer vision and natural language processing, but it has not yet had a comparable impact on RL. Although the approach is promising, designing principles for large-scale RL pre-training faces many challenges: 1) the diversity of domains and tasks; 2) limited data sources; 3) the difficulty of adapting rapidly to downstream tasks. These factors arise from the inherent characteristics of RL and require special consideration by researchers.

Pre-training has great potential for RL, and this review can serve as a starting point for those interested in the direction. In the paper, the researchers attempt a systematic review of existing work on pre-training for deep reinforcement learning.

In recent years, deep reinforcement learning pre-training has seen several breakthroughs. First, pre-training based on expert demonstrations, which uses supervised learning to predict the actions taken by experts, was used in AlphaGo. In pursuit of large-scale pre-training with less supervision, the field of unsupervised RL has grown rapidly; it allows agents to learn from interaction with the environment without reward signals. In addition, the rapid development of offline reinforcement learning (offline RL) has prompted researchers to consider how unlabeled and sub-optimal offline data can be used for pre-training. Finally, offline training methods based on multi-task and multi-modal data further pave the way for a general pre-training paradigm.
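
To make the idea of demonstration-based pre-training concrete, here is a minimal sketch of behavior cloning in a tabular setting: the policy is fit by maximum likelihood to expert (state, action) pairs, which for a table reduces to normalized counts. The function names and the toy demonstration data are hypothetical; real systems such as AlphaGo use deep networks rather than a lookup table.

```python
from collections import Counter, defaultdict


def behavior_cloning(demonstrations):
    """Fit a tabular policy pi(a | s) by maximum likelihood on expert pairs.

    For a tabular policy, minimizing cross-entropy against the expert's
    actions is equivalent to normalizing per-state action counts.
    """
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1

    policy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[state] = {a: c / total for a, c in action_counts.items()}
    return policy


def greedy_action(policy, state):
    """Return the action the cloned policy considers most likely."""
    return max(policy[state], key=policy[state].get)


# Hypothetical expert demonstrations: (state, action) pairs.
demos = [("s0", "right"), ("s0", "right"), ("s0", "left"), ("s1", "up")]
bc_policy = behavior_cloning(demos)
```

The resulting policy can then serve as an initialization that a standard RL algorithm refines with reward feedback.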

Online pre-training

In the past, the success of RL depended on dense, well-designed reward functions. Traditional RL paradigms, which have made great progress in many fields, face two key challenges when scaled up to large-scale pre-training. First, RL agents overfit easily, and agents pre-trained with complex task rewards struggle to perform well on tasks they have never seen. Second, designing reward functions is usually expensive and requires considerable expert knowledge, which is a major obstacle in practice.

Online pre-training without reward signals may offer a viable way to learn general prior knowledge, using supervision signals that require no human involvement. Online pre-training aims to acquire prior knowledge through interaction with the environment, without human supervision. In the pre-training phase, the agent is allowed to interact with the environment for a long time but cannot receive extrinsic rewards. This setting, also known as unsupervised RL, has been actively studied in recent years.

To motivate agents to acquire prior knowledge from the environment without any supervision signal, a well-established approach is to design intrinsic rewards that encourage the agent to collect diverse experiences or master transferable skills. Previous research has shown that agents can adapt quickly to downstream tasks after online pre-training with intrinsic rewards and standard RL algorithms.
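
One simple way to encourage diverse experience is a count-based novelty bonus, sketched below under the assumption of discrete, hashable states. The class name `CountBasedBonus` and the 1/sqrt(N(s)) form are illustrative choices, one of several intrinsic-reward designs, and are not claimed to be the specific method surveyed in the paper.

```python
import math
from collections import Counter


class CountBasedBonus:
    """Intrinsic reward that shrinks as a state is visited more often.

    Assumes discrete, hashable states. The bonus is scale / sqrt(N(s)),
    where N(s) is the visit count including the current visit.
    """

    def __init__(self, scale=1.0):
        self.scale = scale
        self.visits = Counter()

    def reward(self, state):
        self.visits[state] += 1
        return self.scale / math.sqrt(self.visits[state])


# Hypothetical reward-free trajectory: state "a" is revisited, "b" is new.
bonus = CountBasedBonus()
trajectory = ["a", "a", "b", "a"]
intrinsic_rewards = [bonus.reward(s) for s in trajectory]
```

During pre-training, a standard RL algorithm would maximize this bonus in place of the missing extrinsic reward, so rarely visited states become more attractive than familiar ones.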

Offline pre-training

Although online pre-training can achieve good results without human supervision, it remains limited for large-scale applications. After all, online interaction is somewhat at odds with the need to train on large and diverse datasets. To solve this problem, researchers often decouple data collection from pre-training and pre-train directly on historical data collected by other agents or by humans.

A feasible solution is offline reinforcement learning, whose purpose is to obtain a reward-maximizing RL policy from offline data. A fundamental challenge is distribution shift, that is, the difference between the distribution of the training data and that of the data seen at test time. Existing offline RL methods focus on addressing this challenge when function approximation is used. For example, policy constraint methods explicitly require the learned policy to avoid actions not seen in the dataset, and value regularization methods alleviate overestimation of the value function by fitting it to some form of lower bound. However, whether policies trained offline can generalize to new environments not covered by the offline dataset remains underexplored.
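
The policy-constraint idea can be illustrated with a small tabular sketch: Q-learning over a fixed dataset in which the bootstrap maximum only ranges over actions that actually appear in the data for the next state, so unseen actions never receive a value. The function name, the hyperparameters, and the toy dataset are assumptions for illustration; published methods apply this idea with function approximation and are considerably more involved.

```python
from collections import defaultdict


def constrained_q_learning(dataset, gamma=0.9, lr=0.5, sweeps=200):
    """Tabular offline Q-learning restricted to in-dataset actions.

    dataset: list of (state, action, reward, next_state, done) tuples.
    The max over next actions is taken only over actions observed in the
    dataset for that state, which avoids bootstrapping from unseen actions.
    """
    seen = defaultdict(set)
    for s, a, _, _, _ in dataset:
        seen[s].add(a)

    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next, done in dataset:
            next_actions = seen.get(s_next)  # None if s_next has no data
            if done or not next_actions:
                target = r
            else:
                target = r + gamma * max(q[(s_next, a2)] for a2 in next_actions)
            q[(s, a)] += lr * (target - q[(s, a)])

    # The greedy policy is also restricted to actions seen in the data.
    policy = {s: max(acts, key=lambda a: q[(s, a)]) for s, acts in seen.items()}
    return q, policy


# Hypothetical offline dataset: going right twice yields reward 1.
offline_data = [
    ("s0", "right", 0.0, "s1", False),
    ("s1", "right", 1.0, "end", True),
    ("s0", "left", 0.2, "end", True),
]
q_values, offline_policy = constrained_q_learning(offline_data)
```

In this toy dataset the action "left" is never observed in state "s1", so it never enters the value table and cannot be selected by the learned policy.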

Alternatively, we can avoid learning an RL policy directly and instead use offline data to learn prior knowledge that improves the convergence speed or final performance on downstream tasks. More interestingly, if a model can leverage offline data without human supervision, it has the potential to benefit from massive amounts of data. The paper refers to this setting as offline pre-training, in which the agent extracts important information (such as good representations and behavioral priors) from offline data.

Towards a general agent

Pre-training methods for a single environment and a single modality mainly fall under the online and offline pre-training settings described above. Recently, researchers have become increasingly interested in building a single general decision-making model (e.g., Gato [1] and Multi-game DT [2]), so that the same model can handle tasks of different modalities in different environments. To enable agents to learn from and adapt to a variety of open-ended tasks, this line of research aims to leverage large amounts of prior knowledge in different forms, such as visual perception and language understanding. More importantly, if researchers can build a bridge between RL and other areas of machine learning and combine their successful experience, they may be able to build a general agent model that can complete a wide variety of tasks.

This article is reproduced from 51CTO.COM.