
PyTorch Code Implementation and Step-by-Step Explanation of DDPG Reinforcement Learning

WBOY | Reposted: 2023-04-13 09:10:07 | 1994 views

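DDPG (Deep Deterministic Policy Gradient) is a model-free, off-policy deep reinforcement learning algorithm inspired by Deep Q-Network. It is based on the Actor-Critic approach and uses policy gradients.

The implementation and explanation of DDPG covers the following components: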

Replay Buffer
Actor-Critic Neural Network
Exploration Noise
Target network
Soft Target Updates for Target Network

We implement them below, one step at a time:

Replay Buffer

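DDPG uses a replay buffer to store the transitions and rewards (Sₜ, aₜ, Rₜ, Sₜ₊₁) sampled while exploring the environment. The replay buffer plays an important role in speeding up the agent's learning and in making DDPG more stable:

Minimizes correlation between samples: past experience is stored in the replay buffer, so the agent learns from a varied mix of experiences rather than from consecutive ones.
Enables off-policy learning: the agent samples transitions from the replay buffer instead of sampling them from its current policy.
Efficient sampling: past experience is kept in the buffer, so the agent can learn from the same experiences multiple times.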

import numpy as np


class Replay_buffer():
    '''
    Code based on:
    https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py
    Expects tuples of (state, next_state, action, reward, done)
    '''
    # The default size matches `capacity` in the hyperparameters further down;
    # a literal is used because `capacity` is not defined yet at this point.
    def __init__(self, max_size=1000000):
        """Create Replay buffer.
        Parameters
        ----------
        max_size: int
            Max number of transitions to store in the buffer. When the buffer
            overflows the old memories are dropped.
        """
        self.storage = []
        self.max_size = max_size
        self.ptr = 0

    def push(self, data):
        if len(self.storage) == self.max_size:
            # Buffer is full: overwrite the oldest transition (ring buffer)
            self.storage[int(self.ptr)] = data
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(data)

    def sample(self, batch_size):
        """Sample a batch of experiences (uniformly, with replacement).
        Parameters
        ----------
        batch_size: int
            How many transitions to sample.
        Returns
        -------
        state: np.array
            batch of states or observations
        next_state: np.array
            next states or observations seen after executing action
        action: np.array
            batch of actions executed given a state
        reward: np.array
            rewards received as results of executing action, shape (batch, 1)
        done: np.array
            done[i] = 1 if executing action[i] resulted in
            the end of an episode and 0 otherwise, shape (batch, 1)
        """
        ind = np.random.randint(0, len(self.storage), size=batch_size)
        state, next_state, action, reward, done = [], [], [], [], []

        for i in ind:
            st, n_st, act, rew, dn = self.storage[i]
            # np.asarray avoids a copy when possible and also works on NumPy 2.x,
            # where np.array(..., copy=False) raises if a copy is required
            state.append(np.asarray(st))
            next_state.append(np.asarray(n_st))
            action.append(np.asarray(act))
            reward.append(np.asarray(rew))
            done.append(np.asarray(dn))

        return np.array(state), np.array(next_state), np.array(action), np.array(reward).reshape(-1, 1), np.array(done).reshape(-1, 1)
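
To see the overwrite behavior of push in isolation, here is a minimal sketch that mirrors its ring-buffer logic on a three-slot buffer. The class name TinyRingBuffer and the integer "transitions" are illustrative assumptions, not part of the original code.

```python
class TinyRingBuffer:
    """Hypothetical stand-in that mirrors the push() logic of Replay_buffer."""

    def __init__(self, max_size):
        self.storage = []
        self.max_size = max_size
        self.ptr = 0

    def push(self, data):
        if len(self.storage) == self.max_size:
            # Full: overwrite the slot holding the oldest item, then advance
            self.storage[self.ptr] = data
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(data)


buf = TinyRingBuffer(max_size=3)
for transition in range(5):   # integers stand in for real transition tuples
    buf.push(transition)

print(buf.storage)  # [3, 4, 2]
```

Items 0 and 1 were the oldest, so they are the ones replaced by 3 and 4. The list is not kept in chronological order, which is fine because sampling is uniform over the whole buffer.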

Actor-Critic Neural Network

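This is a PyTorch implementation of the Actor-Critic part of the algorithm. The code defines two neural network models, the Actor and the Critic.

Actor input: the environment state. Actor output: an action with continuous values.

Critic input: the environment state and the action. Critic output: the Q-value, that is, the expected total reward for the current state-action pair.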

import torch
import torch.nn as nn


class Actor(nn.Module):
    """
    The Actor model takes in a state observation as input and
    outputs an action, which is a continuous value.

    It consists of four fully connected linear layers with ReLU activations
    between them; the final layer outputs one value per action dimension.
    """
    def __init__(self, n_states, action_dim, hidden1):
        super(Actor, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, hidden1),
            nn.ReLU(),
            nn.Linear(hidden1, hidden1),
            nn.ReLU(),
            nn.Linear(hidden1, hidden1),
            nn.ReLU(),
            # one output per action dimension (the original hard-coded 1)
            nn.Linear(hidden1, action_dim)
        )

    def forward(self, state):
        return self.net(state)


class Critic(nn.Module):
    """
    The Critic model takes in both a state observation and an action as input and
    outputs a Q-value, which estimates the expected total reward for the current state-action pair.

    It consists of four linear layers with ReLU activation functions.
    State and action inputs are concatenated before being fed into the first linear layer.

    The output layer has a single output, representing the Q-value
    """
    def __init__(self, n_states, action_dim, hidden2):
        super(Critic, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states + action_dim, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2, hidden2),
            nn.ReLU(),
            # a Q-value is a scalar, so the output width is 1, not action_dim
            nn.Linear(hidden2, 1)
        )

    def forward(self, state, action):
        return self.net(torch.cat((state, action), 1))
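
As a rough sanity check on the size of these two networks, the following sketch counts their weights and biases by hand. The helper mlp_param_count is hypothetical, and the dimensions are assumptions: a 2-dimensional observation and a 1-dimensional action (as in MountainCarContinuous-v0), with the hidden widths 20 and 64 used later in this article.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus bias vector
    return total


# Assumed dimensions (see the lead-in): 2-D observation, 1-D action
n_states, action_dim = 2, 1
hidden1, hidden2 = 20, 64

actor_params = mlp_param_count([n_states, hidden1, hidden1, hidden1, action_dim])
critic_params = mlp_param_count([n_states + action_dim, hidden2, hidden2, hidden2, 1])

print(actor_params)   # 921
print(critic_params)  # 8641
```

Both networks are small, and the critic is roughly nine times larger than the actor under these assumptions.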

Exploration Noise

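Adding noise to the action chosen by the actor is the technique DDPG uses to encourage exploration and improve the learning process.

Either Gaussian noise or Ornstein-Uhlenbeck noise can be used. Gaussian noise is simple and easy to implement. Ornstein-Uhlenbeck noise is time-correlated, which can help the agent explore the action space more efficiently; as a consequence, its fluctuations are smoother and less random from one step to the next than those of Gaussian noise.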

import copy

import numpy as np


class OU_Noise(object):
    """Ornstein-Uhlenbeck process.
    code from :
    https://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
    The OU_Noise class takes the following parameters

    size: the size of the noise vector to be generated
    seed: seed for the random number generator
    mu: the mean of the noise, set to 0 by default
    theta: the rate of mean reversion, controlling how quickly the noise returns to the mean
    sigma: the volatility of the noise, controlling the magnitude of fluctuations
    """
    def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        # The noise is drawn from NumPy, so a NumPy generator is seeded here;
        # the original seeded Python's `random` module, which had no effect.
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Reset the internal state (= noise) to mean (mu)."""
        self.state = copy.copy(self.mu)

    def sample(self):
        """Update internal state and return it as a noise sample.
        This method uses the current state of the noise and generates the next sample
        """
        dx = self.theta * (self.mu - self.state) + self.sigma * self.rng.standard_normal(len(self.state))
        self.state += dx
        # Return a copy so callers cannot modify the internal state in place
        return self.state.copy()
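
To make the "time-correlated" property concrete, the sketch below compares the lag-1 autocorrelation of independent Gaussian noise with that of a scalar Ornstein-Uhlenbeck sequence generated by the same update rule as OU_Noise.sample(). It uses only the standard library; the helper lag1_autocorrelation and the sequence length are assumptions made for illustration.

```python
import random


def lag1_autocorrelation(xs):
    """Sample correlation between consecutive values of a sequence."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs[:-1], xs[1:]))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


rng = random.Random(0)
theta, sigma, mu = 0.15, 0.2, 0.0
n_steps = 20000

# Independent Gaussian noise: each sample ignores the previous one
gaussian = [rng.gauss(0.0, sigma) for _ in range(n_steps)]

# Scalar Ornstein-Uhlenbeck noise, same update rule as OU_Noise.sample()
ou, x = [], mu
for _ in range(n_steps):
    x += theta * (mu - x) + sigma * rng.gauss(0.0, 1.0)
    ou.append(x)

print("Gaussian lag-1 autocorrelation:", round(lag1_autocorrelation(gaussian), 2))
print("OU lag-1 autocorrelation:      ", round(lag1_autocorrelation(ou), 2))
```

For this update rule the OU value should come out close to 1 - theta (0.85 here), while the Gaussian value stays near zero, which is what "smoother and less random" means in practice.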

To use Gaussian noise in DDPG, simply add Gaussian noise directly to the action selected by the agent.
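
A minimal sketch of that idea, using only the standard library: noise is added to each action component and the result is clipped back into the valid range. The function name add_gaussian_noise, the sigma value, and the example action are assumptions for illustration.

```python
import random


def add_gaussian_noise(action, sigma, max_action, rng=random):
    """Return a noisy copy of `action`, clipped to [-max_action, max_action]."""
    noisy = []
    for a in action:
        a_noisy = a + rng.gauss(0.0, sigma)
        noisy.append(max(-max_action, min(max_action, a_noisy)))
    return noisy


rng = random.Random(0)
action = [0.9]   # hypothetical actor output for a 1-D action space
noisy_action = add_gaussian_noise(action, sigma=0.1, max_action=1.0, rng=rng)
print(noisy_action)
```

Clipping matters because the environment only accepts actions inside its bounds; without it, a large noise sample could push the action outside the allowed range.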

DDPG

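DDPG (Deep Deterministic Policy Gradient) uses two sets of actor-critic neural networks for function approximation. The target networks form the second set: an actor and a critic with the same structure and parameterization as the main actor-critic networks.

During training, the agent interacts with the environment using the actor-critic networks and stores experience tuples (Sₜ, Aₜ, Rₜ, Sₜ₊₁) in the replay buffer. The agent then samples from the replay buffer and updates the actor-critic networks with that data. Rather than updating the target network weights by copying them directly from the actor-critic networks, DDPG updates them slowly through a process called soft target updates.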


In a soft target update, only a fraction of the weights is transferred from the actor-critic networks to the target networks. That fraction is called the target update rate (τ).

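The soft target update formula is:

θ′ ← τθ + (1 − τ)θ′

where θ denotes the weights of the actor or critic network and θ′ the weights of the corresponding target network.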

Using soft target updates can greatly improve the stability of learning.
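
To get a feel for how slowly the target network moves, here is a small sketch that applies the soft update repeatedly to a single weight while the source weight is held fixed. The list-based soft_update function and the numbers are illustrative assumptions; the real update operates on network parameters.

```python
def soft_update(target, source, tau):
    """Move each target weight a fraction tau of the way toward the source."""
    return [tau * s + (1.0 - tau) * t for s, t in zip(source, target)]


tau = 0.001
source = [1.0]   # stand-in for a main-network weight, held fixed here
target = [0.0]   # stand-in for the matching target-network weight

n_updates = 1000
for _ in range(n_updates):
    target = soft_update(target, source, tau)

# Closed form for a fixed source: 1 - (1 - tau) ** n_updates
print(target[0], 1.0 - (1.0 - tau) ** n_updates)
```

With τ = 0.001 the target weight has covered only about 63% of the gap after 1000 updates, so the targets used for training change gradually rather than jumping whenever the main networks change.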

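import torch
import torch.nn.functional as F
import torch.optim as optim

# Set Hyperparameters
# Hyperparameters adapted for performance from
capacity = 1000000      # maximum replay buffer size
batch_size = 64
update_iteration = 200  # gradient steps per call to update()
tau = 0.001             # tau for soft updating
gamma = 0.99            # discount factor
directory = './'
hidden1 = 20            # hidden layer width for the actor
hidden2 = 64            # hidden layer width for the critic (must be an int, not 64.)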
 
class DDPG(object):
    def __init__(self, state_dim, action_dim):
        """
        Initializes the DDPG agent.
        Takes two arguments:
        state_dim which is the dimensionality of the state space, and
        action_dim which is the dimensionality of the action space.

        Creates a replay buffer, the actor-critic networks and their corresponding target networks.
        It also initializes the optimizers for both actor and critic networks along with
        counters to track the number of training iterations.
        """
        self.replay_buffer = Replay_buffer(max_size=capacity)

        self.actor = Actor(state_dim, action_dim, hidden1).to(device)
        self.actor_target = Actor(state_dim, action_dim, hidden1).to(device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=3e-3)

        self.critic = Critic(state_dim, action_dim, hidden2).to(device)
        self.critic_target = Critic(state_dim, action_dim, hidden2).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=2e-2)

        self.num_critic_update_iteration = 0
        self.num_actor_update_iteration = 0
        self.num_training = 0

    def select_action(self, state):
        """
        takes the current state as input and returns an action to take in that state.
        It uses the actor network to map the state to an action.
        """
        state = torch.FloatTensor(state.reshape(1, -1)).to(device)
        return self.actor(state).cpu().data.numpy().flatten()

    def update(self):
        """
        updates the actor and critic networks using batches of samples from the replay buffer.
        For each batch, it computes the target Q value using the target critic network and the target actor network.
        It then computes the current Q value
        using the critic network and the action stored in the batch.

        It computes the critic loss as the mean squared error between the target Q value and the current Q value, and
        updates the critic network using gradient descent.

        It then computes the actor loss as the negative mean Q value using the critic network and the actor network, and
        updates the actor network (minimizing the negative Q value is gradient ascent on Q).

        Finally, it updates the target networks using
        soft updates, where a small fraction of the actor and critic network weights are transferred to their target counterparts.
        This process is repeated for a fixed number of iterations.
        """
        for it in range(update_iteration):
            # Sample one batch from the replay buffer
            state, next_state, action, reward, done = self.replay_buffer.sample(batch_size)
            state = torch.FloatTensor(state).to(device)
            action = torch.FloatTensor(action).to(device)
            next_state = torch.FloatTensor(next_state).to(device)
            # 1 for non-terminal transitions, 0 for terminal ones
            not_done = torch.FloatTensor(1 - done).to(device)
            reward = torch.FloatTensor(reward).to(device)

            # Compute the target Q value
            target_Q = self.critic_target(next_state, self.actor_target(next_state))
            target_Q = reward + (not_done * gamma * target_Q).detach()

            # Get current Q estimate
            current_Q = self.critic(state, action)

            # Compute critic loss
            critic_loss = F.mse_loss(current_Q, target_Q)

            # Optimize the critic
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

            # Compute actor loss as the negative mean Q value using the critic network and the actor network
            actor_loss = -self.critic(state, self.actor(state)).mean()

            # Optimize the actor
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()

            # Update the frozen target models using soft updates, where
            # tau, a small fraction of the actor and critic network weights,
            # is transferred to their target counterparts.
            for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

            for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

            self.num_actor_update_iteration += 1
            self.num_critic_update_iteration += 1

    def save(self):
        """
        Saves the state dictionaries of the actor and critic networks to files
        """
        torch.save(self.actor.state_dict(), directory + 'actor.pth')
        torch.save(self.critic.state_dict(), directory + 'critic.pth')

    def load(self):
        """
        Loads the state dictionaries of the actor and critic networks from files
        """
        self.actor.load_state_dict(torch.load(directory + 'actor.pth'))
        self.critic.load_state_dict(torch.load(directory + 'critic.pth'))
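
The target computed inside update() can be written as y = r + γ(1 − d)Q′(s′, μ′(s′)), where d is the done flag and Q′ and μ′ are the target critic and target actor. The scalar sketch below, with hypothetical numbers and a hypothetical function name td_target, shows how the done flag switches off bootstrapping at the end of an episode.

```python
def td_target(reward, done, next_q, gamma=0.99):
    """y = r + gamma * (1 - done) * Q'(s', mu'(s'))."""
    return reward + (1.0 - done) * gamma * next_q


# Hypothetical numbers: reward 1.0, target critic predicts 10.0 for the next state
print(td_target(reward=1.0, done=0.0, next_q=10.0))  # non-terminal: bootstraps from next_q
print(td_target(reward=1.0, done=1.0, next_q=10.0))  # terminal: 1.0, next_q is ignored
```

This is why the buffer stores done as a float: multiplying by (1 − done) removes the future-value term for terminal transitions without any branching.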

Training DDPG

Here we train the DDPG model on "MountainCarContinuous-v0" from OpenAI Gym. The environment provides continuous action and observation spaces, and the goal is to drive the car to the top of the mountain as quickly as possible.


The parameters of the run, such as the maximum number of training episodes, the exploration noise, and the rendering interval, are defined below. Fixed random seeds make the run reproducible.

import gym
import numpy as np
import torch

# create the environment
# Note: this code assumes the classic Gym API (gym < 0.26), where env.seed()
# exists, reset() returns the observation, and step() returns a 4-tuple.
env_name = 'MountainCarContinuous-v0'
env = gym.make(env_name)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Define different parameters for training the agent
max_episode = 100
max_time_steps = 5000
ep_r = 0
total_step = 0
score_hist = []
# for rendering the environment
render = True
render_interval = 10
# for reproducibility
env.seed(0)
torch.manual_seed(0)
np.random.seed(0)
# Environment actions and states
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
max_action = float(env.action_space.high[0])
min_Val = torch.tensor(1e-7).float().to(device)

# Exploration Noise
exploration_noise = 0.1 * max_action

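We create an instance of the DDPG agent class and train it for the specified number of episodes. The agent's update() method is called at the end of each episode to update the parameters, and the save() method writes the agent's parameters to file every 10 episodes.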

# Create a DDPG instance
agent = DDPG(state_dim, action_dim)

# Train the agent for max_episode episodes
for i in range(max_episode):
    total_reward = 0
    step = 0
    state = env.reset()
    for t in range(max_time_steps):
        action = agent.select_action(state)
        # Add Gaussian noise to actions for exploration
        action = (action + np.random.normal(0, 1, size=action_dim)).clip(-max_action, max_action)
        # action += ou_noise.sample()
        next_state, reward, done, info = env.step(action)
        total_reward += reward
        if render and i >= render_interval:
            env.render()
        # np.float was removed from NumPy; the built-in float does the same job
        agent.replay_buffer.push((state, next_state, action, reward, float(done)))
        state = next_state
        if done:
            break
        step += 1

    score_hist.append(total_reward)
    total_step += step + 1
    print("Episode: \t{} Total Reward: \t{:0.2f}".format(i, total_reward))
    agent.update()
    if i % 10 == 0:
        agent.save()
env.close()
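
The loop above relies on the classic Gym API. In gym 0.26 and later (and in Gymnasium), reset() returns (observation, info) and step() returns (observation, reward, terminated, truncated, info). If you run a newer version, a small adapter along these lines may help; the helper names unpack_reset and unpack_step are hypothetical, and treating truncation as done is a simplifying assumption.

```python
def unpack_reset(result):
    """Return the observation from either reset() convention."""
    # Newer API: reset() returns (observation, info_dict)
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result   # classic API: reset() returns the observation itself


def unpack_step(result):
    """Return (next_state, reward, done, info) from either step() convention."""
    if len(result) == 5:
        # Newer API: (obs, reward, terminated, truncated, info)
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    return result   # classic API: already (obs, reward, done, info)


# Fake return values stand in for a real environment here
print(unpack_reset(([0.1, 0.0], {})))
print(unpack_step(([0.2, 0.0], -0.1, False, True, {})))
```

In the training loop this would be used as state = unpack_reset(env.reset()) and next_state, reward, done, info = unpack_step(env.step(action)).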

Testing DDPG

from itertools import count

test_iteration = 100

for i in range(test_iteration):
    state = env.reset()
    for t in count():
        # No exploration noise at test time: use the actor's action directly
        action = agent.select_action(state)
        next_state, reward, done, info = env.step(np.float32(action))
        ep_r += reward
        print(reward)
        env.render()
        if done:
            print("reward{}".format(reward))
            print("Episode \t{}, the episode reward is \t{:0.2f}".format(i, ep_r))
            ep_r = 0
            env.render()
            break
        state = next_state

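The model converged with the following settings:

Sample the noise from a standard normal distribution instead of random sampling.
Change the Polyak constant (tau) from 0.99 to 0.001.
Change the hidden layer sizes of the critic network to [64, 64]. The ReLU activation after the second layer of the critic network is removed, giving (Linear, ReLU, Linear, Linear).
Change the maximum buffer size to 1000000.
Change batch_size from 128 to 64.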

The result after 75 training episodes:

[Figure: the trained agent after 75 episodes]

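Summary

DDPG is a model-free, off-policy Actor-Critic algorithm inspired by the Deep Q-Network (DQN) algorithm. It combines the strengths of policy gradient methods and Q-learning to learn a deterministic policy in continuous action spaces.

Like DQN, it improves the stability of training by using a replay buffer, which stores past experience for training the networks, together with target networks.

DDPG needs careful hyperparameter tuning to perform well. The hyperparameters include the learning rates, the batch size, the target network update rate, and the exploration noise parameters. Small changes in these hyperparameters can have a large effect on the algorithm's performance.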


Statement: This article is reproduced from 51cto.com. In case of infringement, please contact admin@php.cn for removal.
