
Reinforcement Learning Code Practice 1. DDQN: Implementing Double DQN in the CartPole Game

To train a Double DQN (DDQN) on the CartPole game, we need a setup with two Q-networks: an online network that selects actions and a target network that evaluates them. Double DQN reduces the overestimation bias of Q-learning by decoupling action selection from action evaluation in this way.
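To make the difference concrete, here is a tiny, self-contained sketch comparing the standard DQN target with the Double DQN target. It uses stand-in linear networks and made-up values purely for illustration; the actual QNetwork and agent are defined below.

import torch
import torch.nn as nn

# Illustration only -- tiny stand-in networks, not the QNetwork defined later.
q_online = nn.Linear(4, 2)   # "online" network: selects the next action
q_target = nn.Linear(4, 2)   # "target" network: evaluates the selected action

reward = torch.tensor([1.0])            # batch of one transition (illustrative)
next_state = torch.zeros(1, 4)          # a CartPole state has 4 dimensions
gamma = 0.99

with torch.no_grad():
    # Standard DQN target: y = r + gamma * max_a Q_target(s', a)
    # (the target network both selects and evaluates the next action)
    dqn_target = reward + gamma * q_target(next_state).max(1).values

    # Double DQN target: y = r + gamma * Q_target(s', argmax_a Q_online(s', a))
    # (the online network selects, the target network evaluates)
    best_action = q_online(next_state).argmax(1, keepdim=True)
    ddqn_target = reward + gamma * q_target(next_state).gather(1, best_action).squeeze(1)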

Below is a Double DQN implementation based on PyTorch:

1. Import Dependencies

import random
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gym  # the training loop below uses the newer gym API (gym >= 0.26 / gymnasium)
from collections import deque

2. Define the Q-Network

We first define a Q-network to estimate Q-values. A simple fully connected network is sufficient here.

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)  # one Q-value per action

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
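As a quick sanity check (an illustrative snippet, not part of the training code), the network maps a batch of states to one Q-value per action; for CartPole the state has 4 dimensions and there are 2 discrete actions:

# Sanity check: CartPole has a 4-dimensional state and 2 discrete actions
net = QNetwork(state_dim=4, action_dim=2)
dummy_state = torch.zeros(1, 4)          # batch containing a single all-zero state
print(net(dummy_state).shape)            # torch.Size([1, 2]) -- one Q-value per action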

3. Create the Agent

class DoubleDQNAgent:
    def __init__(self, state_dim, action_dim, gamma=0.99, epsilon=0.1, epsilon_decay=0.995, epsilon_min=0.01, lr=0.0005):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.lr = lr

        self.q_network = QNetwork(state_dim, action_dim)
        self.target_network = QNetwork(state_dim, action_dim)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=self.lr)

        self.memory = deque(maxlen=10000)
        self.batch_size = 64

    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.choice(range(self.action_dim))  # Explore
        else:
            state = torch.FloatTensor(state).unsqueeze(0)
            with torch.no_grad():
                q_values = self.q_network(state)
            return torch.argmax(q_values).item()  # Exploit

    def store_experience(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample_batch(self):
        return random.sample(self.memory, self.batch_size)

    def update_target_network(self):
        self.target_network.load_state_dict(self.q_network.state_dict())

    def train(self):
        if len(self.memory) < self.batch_size:
            return

        batch = self.sample_batch()
        states, actions, rewards, next_states, dones = zip(*batch)

        states = torch.FloatTensor(np.array(states))
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(np.array(next_states))
        dones = torch.FloatTensor(dones)

        # Q values for current states
        q_values = self.q_network(states)
        q_values = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

        # Double DQN target: select the next action with the online network,
        # then evaluate it with the target network (no gradients flow into the target)
        with torch.no_grad():
            next_actions = self.q_network(next_states).argmax(1)
            next_q_values = self.target_network(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
            target = rewards + (1 - dones) * self.gamma * next_q_values

        # Compute loss
        loss = nn.MSELoss()(q_values, target)

        # Optimize the Q-network
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
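A common variant worth knowing about (not used in the agent above) is a soft, or Polyak, target update: instead of copying all weights once per episode, the target network is nudged toward the online network by a small factor after every training step. A minimal sketch, assuming an illustrative blending coefficient tau:

# Optional variant (sketch only, not used above): soft / Polyak target update.
# Could be called after every agent.train() step instead of the hard copy
# in update_target_network(). tau = 0.005 is an illustrative value, not tuned.
def soft_update(agent, tau=0.005):
    for target_param, online_param in zip(agent.target_network.parameters(),
                                          agent.q_network.parameters()):
        target_param.data.copy_(tau * online_param.data + (1 - tau) * target_param.data)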

4. Training Loop

def train_cartpole():
    env = gym.make('CartPole-v1')
    agent = DoubleDQNAgent(state_dim=env.observation_space.shape[0], action_dim=env.action_space.n)

    episodes = 1000
    for episode in range(episodes):
        state, info = env.reset()
        done = False
        total_reward = 0

        while not done:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            agent.store_experience(state, action, reward, next_state, done)
            state = next_state

            agent.train()
            total_reward += reward

        agent.update_target_network()

        print(f"Episode {episode}, Total Reward: {total_reward}, Epsilon: {agent.epsilon:.4f}")

    env.close()

if __name__ == '__main__':
    train_cartpole()
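After training, the learned policy can be checked with a purely greedy rollout. The evaluate helper below is a sketch, not part of the original code; it assumes a trained agent is available, e.g. by having train_cartpole() return its agent.

# Sketch only: greedy evaluation of a trained agent (epsilon = 0, no learning).
def evaluate(agent, episodes=10):
    env = gym.make('CartPole-v1')
    agent.epsilon = 0.0                      # act greedily, no exploration
    for episode in range(episodes):
        state, info = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = agent.select_action(state)
            state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            total_reward += reward
        print(f"Eval episode {episode}, reward: {total_reward}")
    env.close()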

5. Explanation

  • QNetwork: a simple 3-layer fully connected network that approximates the Q-function.
  • DoubleDQNAgent:
    • select_action: chooses an action with an ε-greedy policy.
    • store_experience: stores transitions in the replay buffer.
    • sample_batch: randomly samples a mini-batch from the replay buffer.
    • train: updates the Q-network weights using the Double DQN target computation.
    • update_target_network: copies the online network's weights into the target network (here, once per episode).
  • Training loop: in each episode the agent interacts with the environment and updates the Q-network, learning step by step through experience replay.

6. Tuning Hyperparameters

  • gamma: discount factor, controls how strongly future rewards are weighted.
  • epsilon: initial exploration rate, decayed gradually as training progresses.
  • lr: learning rate, controls the step size of weight updates.
  • batch_size: number of transitions sampled from the replay buffer per update.
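For example, a more exploratory configuration can be passed in when constructing the agent. The values below are illustrative only, not tuned:

# Illustrative values only: a more exploratory configuration.
agent = DoubleDQNAgent(state_dim=4, action_dim=2,
                       gamma=0.98, epsilon=1.0, epsilon_decay=0.999,
                       epsilon_min=0.05, lr=1e-3)
agent.batch_size = 128   # batch_size is a plain attribute in this implementation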

This code can be used as-is to train a Double DQN agent on CartPole, gradually improving the Q-network until it solves the task. For more complex needs, such as a deeper network or further algorithmic improvements, it can be extended from this base.
