Reinforcement Learning: Q-Learning - Python Implementation

Posted 2023-06-03 21:58:51 · Author: LOGAN_XIONG
  • Algorithm Features
    ①. Train the Q-function on actual rewards; ②. Update the policy \(\pi\) towards the action with maximum Q

  • Algorithm Derivation
    Part Ⅰ: Principles of RL
    The overall interaction flow between the agent and the environment is as follows.

    Define the policy function (policy) \(\pi\), which takes a state \(s\) as input and outputs an action \(a\); then,

    \[\begin{equation*} a = \pi(s) \end{equation*} \]

    Let the interaction sequence be \(\{\cdots, s_t, a_t, r_t, s_{t+1}, \cdots\}\). Define the state value function \(V^{\pi}(s)\) as the expected cumulative reward obtained when the agent, starting from the current state \(s\), keeps interacting with the environment under policy \(\pi\); then,

    \[\begin{equation*} V^{\pi}(s_t) = r_t + V^{\pi}(s_{t+1}) \end{equation*} \]

    Define the state-action value function \(Q^{\pi}(s, a)\) as the expected cumulative reward obtained when the agent, in the current state \(s\), is forced to take action \(a\) and thereafter follows policy \(\pi\) while interacting with the environment; then,

    \[\begin{equation*} \begin{split} Q^{\pi}(s_t, a_t) &= r_t + Q^{\pi}(s_{t+1},\pi(s_{t+1})) \\ &= r_t + V^{\pi}(s_{t+1}) \end{split} \end{equation*} \]

    Define the optimal policy \(\pi^*\) as the one that maximizes the state value function \(V^{\pi}\), i.e.,

    \[\begin{equation*} \begin{split} \pi^* &= \mathop{\arg\max}_{\pi}\ V^{\pi}(s) \\ &= \mathop{\arg\max}_{\pi}\ Q^{\pi}(s, \pi(s)) \end{split} \end{equation*} \]
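
    In particular, once the optimal Q-function is known, the optimal policy is obtained greedily from it, which is exactly the "update the policy towards the maximum Q" stated in the algorithm features:

    \[\begin{equation*} \pi^*(s) = \mathop{\arg\max}_{a}\ Q^{\pi^*}(s, a) \end{equation*} \]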

    In a continuous state space, the Q-function can be implemented as a neural network, for example as sketched below.
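
    A minimal sketch of such a network (not the post's actual code; the dimensions state_dim and n_actions are hypothetical), which takes a state vector and outputs one Q value per discrete action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-function for a continuous state space: maps a state vector to
    one Q value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # Q(s, a) for every action a
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# usage: QNetwork(state_dim=8, n_actions=4)(torch.randn(1, 8)) -> tensor of shape (1, 4)
```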

    In a discrete state space, the Q-function can be implemented as a lookup table (Q-table), for example as sketched below.
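
    A minimal sketch of a tabular Q-function (the sizes n_states and n_actions are hypothetical):

```python
import numpy as np

n_states, n_actions = 16, 4          # hypothetical sizes of the discrete spaces
Q = np.zeros((n_states, n_actions))  # Q-table: one entry per (state, action) pair

def greedy_action(state: int) -> int:
    """Return the action with the largest tabulated Q value in this state."""
    return int(np.argmax(Q[state]))
```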

    Part Ⅱ: Implementation of RL
    Training tips:
    ①. The Q-function of the target network can be kept fixed for a certain number of training steps.
    ②. Exploration makes the collected data more diverse; two common schemes follow, with a code sketch after these tips.

    • Epsilon Greedy

      \[\begin{equation*} a = \left\{ \begin{split} \mathop{\arg\max}_{a}\ Q(s, a), \quad & \text{with probability }1-\varepsilon\\ \text{random}, \quad\quad\quad & \text{with probability }\varepsilon \\ \end{split} \right. \end{equation*} \]

      The value of \(\varepsilon\) is gradually decreased over the course of training.
    • Boltzmann Exploration

      \[P(a|s) = \frac{\exp(Q(s,a))}{\sum_{a'}\exp(Q(s,a'))} \]

    ③. The replay buffer stores experience in units of \((s_t, a_t, r_t, s_{t+1})\); the stored data may come from different policies \(\pi\). New data is kept while old data is discarded, and at each iteration a random batch is drawn for training.
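
    A minimal sketch of the two exploration rules above (function and variable names are hypothetical; q_values stands for \(Q(s,\cdot)\)):

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, eps: float) -> int:
    """With probability eps take a random action, otherwise the argmax of Q(s, .)."""
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values: np.ndarray, temperature: float = 1.0) -> int:
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    logits = (q_values - q_values.max()) / temperature   # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))

# eps is typically annealed over training, e.g. eps = max(0.05, eps * 0.995) after each episode
```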
    The algorithm flow is as follows.
    Initialize Q-function \(Q\), target Q-function \(\bar{Q} = Q\) (Note: target network)
    for each episode
    \(\quad\) for each time step \(t\)
    \(\qquad\) Given state \(s_t\), take action \(a_t\) based on \(Q\) (Note: exploration)
    \(\qquad\) Obtain reward \(r_t\), and reach new state \(s_{t+1}\)
    \(\qquad\) Store \((s_t,a_t,r_t,s_{t+1})\) into buffer (Note: replay buffer)
    \(\qquad\) Sample \((s_i, a_i, r_i, s_{i+1})\) from buffer (usually a batch)
    \(\qquad\) Target \(\bar{y}=r_i + \max_a\ \bar{Q}(s_{i+1}, a)\)
    \(\qquad\) Update the parameters of \(Q\) to make \(Q(s_i, a_i)\) close to \(\bar{y}\) (regression)
    \(\qquad\) Every \(c\) steps reset \(\bar{Q}=Q\)
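
    A minimal sketch of one iteration of this loop, using the undiscounted target \(\bar{y}=r_i + \max_a\ \bar{Q}(s_{i+1}, a)\) as written above (not the post's actual code; the network size, optimizer, and buffer format — torch tensors, with actions stored as int64 — are assumptions):

```python
import random
from collections import deque

import torch
import torch.nn as nn

def make_q_net(state_dim: int = 8, n_actions: int = 4) -> nn.Module:
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = make_q_net(), make_q_net()
target_net.load_state_dict(q_net.state_dict())           # initialize Q_bar = Q
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=100_000)                            # entries: (s, a, r, s_next) as tensors

def train_step(batch_size: int = 64) -> None:
    """One regression step: pull Q(s_i, a_i) towards r_i + max_a Q_bar(s_{i+1}, a)."""
    if len(buffer) < batch_size:
        return
    s, a, r, s_next = map(torch.stack, zip(*random.sample(buffer, batch_size)))
    with torch.no_grad():
        y_bar = r + target_net(s_next).max(dim=1).values  # target from the frozen network
    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)   # Q(s_i, a_i) for the sampled actions
    loss = nn.functional.mse_loss(q_sa, y_bar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target() -> None:
    """Every c steps: reset Q_bar = Q."""
    target_net.load_state_dict(q_net.state_dict())
```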

  • Code Implementation
    This post uses SnakeGame as the example for applying the algorithm: the game is rasterized into a grid, the input features are assigned to separate channels, and a convolutional network extracts and recognizes these features. The concrete implementation is available at
    Q-Learning-for-SnakeGame
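
    The repository above contains the full implementation; the following is only a rough sketch of the idea just described (the number of channels and their meanings — e.g. snake body, snake head, food — as well as the grid size are assumptions):

```python
import torch
import torch.nn as nn

class SnakeQNet(nn.Module):
    """Sketch: the board is rasterized into an n_channels x grid x grid tensor
    (e.g. body / head / food on separate channels); a small CNN maps it to one
    Q value per action."""
    def __init__(self, n_channels: int = 3, n_actions: int = 4, grid: int = 12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),   # keeps activations centered (see the usage suggestions below)
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        )
        self.head = nn.Linear(32 * grid * grid, n_actions)

    def forward(self, board: torch.Tensor) -> torch.Tensor:
        x = self.features(board)                  # convolutional feature extraction
        return self.head(x.flatten(start_dim=1))  # one Q value per action
```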

  • Results

    As can be seen, the training performance on SnakeGame meets expectations.

  • Usage Suggestions
    ①. Batch Normalization keeps the feature distribution centered near the origin, making numerical overflow less likely.
    ②. Complex features can be composed from binary features whose concrete meanings are defined by the user; a small sketch follows.
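
    As an illustration of ② (the feature set below is hypothetical), a compound state vector can be assembled from simple binary indicators, each with a self-defined meaning:

```python
import numpy as np

# hypothetical binary indicators for a Snake-like game; each flag has a self-defined meaning
state = np.array([
    1,            # danger straight ahead
    0,            # danger to the left
    0,            # danger to the right
    0, 1, 0, 0,   # current moving direction (one-hot: up / right / down / left)
    1,            # food is above the head
    0,            # food is below the head
], dtype=np.float32)
```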

  • References
    ①. Q-Learning (Reinforcement Learning) - 李宏毅
    ②. Python + PyTorch + Pygame Reinforcement Learning – Train an AI to Play Snake