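As shown in Figure 1, in reinforcement learning the state is the state of the environment, i.e. what the agent receives as its observation.
Figure 1: Reinforcement learning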
1. Policy-based approach: learning an actor
The policy-based approach learns an actor (also called an agent or a policy).
Figure 2: Example of the policy-based approach
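To make "actor" concrete, here is a minimal sketch of an actor as a function from an observation to a probability distribution over actions. The linear scoring, the `actor` and `softmax` names, and all numbers are assumptions chosen for illustration; a real actor is usually a neural network.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def actor(observation, weights):
    """Hypothetical linear actor: one weight row per action."""
    scores = [sum(w * x for w, x in zip(row, observation)) for row in weights]
    return softmax(scores)

observation = [1.0, 0.0]                          # made-up observation
weights = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]   # made-up parameters, 3 actions
probs = actor(observation, weights)               # action probabilities
```

Learning an actor then means adjusting `weights` so that actions leading to higher reward become more probable.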
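On-policy means that the agent being learned (the actor) and the agent interacting with the environment are the same: the agent learns while it interacts with the environment, so the behavior policy and the target policy are the same policy.
Off-policy means that the agent being learned and the agent interacting with the environment are different: the agent learns by watching someone else play the game, so the behavior policy and the target policy are not the same policy. In some of the literature the actor is also called the policy.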
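The difference can be sketched on a toy two-armed bandit. Everything here (the `collect` helper, the reward values, both policies) is hypothetical and only illustrates who generates the data in each setting.

```python
import random

REWARDS = [0.0, 1.0]  # made-up reward for each arm of a toy 2-armed bandit

def collect(behavior_policy, n, rng):
    """Let behavior_policy act for n steps and record the rewards it receives."""
    actions = rng.choices(range(len(behavior_policy)), weights=behavior_policy, k=n)
    return [REWARDS[a] for a in actions]

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(0)
target_policy = [0.5, 0.5]   # the policy we want to learn about
other_policy = [0.9, 0.1]    # a different policy that only generates data

# On-policy: the target policy itself interacts with the environment.
on_policy_rewards = collect(target_policy, 10_000, rng)

# Off-policy: another policy interacts; the target policy only watches.
off_policy_rewards = collect(other_policy, 10_000, rng)

# Exact expected reward of the target policy, for comparison.
true_value = sum(p * r for p, r in zip(target_policy, REWARDS))
on_policy_mean = mean(on_policy_rewards)
off_policy_mean = mean(off_policy_rewards)
```

The on-policy average lands near the target policy's true value, while the raw off-policy average reflects the other policy instead. Off-policy data therefore cannot be used as-is; it has to be corrected for the mismatch between the two policies.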
Proximal Policy Optimization (PPO)
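PPO builds on the off-policy idea: the policy being trained learns from samples collected by a different policy. In practice that other policy is a recent copy of the trained one, which is why PPO is often classified as on-policy.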
Goal: use samples from $\pi_{\theta^{\prime}}$ to train $\theta$. Because $\theta^{\prime}$ is fixed while $\theta$ is updated, the same sample data can be reused.
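A standard way to train one policy on samples drawn from another is importance sampling: each sample is reweighted by the ratio $\pi_{\theta}(a) / \pi_{\theta^{\prime}}(a)$. The text above does not name the technique, so treat this as an assumed illustration; the three-action policies and the advantage values are made up.

```python
# Hypothetical 3-action example; all numbers are made up for illustration.
pi_theta = [0.5, 0.3, 0.2]        # policy being trained
pi_theta_prime = [0.2, 0.3, 0.5]  # fixed policy that collected the samples
advantage = [1.0, 0.0, -1.0]      # illustrative per-action advantage

def expectation(probs, values):
    """Exact expectation of values under a discrete distribution."""
    return sum(p * v for p, v in zip(probs, values))

# Expected advantage under pi_theta, computed directly.
direct = expectation(pi_theta, advantage)

# Same quantity computed under pi_theta_prime using importance weights.
ratios = [p / q for p, q in zip(pi_theta, pi_theta_prime)]
reweighted = expectation(
    pi_theta_prime, [w * a for w, a in zip(ratios, advantage)]
)
```

Both `direct` and `reweighted` evaluate to the same expectation, which is what allows data gathered under $\theta^{\prime}$ to stand in for data gathered under $\theta$. The estimate becomes unreliable when the two policies differ too much, since the ratios then grow large.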