Off-Policy

off-policy RL | Advantage-Weighted Regression (AWR): combining previous policies to obtain a new base policy

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. Paper title: Advantage-Weighted Regression: Simple and Scalable Off-Polic ......

policy Advantage-Weighted off-policy Regression Advantage Updated: 2023-11-13

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables

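Disclaimer: see the title for the original paper; in case of infringement, please contact the author and this post will be withdrawn. Proceedings of the 36th International Conference on Machine Learning, PMLR 97:5331-5340, 2019 ......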

Meta-Reinforcement Reinforcement Probabilistic Off-Policy Efficient Updated: 2023-09-19

Striving for Simplicity and Performance in Off-Policy DRL: Output Normalization and Non-Uniform Sampling

![](https://img2023.cnblogs.com/blog/1428973/202308/1428973-20230812075327194-1111056360.png) **Published:** 2020 (ICML 2020) **Key points:** This paper makes simple yet effective modifications to SAC in order to improve ......

Normalization Performance Non-Uniform Simplicity Off-Policy Updated: 2023-08-12

Regret Minimization Experience Replay in Off-Policy Reinforcement Learning

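**Published:** 2021 (NeurIPS 2021) **Key points:** The theory shows that samples with a higher hindsight TD error, that are more on-policy, and that have a more accurate target Q value should receive a higher sampling weight (The theory suggests that data with highe ......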

Reinforcement Minimization Experience Off-Policy Learning Updated: 2023-07-10

Off-Policy Deep Reinforcement Learning without Exploration

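**Published:** 2019 (ICML 2019) **Key points:** This paper argues that in the offline RL setting, because of extrapolation error, standard off-policy algorithms such as DQN and DDPG, when the data distribution differs greatly from the distribution of the current policy, then ......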

Reinforcement Exploration Off-Policy Learning without Updated: 2023-05-21

Learning Off-Policy with Online Planning

**Published:** 2021 (CoRL 2021) **Key points:** This paper proposes the Off-Policy with Online Planning (LOOP) algorithm, which combines H-step lookahead with a learned model and a terminal value function learne ......

Off-Policy Learning Planning Policy Online Updated: 2023-04-23

Value targets in off-policy AlphaZero: a new greedy backup

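**Published:** 2021 **Key points:** This paper designs a new value target for AlphaZero, AlphaZero with greedy backups (A0GB). The AlphaZero search tree contains exploration, and the value is an average over all outcomes, so it is not accurate. Actions are also selected according to probabilities, but the true ......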

off-policy AlphaZero targets greedy backup Updated: 2023-04-16

Behavior Policy vs. Target Policy, On-policy vs. Off-policy

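In reinforcement learning, the behavior policy is the policy the agent actually follows in the environment, whereas the target policy is the optimal policy the agent wants to learn.¹ The difference between the behavior policy and the target policy affects both the choice and the performance of a reinforcement learning algorithm.¹ Both are important concepts in reinforcement learning. (1) The difference between deterministic and stochastic policies in reinforcement learning, and the classic algorithms for each ......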

policy Off-policy On-policy behavior Updated: 2023-03-24
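
The distinction between a behavior policy and a target policy can be illustrated with tabular Q-learning, a standard off-policy method. This is only a minimal sketch and is not taken from the post itself: the chain environment (`step`), the function names, and all hyperparameters are hypothetical choices made for illustration. Actions are sampled from an epsilon-greedy behavior policy, while the update bootstraps from the greedy target policy.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Behavior policy: explores with probability epsilon, otherwise acts greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)

def greedy(q_row):
    """Target policy: always picks the action with the highest estimated value."""
    return max(range(len(q_row)), key=q_row.__getitem__)

def step(state, action, n_states):
    """Hypothetical chain environment: action 1 moves right, action 0 moves left.
    Reaching the rightmost state ends the episode with reward 1."""
    if action == 1:
        next_state = min(state + 1, n_states - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == n_states - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Data is collected by the behavior policy.
            action = epsilon_greedy(q[state], epsilon, rng)
            next_state, reward, done = step(state, action, n_states)
            # The update evaluates the target (greedy) policy at the next state,
            # which is what makes this off-policy.
            bootstrap = 0.0 if done else gamma * q[next_state][greedy(q[next_state])]
            q[state][action] += alpha * (reward + bootstrap - q[state][action])
            state = next_state
    return q

if __name__ == "__main__":
    q_table = q_learning()
    print([greedy(row) for row in q_table[:-1]])
```

An on-policy variant such as SARSA would instead bootstrap from the action the behavior policy actually takes next, so the policy being evaluated and the policy generating the data would coincide.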
