Off-Policy

off-policy RL | Advantage-Weighted Regression (AWR):组合先前策略得到新 base policy

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning 论文题目:Advantage-Weighted Regression: Simple and Scalable Off-Polic ......

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables

郑重声明:原文参见标题,如有侵权,请联系作者,将会撤销发布! Proceedings of the 36th International Conference on Machine Learning, PMLR 97:5331-5340, 2019 ......

Striving for Simplicity and Performance in Off-Policy DRL: Output Normalization and Non-Uniform Sampling

![](https://img2023.cnblogs.com/blog/1428973/202308/1428973-20230812075327194-1111056360.png) **发表时间:**2020(ICML 2020) **文章要点:**这篇文章基于SAC做简单并且有效的改进来提升 ......

Regret Minimization Experience Replay in Off-Policy Reinforcement Learning

**发表时间:**2021 (NeurIPS 2021) **文章要点:**理论表明,更高的hindsight TD error,更加on policy,以及更准的target Q value的样本应该有更高的采样权重(The theory suggests that data with highe ......

Off-Policy Deep Reinforcement Learning without Exploration

**发表时间:**2019(ICML 2019) **文章要点:**这篇文章想说在offline RL的setting下,由于外推误差(extrapolation errors)的原因,标准的off-policy算法比如DQN,DDPG之类的,如果数据的分布和当前policy的分布差距很大的话,那就 ......

Learning Off-Policy with Online Planning

**发表时间:**2021(CoRL 2021) **文章要点:**这篇文章提出Off-Policy with Online Planning (LOOP)算法,将H-step lookahead with a learned model和terminal value function learne ......
Off-Policy Learning Planning Policy Online

Value targets in off-policy AlphaZero: a new greedy backup

**发表时间:**2021 **文章要点:**这篇文章给AlphaZero设计了一个新的value targets,AlphaZero with greedy backups (A0GB)。 AlphaZero的树里面有探索,而value又是所有结果的平均,所以并不准确。而选动作也是依概率选的,但真 ......
off-policy AlphaZero targets greedy backup

行为策略与目标策略、On-policy与Off-policy

在强化学习中,行为策略和目标策略的区别在于,行为策略是智能体在环境中实际采取的策略,而目标策略是智能体希望学习的最优策略。¹ 行为策略和目标策略的差异会影响到强化学习算法的选择和性能。¹ 行为策略和目标策略都是强化学习中的重要概念。 (1) 强化学习中,确定性策略和随机策略的区别,以及各自经典的算法 ......
策略 policy Off-policy On-policy 行为
共8篇  :1/1页 首页上一页1下一页尾页