RLHF · PBRL | SURF：使用半监督学习，对 labeled segment pair 进行数据增强-526互联

论文名称：SURF: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning，ICLR 2022，分数 6 6 6 接收，又是 Pieter Abbeel 组的（恼）。
（最近的 reading list 里全是他们组的文章，已经读过了 PEBBLE 和 RUNE 并撰写阅读笔记）
感觉这篇跟 RUNE 一样，不太难的样子。（其实 PEBBLE 也不算难懂）
阅读材料：
- Open Review： https://openreview.net/forum?id=TfhfZLQ2EJO
- pdf 版本： https://arxiv.org/pdf/2203.10050.pdf
- html 版本： https://ar5iv.labs.arxiv.org/html/2203.10050

Open Review
0 abstract
2 related work
4 method： SURF
- 4.1 Semi-supervised reward learning - 半监督的 reward learning
- 4.2 Temporal data augmentation for reward learning - reward learning 中的时序数据增强
5 experiments

Open Review

贡献：
- semi-supervised learning + PBRL。
- 两部分：① 利用 pseudo-label 和 reference predictor 来整一些 artificial labels，② 裁剪（crop）连续的（consecutive）子序列（sub-sequences）来做 data augmentation。（感觉这两部分貌似是相互独立的）
- 实验：
  - 实验环境是 Meta-world 和 DMControl suite，结果表明性能显著提升。
  - SURF 仅 access 了少量 expert queries，性能可与 dense-reward SAC 相媲美。
- 关于 temporal cropping method：
  - 首先，抽取一对长为 50 的 segment，把它们作为 query 给 teacher 送去 label。
  - 然后，我们存储这些 segment，在左右两侧都有 5 的额外边距，即我们存储了长度 = 60 的 segment。
  - 在 reward learning 时，我们在 [Hmin, Hmax] = [45,55] 中，随机选择每个 segment k0,k1 的裁剪长度 H' 和起始位置，然后裁剪连续的 sub-sequences。
  - 详见 Algorithm 1。temporal cropping 的超参数详见 Appendix B。
优点：
- 实验量充足。formulation 清晰。性能很好。
缺点：
- reward function 是怎么学的，在第 3 和 4.1 节，reviewer 没太看懂。
- 如图 6(b) 所示，pseudo-labeling 技术要求超参数 τ 非常大，reviewer 在疑惑，为什么需要非常高的 confidence。这些 high-confidence samples 的 loss 应该非常小，为什么会让最终性能显著提升。（没太听懂）回答：这种高阈值的 pseudo-labeling 在半监督学习领域中，有充分的证明和广泛的应用。
- 有一个假设太强了：“augmentation 背后的直觉是，对于一对给定的 behavior clips，将它们 slightly shifted 或 resize，human teacher 可能仍然持有相同的 preference。” 反驳：CV 上相似 idea（图像裁剪）的效果很好。
- 技术上的 novelty 有限。
- （有两个 reviewer 说）ablation 可以多在几个 task 上做，不然对 TDA（好像是某个 task）的 support 是不够的。（然后就真的补 ablation 了）

0 abstract

Preference-based reinforcement learning (RL) has shown potential for teaching agents to perform the target tasks without a costly, pre-defined reward function by learning the reward with a supervisor’s preference between the two agent behaviors. However, preference-based learning often requires a large amount of human feedback, making it difficult to apply this approach to various applications. This data-efficiency problem, on the other hand, has been typically addressed by using unlabeled samples or data augmentation techniques in the context of supervised learning. Motivated by the recent success of these approaches, we present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation. In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor. To further improve the label-efficiency of reward learning, we introduce a new data augmentation that temporally crops consecutive sub-sequences from the original behaviors. Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the state-of-the-art preference-based method on a variety of locomotion and robotic manipulation tasks.

背景：
- 在没有昂贵的预定义 reward function 情况下，PBRL 已显示出教授 agent 执行目标任务的潜力。具体的，通过 human supervisor 在两种 agent behaviors 之间的 preference，来学习一个 reward model。
- 然而，PBRL 通常需要大量的人类反馈，因此很难广泛应用。
- 这种数据效率的问题，通常会在监督学习的背景下，使用未标记的样本（unlabeled samples）或数据增强（data augmentation）技术来解决。
method：
- 受这些方法启发，我们提出了 SURF，一种 semi-supervised reward learning framework，利用大量未标记的样本，进行 data augmentation。
- 具体的，为了利用 unlabeled samples 进行 reward learning，我们根据 preference predictor 的置信度（confidence），推断未标记样本的伪标签（pseudo-labels）。
- 为了进一步提高 reward learning 的 label-efficiency，我们引入了一种新的 data augmentation 技术，在时间上从 original behaviors 中 temporally crops consecutive sub-sequences。
实验：SURF 显著提高了最先进的 PBRL 算法在各种 locomotion 和 robot manipulation 任务上的 feedback-efficiency。

PBRL。
Data augmentation for RL（有趣的，以前没注意过的角度）
Semi-supervised learning 半监督学习：还是有很多 literature 的，不太了解这个领域…

4 method： SURF

SURF: a Semi-sUpervised Reward learning with data augmentation for Feedback-efficient preference-based RL.

感觉看一下 Algorithm 就可以了。

4.1 Semi-supervised reward learning - 半监督的 reward learning

pseudo-labeling：y hat(σ0, σ1) = 0 if P_ψ[σ0>σ1] > 0.5 else 1 。
为了过滤掉不准确的伪标签，只在 predictor 的 confidence 高于一个 pre-defined threshold 时，才使用 unlabeled samples 进行训练。（confidence 大概指的是，P_ψ[σ0>σ1] > τ，τ 是 confidence 的阈值）
（Algorithm 1，里面出现的 TDA temporal data augmentation 在 Algorithm 2 里）

4.2 Temporal data augmentation for reward learning - reward learning 中的时序数据增强

（Algorithm 2）
利用增强样本 \((\hat σ^0,\hat σ^1)\) 来优化公式 (5) 中的交叉熵损失。

5 experiments

Pieter Abbeel 组的 experiments section 经典问题：（如果你不知道经典问题指什么，可以看 PEBBLE RUNE 的本站博客；这三篇文章都是他们组的，写作非常相似）

How does SURF improve the existing preference-based RL method in terms of feedback efficiency?
SURF 如何在反馈效率方面，改进现有的 PBRL 方法？
What is the contribution of each of the proposed components in SURF?
SURF 中每个 proposed components 的贡献是什么？
How does the number of queries affect the performance of SURF?
queries 的数量如何影响 SURF 的性能？
Is temporal cropping better than existing state-based data augmentation methods in terms of feedback efficiency?
在 feedback efficiency 方面，temporal cropping 是否比现有的 state-based data augmentation 方法更好？
Can SURF improve the performance of preference-based RL methods when we operate on high-dimensional and partially observable inputs?
应对高维和 partially observable 的输入时，SURF 能否提高基于 PBRL 方法的性能？

implementation details：

对于 query selection 策略，我们选择 queries with high uncertainty，使用 disagreement-based sampling 方案，即 ensemble disagreement（Appendix B）。
更多细节见 Appendix B。

results：

相比 PEBBLE，surf 需要更少的 queries 数量。
在相同 queries 预算下，surf 可以显著提高 PEBBLE 的性能。
ablation 就是将两种技术分别使用，比较它们的训练 curve。
ablation 还比较了不同的 query size（是 feedback 数量，好像不是 segment 长度）、不同的 data augmentation 方法、不同的 surf 超参数。
在问题中画饼的“高维 partially observed input”，指的是 section 5.4 的 visual control tasks 嘛？（但是又在 6 discussion 中说是 future direction）