A fast and simple algorithm for training neural probabilistic language models


Mnih A. and Teh Y. W. A fast and simple algorithm for training neural probabilistic language models. ICML, 2012.

Overview

This paper applies noise contrastive estimation (NCE) to the training of neural language models.

Noise contrastive estimation

Given a context \(h\), the conditional probability that the next word is \(w\) is defined as:

\[ P_{\theta}(w|h) = \frac{\exp(s_{\theta}(w, h))}{\sum_{w'} \exp(s_{\theta}(w', h))}, \]
The authors argue that when the vocabulary is large, computing the normalizing term \(Z^h = \sum_{w'} \exp(s_{\theta}(w', h))\) is too expensive, so the paper turns to NCE to get around it.
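To make the cost concrete, here is a minimal sketch of the softmax above in plain Python. The score values and the 5-word vocabulary are made up for illustration; the point is only that the normalizer needs one term per vocabulary word for every context.

```python
import math

def softmax_prob(scores, w):
    """P(w | h), given a list holding s(w', h) for every word w' in the vocabulary."""
    m = max(scores)  # subtract the max for numerical stability
    # One exp() per vocabulary word: this sum is Z^h scaled by exp(-m).
    z = sum(math.exp(s - m) for s in scores)
    return math.exp(scores[w] - m) / z

# Hypothetical scores for a 5-word vocabulary under a single context h.
scores = [2.0, 0.5, -1.0, 0.0, 1.5]
probs = [softmax_prob(scores, w) for w in range(len(scores))]
```

With a realistic vocabulary of tens or hundreds of thousands of words, this sum (and its gradient) has to be recomputed for every training context.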
For problems of this kind, NCE sets up a binary classification task:

\[P(C=1|w, h; \theta) = \frac{P_{\theta}(w|h)}{P_{\theta}(w|h) + k P_n(w|h)}, \]
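where \(P_n(w|h)\) is a noise distribution, and \(k\) means that words are sampled from the true distribution and from the noise distribution in a ratio of \(1:k\).
Letting \(c^h=\ln Z^h\), we have

\[\ln P_{\theta}(w|h) = s_{\theta}(w, h) - c^h. \]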
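Dividing the numerator and denominator of the posterior by \(P_{\theta}(w|h)\) then gives, with \(\sigma\) the logistic sigmoid:

\[P(C=1|w, h; \theta) = \sigma(s'_{\theta}(w, h)), \\ s'_{\theta}(w, h) = s_{\theta}(w, h) - c^h - \ln \big(k P_n(w|h)\big). \]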
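The sigmoid form can be checked numerically. The sketch below computes the posterior both ways, directly from the ratio and through \(\sigma(s - c^h - \ln(kP_n))\), on a toy vocabulary; the scores, the noise distribution and \(k=3\) are illustrative assumptions, not values from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def posterior_direct(scores, noise, w, k):
    """P(C=1 | w, h) as the ratio P_theta / (P_theta + k * P_n)."""
    z = sum(math.exp(s) for s in scores)
    p_model = math.exp(scores[w]) / z
    return p_model / (p_model + k * noise[w])

def posterior_sigmoid(scores, noise, w, k):
    """The same posterior written as sigma(s - c^h - ln(k * P_n))."""
    c_h = math.log(sum(math.exp(s) for s in scores))  # c^h = ln Z^h
    return sigmoid(scores[w] - c_h - math.log(k * noise[w]))

scores = [2.0, 0.5, -1.0, 0.0, 1.5]   # hypothetical s(w, h)
noise = [0.4, 0.3, 0.1, 0.1, 0.1]     # hypothetical P_n(w | h)
k = 3
gaps = [
    abs(posterior_direct(scores, noise, w, k) - posterior_sigmoid(scores, noise, w, k))
    for w in range(len(scores))
]
```

The entries of `gaps` should differ from zero only by floating-point rounding.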
NCE treats \(c^h\) as one more parameter to be learned. The loss for each \(h\), with \(P(w|h)\) denoting the data distribution, is then:

\[ -\mathbb{E}_{w \sim P(w|h)} \log \sigma(s'_{\theta}(w, h)) - k \mathbb{E}_{w \sim P_n(w|h)} \log(1 - \sigma(s'_{\theta}(w, h))). \]
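A single-sample estimate of this loss might look as follows. This is a sketch under assumptions: `score` is a hypothetical callable standing in for \(s_{\theta}(\cdot, h)\), the noise distribution is a plain list of probabilities, and the term \(k\,\mathbb{E}_{w \sim P_n}\) is approximated by summing over \(k\) drawn noise words.

```python
import math
import random

def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def nce_loss_sample(score, c_h, noise, w_data, k, rng):
    """One-sample NCE loss for a context h: one data word plus k noise words."""
    def s_prime(w):
        return score(w) - c_h - math.log(k * noise[w])

    # Draw k noise words from P_n(. | h).
    noise_words = rng.choices(range(len(noise)), weights=noise, k=k)
    loss = -log_sigmoid(s_prime(w_data))
    # log(1 - sigma(x)) equals log(sigma(-x)).
    loss -= sum(log_sigmoid(-s_prime(w)) for w in noise_words)
    return loss

table = [2.0, 0.5, -1.0, 0.0, 1.5]   # hypothetical scores s(w, h)
noise = [0.4, 0.3, 0.1, 0.1, 0.1]    # hypothetical P_n(w | h)
rng = random.Random(0)
loss = nce_loss_sample(lambda w: table[w], 0.0, noise, w_data=0, k=3, rng=rng)
```

Only the scores of the data word and the \(k\) noise words are evaluated, so the cost per context no longer grows with the vocabulary size.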
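One problem is that natural language has far too many distinct contexts \(h\) to learn a separate parameter \(c^h\) for each of them. The authors find that simply fixing \(c^h \equiv 0\) already works well in their experiments, so in practice we use:

\[s'_{\theta}(w, h) = s_{\theta}(w, h) - \ln \big(k P_n(w|h)\big). \]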
In particular, if we choose the simplest possible noise distribution, namely \(P_n(w|h) = P_n(w) = \frac{1}{N}\) (where \(N\) is the number of words), we get:

\[s'_{\theta}(w, h) = s_{\theta}(w, h) - \ln \frac{k}{N}. \]
Going one step further, the constant \(-\ln \frac{k}{N}\) can be dropped as well, as long as we trust \(s_{\theta}(w, h)\) to absorb it on its own. This is essentially what negative sampling (NEG) in Word2Vec does.
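The "absorb the constant" argument can be illustrated with a small sketch. Both loss functions below are simplified single-sample versions written for this note, with hypothetical scores and fixed noise words: NCE with uniform noise applied to scores shifted by \(\ln \frac{k}{N}\) gives the same value as the NEG-style loss on the unshifted scores.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def nce_uniform_loss(score, w_data, noise_words, k, n_words):
    """NCE loss with c^h = 0 and uniform noise P_n(w) = 1 / N."""
    offset = math.log(k / n_words)  # ln(k * P_n(w)) is the same for every word
    loss = -log_sigmoid(score(w_data) - offset)
    loss -= sum(log_sigmoid(-(score(w) - offset)) for w in noise_words)
    return loss

def neg_loss(score, w_data, noise_words):
    """NEG-style loss: the raw score goes straight into the sigmoid."""
    loss = -log_sigmoid(score(w_data))
    loss -= sum(log_sigmoid(-score(w)) for w in noise_words)
    return loss

table = [2.0, 0.5, -1.0, 0.0, 1.5]   # hypothetical scores s(w, h)
k, n_words = 3, len(table)
offset = math.log(k / n_words)
noise_words = [1, 3, 4]              # fixed here so the comparison is deterministic

nce_value = nce_uniform_loss(lambda w: table[w] + offset, 0, noise_words, k, n_words)
neg_value = neg_loss(lambda w: table[w], 0, noise_words)
```

In other words, a model trained with the NEG-style loss can reach the same loss values by learning scores that differ from the NCE ones by a constant.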
