Attention Is All You Need

* Authors: [[Ashish Vaswani]], [[Noam Shazeer]], [[Niki Parmar]], [[Jakob Uszkoreit]], [[Llion Jones]], [[Aidan N. Gomez]], [[Lukasz Kaiser]], [[Illia Polosukhin]]

DOI: 10.48550/ARXIV.1706.03762

First Impressions

comment:: A classic sequence-to-sequence model built solely on attention mechanisms.

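Motivation

At the time, sequence (time-series) models were usually built on RNNs.
Drawbacks of RNNs: time steps are computed one after another, which makes parallelization difficult, and the memory overhead is high.
Earlier attention mechanisms were mostly used to pass information from the encoder to the decoder.

A model built purely on attention allows a much higher degree of parallelism.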

Method

Mask Attention

Pasted image 20221019151644 Pasted image 20221019151825

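The mask in the figure prevents position t from seeing anything after t. In practice, the attention scores (the scaled query-key dot products) for all keys after position t are replaced with a very large negative number, so their weights become 0 after the softmax and the values after position t are masked out.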
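The masking step described above can be sketched in a few lines of dependency-free Python. This is only a minimal illustration, not the paper's implementation: the helper names (`softmax`, `causal_attention`), the constant `NEG_INF = -1e9`, and the toy input `X` are all assumptions made for the example.

```python
import math

NEG_INF = -1e9  # assumed stand-in for "a very large negative number"


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Scaled dot-product attention where position t attends only to positions <= t.

    Q, K, V are lists of vectors, one vector per time step.
    Returns (outputs, weights).
    """
    d_k = len(K[0])
    weights = []
    for t, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Mask the scores (not the keys themselves) of every position after t.
        masked = [s if j <= t else NEG_INF for j, s in enumerate(scores)]
        weights.append(softmax(masked))
    outputs = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights


# Toy input: 3 time steps of dimension 2 (values chosen arbitrarily).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_attention(X, X, X)
```

In this toy run the first row of `attn` puts all of its weight on position 0, so `out[0]` is simply `V[0]`; later rows spread their weight only over earlier and current positions.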

multi-head self-attention mechanism

Pasted image 20221019161953 Pasted image 20221019162014

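Each head uses its own learned projection matrices W to project Q, K, and V into a different subspace, each with its own notion of distance (similarity).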
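As a rough sketch of the per-head projections, the example below gives each head its own `(W_q, W_k, W_v)` triple, runs scaled dot-product attention in each projected subspace, then concatenates the head outputs and applies an output projection `W_o` as in the paper. The random weights, the sizes (`d_model = 4`, two heads), and all helper names are assumptions for illustration; real weights are learned.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention without a mask."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def random_matrix(rows, cols, rng):
    """Hypothetical random initialisation standing in for learned weights."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)]
            for _ in range(rows)]


def multi_head_self_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head outputs along the feature dimension.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_o)


rng = random.Random(0)
d_model, n_heads = 4, 2
d_head = d_model // n_heads
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_o = random_matrix(d_model, d_model, rng)
X = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(3)]
Y = multi_head_self_attention(X, heads, W_o)
```

Because every head has its own projections, each one can score similarity between positions differently, while the output `Y` keeps the same shape as the input `X`.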

position encoding

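Attention by itself carries no order information: shuffling the input sequence only permutes the entries of the affinity matrix and does not change how any of them are computed.
Each position is therefore encoded with sine and cosine functions of different periods, and the resulting positional encoding is added directly to the target vector (the input embedding).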
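A minimal sketch of the sinusoidal encoding, assuming the formulation from the paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name, the toy sizes, and the all-zero "embeddings" are assumptions made for illustration.

```python
import math


def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency: 10000^(-2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


seq_len, d_model = 3, 4
pe = positional_encoding(seq_len, d_model)

# Toy all-zero embeddings, used only to show the element-wise addition.
embeddings = [[0.0] * d_model for _ in range(seq_len)]
encoded = [[x + p for x, p in zip(x_row, p_row)]
           for x_row, p_row in zip(embeddings, pe)]
```

Position 0 always encodes to alternating 0 and 1 (sin 0 and cos 0), and every later position gets a different vector, which gives the otherwise order-blind attention a way to tell positions apart.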
Pasted image 20221019165338

Overall Architecture

Pasted image 20221019165448
