How Attentive are Graph Attention Networks?-526互联

概
符号说明
GATv2
代码

Brody S., Alon U. and Yahav E. How attentive are graph attention networks? ICLR, 2022.

概

作者发现了 GAT 的 attention 并不能够抓住边的重要性, 于是提出了 GATv2.

符号说明

\(\mathcal{V} = \{1, \ldots, n\}\), node set;
\(\mathcal{E} \subset \mathcal{V} \times \mathcal{V}\), edge set;
\(\mathcal{G} = (\mathcal{V}, \mathcal{E})\), graph;
\(\mathcal{N}_i = \{j \in \mathcal{V}| (j, i) \in \mathcal{E}\}\);

GATv2

GAT 的流程如下 (一层):
- 计算边的关系:
  
  \[\tag{2} e(\bm{h}_i, \bm{h}_j) = \text{LeakyReLU}( \bm{a}^T \cdot [\bm{Wh}_i \| \bm{Wh}_j]), \]
  其中 \(\bm{a} \in \mathbb{R}^{2d'}, \bm{W} \in \mathbb{R}^{d' \times d}\).
- 计算边上的权重 (attention):
  
  \[\alpha_{ij} = \text{softmax}_j (e(\bm{h}_i, \bm{h}_j)) = \frac{ \exp(e(\bm{h}_i, \bm{h}_j)) }{ \sum_{j' \in \mathcal{N}_i \exp((e(\bm{h}_i, \bm{h}_{j'})))} }. \]
- 更新 (重加权):
  
  \[\bm{h}_i' = \sigma(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \cdot \bm{W} \bm{h}_j). \]
GAT 的主要问题出现在 (2), 其实我们可以把 \(\bm{a} = [\bm{a}_1 \|\bm{a}_2], \: \bm{a}_1, \bm{a}_2 \in \mathbb{R}^{d'}\), 于是我们有

\[e(\bm{h}_i, \bm{h}_j) = \text{LeakyReLU}(\bm{a}_1^T \bm{Wh}_i + \bm{a}_2^T \bm{Wh}_j]), \]
由于 LeakyReLU, Softmax 的单调性, \(\alpha_{ij}\) 的相对大小实际上取决于:

\[\bm{\alpha}_2^T \bm{Wh}_j, \]
这也就是意味着, 假设 \(v\) 使得:

\[\bm{\alpha}_2^T \bm{Wh}_v = \arg\max_j \bm{\alpha}_2^T \bm{Wh}_j, \]
则对于任意的 \(i\),

\[\alpha_{iv} = \arg \max_j \alpha_{ij}. \]
所以此时 attention 并不是一个和 \(i\) 紧密相关的指标.
所以本文的将 (2) 改进为:

\[e(\bm{h}_i, \bm{h}_j) = \bm{a}^T \text{LeakyReLU}(\bm{W} \cdot [\bm{h}_i \| \bm{h}_j]). \]
这里 \(\bm{W} = [\bm{W}_l \| \bm{W}_r] \in \mathbb{R}^{d' \times 2d}\), \(\bm{W}_l, \bm{W}_r\) 可以共享参数或者独立.
Q: 为什么不采用普通的 attention 机制呢?

\[e(\bm{h}_i, \bm{h}_j) = \text{LeakyReLU}((\bm{W}_q \bm{h}_i)^T \bm{W}_k \bm{h}_j]). \]