Paper: Informer

Published: 2023-09-08 20:49:45 · Author: Hecto

Informer: a time-series forecasting model

1 Introduction

Three significant limitations of the vanilla Transformer on LSTF:

LSTF: Long Sequence Time-series Forecasting

  1. The quadratic computation of self-attention. The atom operation of the self-attention mechanism, the canonical dot-product, causes the time complexity and memory usage per layer to be \(O(L^2)\).
  2. The memory bottleneck in stacking layers for long inputs. The stack of \(J\) encoder/decoder layers makes the total memory usage \(O(J \cdot L^2)\), which limits the model's scalability in receiving long sequence inputs.
  3. The speed plunge in predicting long outputs. Dynamic decoding in the vanilla Transformer makes step-by-step inference as slow as an RNN-based model (Fig. 1b).

prior works

  1. Vanilla Transformer (2017)
  2. The Sparse Transformer (2019)
  3. LogSparse Transformer (2019)
  4. Longformer (2020)
  5. Reformer (2019)
  6. Linformer (2020)
  7. Transformer-XL (2019)
  8. Compressive Transformer (2019)

2 Preliminary

3 Methodology

Efficient Self-attention Mechanism

The i-th query's attention on all the keys is defined as a kernel smoother in a probability form:

\(\mathcal{A}(q_i, K, V) = \mathbb{E}_{p(k_j|q_i)}[v_j]\)
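Here \(p(k_j|q_i)\) is the canonical dot-product attention weight written as a probability distribution over the keys:

\[
p(k_j \mid q_i) = \frac{k(q_i, k_j)}{\sum_{l} k(q_i, k_l)}, \qquad k(q_i, k_j) = \exp\!\left(\frac{q_i k_j^{\top}}{\sqrt{d}}\right)
\]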

  • The Sparse Transformer

    “self-attention probability has potential sparsity”

Query Sparsity Measurement
  • a few dot-product pairs contribute the major attention;

  • the others generate trivial attention.

Distinguishing the “important” queries:

  • measure the "likeness" between the query's attention distribution \(p(k_j|q_i)\) and the uniform distribution \(q(k_j|q_i) = 1/L_K\) via the Kullback–Leibler divergence
  • dropping the constant gives the i-th query's sparsity measurement \(M(q_i, K)\) (formula below), composed of
    • a Log-Sum-Exp (LSE) term of \(q_i\) on all the keys, minus
    • the arithmetic mean of the scaled dot-products
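The resulting measurement from the paper (first term: LSE; second term: arithmetic mean; a larger \(M\) means a more "peaked", hence more informative, attention distribution):

\[
M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{\frac{q_i k_j^{\top}}{\sqrt{d}}} \;-\; \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}
\]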
ProbSparse Self-attention
  • ProbSparse self-attention

    allows each key to attend only to the \(u\) dominant queries, where \(u = c \cdot \ln L_Q\) and the dominant queries are the top-\(u\) under the sparsity measurement \(M\) (a sketch follows below)
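A minimal NumPy sketch of the idea, not the paper's implementation (which approximates \(M\) with a sampled \(\bar{M}\) over \(U = L_K \ln L_Q\) randomly chosen dot-product pairs to keep the cost at \(O(L \ln L)\)): score every query with \(M\), keep the top-\(u\), and let the remaining "lazy" queries output the mean of \(V\), as if their attention were uniform. Function and variable names are illustrative.

```python
import numpy as np

def probsparse_attention(Q, K, V, c=5):
    """Simplified ProbSparse self-attention sketch (exact M, no key sampling).

    Q: (L_Q, d), K: (L_K, d), V: (L_K, d_v)
    """
    L_Q, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                 # scaled dot-products, (L_Q, L_K)

    # Sparsity measurement M(q_i, K): log-sum-exp minus arithmetic mean.
    m = scores.max(axis=1, keepdims=True)
    lse = (m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))).squeeze(1)
    M = lse - scores.mean(axis=1)

    # Keep only the u dominant queries, u = c * ln(L_Q).
    u = max(1, min(L_Q, int(np.ceil(c * np.log(L_Q)))))
    top = np.argsort(M)[-u:]

    # Lazy queries fall back to the mean of V (uniform attention);
    # dominant queries get the full softmax attention.
    out = np.repeat(V.mean(axis=0, keepdims=True), L_Q, axis=0)
    s = scores[top]
    attn = np.exp(s - s.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    out[top] = attn @ V
    return out
```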

Encoder

  • extract the robust long-range dependency of the long sequential inputs
Self-attention Distilling
  • distilling operation between attention blocks (inspired by dilated convolution); from layer \(j\) to layer \(j+1\):

    1. Attention Block (ProbSparse self-attention)
    2. Conv1d( ): 1-D convolution (kernel width 3) along the time dimension
    3. ELU( ): activation function
    4. MaxPool: stride 2, halving the sequence length

    Halving the length layer by layer reduces the total memory usage to \(O((2-\epsilon) L \log L)\); see the sketch after this list.
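A minimal PyTorch-style sketch of one distilling step, assuming inputs shaped (batch, seq_len, d_model); the padding choices here are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SelfAttentionDistilling(nn.Module):
    """Conv1d -> ELU -> MaxPool distilling step: halves the sequence length."""

    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); Conv1d expects (batch, channels, seq_len)
        x = self.conv(x.transpose(1, 2))
        x = self.act(x)
        x = self.pool(x)            # seq_len -> seq_len // 2
        return x.transpose(1, 2)    # back to (batch, seq_len // 2, d_model)

# Example: a length-96 attention-block output is distilled to length 48.
x = torch.randn(8, 96, 512)
print(SelfAttentionDistilling(512)(x).shape)  # torch.Size([8, 48, 512])
```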

Decoder

  • a stack of two identical decoder layers, each with masked ProbSparse self-attention followed by multi-head cross-attention over the encoder output
Generative Inference
  1. sample an \(L_{token}\)-long sequence from the known input as the start token
  2. e.g., when predicting 168 points (a 7-day temperature forecast), take the known 5 days before the target sequence as the “start token”
  3. concatenate the start token with zero placeholders for the target positions (they keep their timestamp embeddings) and feed this to the generative-style inference decoder
  4. one forward procedure predicts all the outputs, instead of step-by-step dynamic decoding (see the sketch after this list)
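A minimal sketch of assembling the decoder input under this scheme; `build_decoder_input`, `L_token`, and `L_pred` are illustrative names, and the value/timestamp embedding steps are omitted.

```python
import torch

def build_decoder_input(enc_input: torch.Tensor, L_token: int, L_pred: int) -> torch.Tensor:
    """Concat a start token (last L_token known steps) with zero placeholders.

    enc_input: (batch, L_enc, d) known history fed to the encoder.
    Returns:   (batch, L_token + L_pred, d) decoder input; the zero slice is
               filled in by a single forward pass instead of step-by-step decoding.
    """
    start_token = enc_input[:, -L_token:, :]                 # known values before the target
    placeholder = torch.zeros(enc_input.size(0), L_pred,     # target positions, values set to 0
                              enc_input.size(-1),
                              dtype=enc_input.dtype,
                              device=enc_input.device)
    return torch.cat([start_token, placeholder], dim=1)

# Example: 96 known steps, 48-step start token, 24-step prediction horizon.
dec_in = build_decoder_input(torch.randn(8, 96, 7), L_token=48, L_pred=24)
print(dec_in.shape)  # torch.Size([8, 72, 7])
```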
Loss function
  1. MSE loss function

4 Experiment

Datasets

Four datasets: 2 collected real-world datasets for LSTF and 2 public benchmark datasets.

ETT (Electricity Transformer Temperature)

ECL (Electricity Consuming Load)

Weather

Experimental Details

Baselines:
  • ARIMA (2014)
  • Prophet (2018)
  • LSTMa (2015)
  • LSTnet (2018)
  • DeepAR (2017)

Self-attention variants compared:

  • the canonical self-attention variant
  • Reformer (2019)
  • LogSparse self-attention (2019)
Metrics
  • MSE
  • MAE
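For reference, the standard definitions of the two metrics on each prediction window:

\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
\]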
Platform:
  • a single Nvidia V100 32GB GPU

Results and Analysis

Parameter Sensitivity

Ablation Study

Computation Efficiency

5 Conclusion