Paper: Informer

Published: 2023-09-08 20:49:45 · Author: Hecto

Informer: a time-series forecasting model

1 Introduction

Three significant limitations of the vanilla Transformer on LSTF:

LSTF: Long Sequence Time-series Forecasting

  1. The quadratic computation of self-attention. The atom operation of the self-attention mechanism, the canonical dot-product, causes the time complexity and memory usage per layer to be \(O(L^2)\).
  2. The memory bottleneck in stacking layers for long inputs. The stack of \(J\) encoder/decoder layers makes the total memory usage \(O(J \cdot L^2)\), which limits the model's scalability in receiving long sequence inputs.
  3. The speed plunge in predicting long outputs. Dynamic decoding in the vanilla Transformer makes step-by-step inference as slow as an RNN-based model (Fig. 1b).

prior works

  1. Vanilla Transformer (2017)
  2. The Sparse Transformer (2019)
  3. LogSparse Transformer (2019)
  4. Longformer (2020)
  5. Reformer (2019)
  6. Linformer (2020)
  7. Transformer-XL (2019)
  8. Compressive Transformer (2019)

2 Preliminary

3 Methodology

Efficient Self-attention Mechanism

The i-th query's attention on all the keys is defined as a kernel smoother in a probability form:

\(\mathcal{A}(q_i, K, V) = \mathbb{E}_{p(k_j|q_i)}[v_j]\)
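Here \(p(k_j|q_i)\) is the canonical dot-product attention weight written as a probability distribution over the keys:

\[
p(k_j \mid q_i) = \frac{k(q_i, k_j)}{\sum_{l} k(q_i, k_l)}, \qquad k(q_i, k_j) = \exp\!\left(\frac{q_i k_j^{\top}}{\sqrt{d}}\right)
\]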

  • The Sparse Transformer

    “self-attention probability has potential sparsity”

Query Sparsity Measurement
  • a few dot-product pairs contribute the major attention;

  • the others generate trivial attention.

Distinguishing the “important” queries:

  • measure the "likeness" between the query's attention distribution \(p(k_j|q_i)\) and the uniform distribution \(q(k_j|q_i) = 1/L_K\) via the Kullback–Leibler divergence
  • dropping the constant gives the i-th query's sparsity measurement \(M(q_i, K)\) (formula below), composed of
    • a Log-Sum-Exp (LSE) term of \(q_i\) on all the keys, minus
    • the arithmetic mean of the scaled dot-products
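The resulting measurement from the paper (first term: LSE; second term: arithmetic mean; a larger \(M\) means a more "peaked", hence more informative, attention distribution):

\[
M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{\frac{q_i k_j^{\top}}{\sqrt{d}}} \;-\; \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}
\]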
ProbSparse Self-attention
  • ProbSparse self-attention

    allows each key to attend only to the \(u\) dominant queries, where \(u = c \cdot \ln L_Q\) and the dominant queries are the top-\(u\) under the sparsity measurement \(M\) (a sketch follows below)
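A minimal NumPy sketch of the idea, not the paper's implementation (which approximates \(M\) with a sampled \(\bar{M}\) over \(U = L_K \ln L_Q\) randomly chosen dot-product pairs to keep the cost at \(O(L \ln L)\)): score every query with \(M\), keep the top-\(u\), and let the remaining "lazy" queries output the mean of \(V\), as if their attention were uniform. Function and variable names are illustrative.

```python
import numpy as np

def probsparse_attention(Q, K, V, c=5):
    """Simplified ProbSparse self-attention sketch (exact M, no key sampling).

    Q: (L_Q, d), K: (L_K, d), V: (L_K, d_v)
    """
    L_Q, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                 # scaled dot-products, (L_Q, L_K)

    # Sparsity measurement M(q_i, K): log-sum-exp minus arithmetic mean.
    m = scores.max(axis=1, keepdims=True)
    lse = (m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))).squeeze(1)
    M = lse - scores.mean(axis=1)

    # Keep only the u dominant queries, u = c * ln(L_Q).
    u = max(1, min(L_Q, int(np.ceil(c * np.log(L_Q)))))
    top = np.argsort(M)[-u:]

    # Lazy queries fall back to the mean of V (uniform attention);
    # dominant queries get the full softmax attention.
    out = np.repeat(V.mean(axis=0, keepdims=True), L_Q, axis=0)
    s = scores[top]
    attn = np.exp(s - s.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    out[top] = attn @ V
    return out
```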

Encoder

  • extract the robust long-range dependency of the long sequential inputs
Self-attention Distilling
  • distilling operation between attention blocks (inspired by dilated convolution); from layer \(j\) to layer \(j+1\):

    1. Attention Block (ProbSparse self-attention)
    2. Conv1d( ): 1-D convolution (kernel width 3) along the time dimension
    3. ELU( ): activation function
    4. MaxPool: stride 2, halving the sequence length

    Halving the length layer by layer reduces the total memory usage to \(O((2-\epsilon) L \log L)\); see the sketch after this list.
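A minimal PyTorch-style sketch of one distilling step, assuming inputs shaped (batch, seq_len, d_model); the padding choices here are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SelfAttentionDistilling(nn.Module):
    """Conv1d -> ELU -> MaxPool distilling step: halves the sequence length."""

    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); Conv1d expects (batch, channels, seq_len)
        x = self.conv(x.transpose(1, 2))
        x = self.act(x)
        x = self.pool(x)            # seq_len -> seq_len // 2
        return x.transpose(1, 2)    # back to (batch, seq_len // 2, d_model)

# Example: a length-96 attention-block output is distilled to length 48.
x = torch.randn(8, 96, 512)
print(SelfAttentionDistilling(512)(x).shape)  # torch.Size([8, 48, 512])
```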

Decoder

  • a stack of two identical decoder layers, each with masked ProbSparse self-attention followed by multi-head cross-attention over the encoder output
Generative Inference
  1. sample an \(L_{token}\)-long sequence from the known input as the start token
  2. e.g., when predicting 168 points (a 7-day temperature forecast), take the known 5 days before the target sequence as the “start token”
  3. concatenate the start token with zero placeholders for the target positions (they keep their timestamp embeddings) and feed this to the generative-style inference decoder
  4. one forward procedure predicts all the outputs, instead of step-by-step dynamic decoding (see the sketch after this list)
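A minimal sketch of assembling the decoder input under this scheme; `build_decoder_input`, `L_token`, and `L_pred` are illustrative names, and the value/timestamp embedding steps are omitted.

```python
import torch

def build_decoder_input(enc_input: torch.Tensor, L_token: int, L_pred: int) -> torch.Tensor:
    """Concat a start token (last L_token known steps) with zero placeholders.

    enc_input: (batch, L_enc, d) known history fed to the encoder.
    Returns:   (batch, L_token + L_pred, d) decoder input; the zero slice is
               filled in by a single forward pass instead of step-by-step decoding.
    """
    start_token = enc_input[:, -L_token:, :]                 # known values before the target
    placeholder = torch.zeros(enc_input.size(0), L_pred,     # target positions, values set to 0
                              enc_input.size(-1),
                              dtype=enc_input.dtype,
                              device=enc_input.device)
    return torch.cat([start_token, placeholder], dim=1)

# Example: 96 known steps, 48-step start token, 24-step prediction horizon.
dec_in = build_decoder_input(torch.randn(8, 96, 7), L_token=48, L_pred=24)
print(dec_in.shape)  # torch.Size([8, 72, 7])
```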
Loss function
  1. MSE loss function

4 Experiment

Datasets

Four datasets: 2 collected real-world datasets for LSTF and 2 public benchmark datasets.

ETT (Electricity Transformer Temperature)

ECL (Electricity Consuming Load)

Weather

Experimental Details

Baselines:
  • ARIMA (2014)
  • Prophet (2018)
  • LSTMa (2015)
  • LSTnet (2018)
  • DeepAR (2017)

Self-attention variants compared:

  • the canonical self-attention variant
  • Reformer (2019)
  • LogSparse self-attention (2019)
Metrics
  • MSE
  • MAE
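For reference, the standard definitions of the two metrics on each prediction window:

\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
\]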
Platform:
  • a single Nvidia V100 32GB GPU

Results and Analysis

Parameter Sensitivity

Ablation Study

Computation Efficiency

5 Conclusion