BERT (1): Basics

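On October 11, 2018, Google released the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". It reported state-of-the-art results on 11 NLP tasks and was widely praised by the natural language processing community. BERT, the model introduced in that paper, stands for Bidirectional Encoder Representations from Transformers.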

The BERT model

BERT uses only the encoder module of the Transformer and is an autoencoding language model.

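In the paper, the authors stack 12 and 24 Transformer encoder layers to build two BERT models. Writing L for the number of layers (Transformer encoder blocks), H for the hidden size, and A for the number of self-attention heads, they are:

BERT-Base: L=12, H=768, A=12

BERT-Large: L=24, H=1024, A=16

In all cases the feed-forward/filter size (the feed-forward layer inside each Transformer encoder block) is set to 4H, i.e., 3072 for H=768 and 4096 for H=1024.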
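The size relationships above can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the BERT repository: the `BertSize` class and its method names are made up for illustration, the head counts A=12 and A=16 are the values from the paper's two configurations, and the per-block parameter count assumes the usual encoder block layout (four attention projections, two feed-forward layers, two LayerNorms).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertSize:
    """Size hyperparameters in the L / H / A notation (hypothetical helper)."""
    num_layers: int   # L: number of Transformer encoder blocks
    hidden_size: int  # H: hidden dimension
    num_heads: int    # A: number of self-attention heads

    @property
    def feed_forward_size(self) -> int:
        # The feed-forward (filter) size is fixed at 4H.
        return 4 * self.hidden_size

    @property
    def head_size(self) -> int:
        # Each attention head works on an H / A slice of the hidden vector.
        return self.hidden_size // self.num_heads

    def encoder_block_params(self) -> int:
        """Weights and biases in one encoder block (assumed standard layout)."""
        h, ff = self.hidden_size, self.feed_forward_size
        attention = 4 * (h * h + h)                  # query, key, value, output
        feed_forward = (h * ff + ff) + (ff * h + h)  # H -> 4H -> H
        layer_norms = 2 * (2 * h)                    # two LayerNorms, scale + shift
        return attention + feed_forward + layer_norms


BERT_BASE = BertSize(num_layers=12, hidden_size=768, num_heads=12)
BERT_LARGE = BertSize(num_layers=24, hidden_size=1024, num_heads=16)

for name, size in [("BERT-Base", BERT_BASE), ("BERT-Large", BERT_LARGE)]:
    print(name, size.feed_forward_size, size.head_size, size.encoder_block_params())
```

Both configurations end up with 64-dimensional attention heads; only the number of heads and layers grows.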

Network structure: [figure]

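The model input consists of:

Token embedding: the initial vector for each character/word in the text. It can be randomly initialized or initialized from word2vec. (For ease of description, and to stay consistent with the current Chinese release of BERT, the rest of this post treats character vectors as the input.)

Segment embedding: a learned vector that tells the model which sentence a token belongs to. Every token in the same sentence shares the same segment id, so for a sentence pair the ids look like 0 0 0 1 1 1 (0 for the first sentence, 1 for the second).

Position embedding: the same character/word carries different meaning at different positions in the text (compare "I love you" with "you love me"), so BERT adds a distinct vector for each position to tell them apart. These vectors are learned during training.

The three embeddings are summed to form the input.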
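To make the summation concrete, here is a small pure-Python sketch. It is only an illustration under assumptions: the table sizes, the toy ids, and the helper names `make_table` and `embed` are invented here, and the tables are random rather than trained. As far as I know, the official implementation also applies layer normalization and dropout after the sum, which this sketch leaves out.

```python
import random


def make_table(rows, dim, rng):
    """A toy embedding table: `rows` vectors of length `dim`."""
    return [[rng.uniform(-0.02, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed(input_ids, segment_ids, token_table, segment_table, position_table):
    """Return one summed vector per position."""
    output = []
    for position, (token_id, segment_id) in enumerate(zip(input_ids, segment_ids)):
        tok = token_table[token_id]        # looked up by token id
        seg = segment_table[segment_id]    # looked up by sentence id (0 or 1)
        pos = position_table[position]     # looked up by position in the sequence
        output.append([t + s + p for t, s, p in zip(tok, seg, pos)])
    return output


rng = random.Random(0)
HIDDEN = 4  # toy hidden size; the real models use 768 or 1024
token_table = make_table(rows=10, dim=HIDDEN, rng=rng)    # toy vocabulary of 10 ids
segment_table = make_table(rows=2, dim=HIDDEN, rng=rng)   # first / second sentence
position_table = make_table(rows=8, dim=HIDDEN, rng=rng)  # up to 8 positions

input_ids = [1, 5, 6, 2, 7, 2]      # hypothetical token ids
segment_ids = [0, 0, 0, 0, 1, 1]
vectors = embed(input_ids, segment_ids, token_table, segment_table, position_table)
print(len(vectors), len(vectors[0]))
```

Note that token id 2 appears twice (positions 3 and 5) but gets two different input vectors, because its segment and position embeddings differ.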

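The model output is a vector for each input character that incorporates the semantic information of the whole text.

The code below is taken from the official BERT source code and shows roughly how raw data is converted into the model's input format. As the source shows, BERT accepts two input formats: a sequence pair and a single sequence (cases (a) and (b) in the comments).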

def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):
    """Converts a single `InputExample` into a single `InputFeatures`."""

    if isinstance(example, PaddingInputExample):
        return InputFeatures(
            input_ids=[0] * max_seq_length,
            input_mask=[0] * max_seq_length,
            segment_ids=[0] * max_seq_length,
            label_id=0,
            is_real_example=False)

    label_map = {}
    for (i, label) in enumerate(label_list):
        label_map[label] = i

    tokens_a = tokenizer.tokenize(example.text_a)
    tokens_b = None
    if example.text_b:
        tokens_b = tokenizer.tokenize(example.text_b)

    if tokens_b:
        # Modifies `tokens_a` and `tokens_b` in place so that the total
        # length is less than the specified length.
        # Account for [CLS], [SEP], [SEP] with "- 3"
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # Account for [CLS] and [SEP] with "- 2"
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[0:(max_seq_length - 2)]

    # The convention in BERT is:
    # (a) For sequence pairs:
    #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
    #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
    # (b) For single sequences:
    #  tokens:   [CLS] the dog is hairy . [SEP]
    #  type_ids: 0     0   0   0  0     0 0
    #
    # Where "type_ids" are used to indicate whether this is the first
    # sequence or the second sequence. The embedding vectors for `type=0` and
    # `type=1` were learned during pre-training and are added to the wordpiece
    # embedding vector (and position vector). This is not *strictly* necessary
    # since the [SEP] token unambiguously separates the sequences, but it makes
    # it easier for the model to learn the concept of sequences.
    #
    # For classification tasks, the first vector (corresponding to [CLS]) is
    # used as the "sentence vector". Note that this only makes sense because
    # the entire model is fine-tuned.
    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)

    if tokens_b:
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    label_id = label_map[example.label]

    feature = InputFeatures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_id=label_id,
        is_real_example=True)
    return feature
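To see the two input formats without a real tokenizer, the core of the function can be replayed on pre-tokenized lists. This is a simplified sketch rather than the official code: `truncate_pair` and `build_inputs` are names chosen here, the tokens are kept as strings instead of being converted to ids, and the "[PAD]" string stands in for the zero padding of `input_ids`.

```python
def truncate_pair(tokens_a, tokens_b, max_length):
    """Trim the longer list one token at a time until the pair fits."""
    while len(tokens_a) + len(tokens_b) > max_length:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()


def build_inputs(tokens_a, tokens_b, max_seq_length):
    """Return (tokens, segment_ids, input_mask) for one or two sequences."""
    tokens_a = list(tokens_a)
    tokens_b = list(tokens_b) if tokens_b else None
    if tokens_b:
        truncate_pair(tokens_a, tokens_b, max_seq_length - 3)  # [CLS], [SEP], [SEP]
    else:
        tokens_a = tokens_a[:max_seq_length - 2]               # [CLS], [SEP]

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)

    # 1 marks real tokens, 0 marks padding.
    input_mask = [1] * len(tokens)
    padding = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * padding
    segment_ids += [0] * padding
    input_mask += [0] * padding
    return tokens, segment_ids, input_mask


pair = build_inputs("is this jack ##son ##ville ?".split(),
                    "no it is not .".split(), max_seq_length=16)
single = build_inputs("the dog is hairy .".split(), None, max_seq_length=8)
print(pair[1])    # segment ids for the sequence pair
print(single[1])  # segment ids for the single sequence
```

For the pair, the segment ids switch from 0 to 1 right after the first [SEP]; for the single sequence they stay 0 throughout.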

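The code below, also from the official BERT source code, shows roughly how a fully connected layer is added on top of the BERT output for a downstream task. It also makes clear that BERT's output is an embedding.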

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
    """Creates a classification model."""
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings)

    # In the demo, we are doing a simple classification task on the entire
    # segment.
    #
    # If you want to use the token-level output, use model.get_sequence_output()
    # instead.
    output_layer = model.get_pooled_output()

    hidden_size = output_layer.shape[-1].value

    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))

    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())

    with tf.variable_scope("loss"):
        if is_training:
            # I.e., 0.1 dropout
            output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        probabilities = tf.nn.softmax(logits, axis=-1)
        log_probs = tf.nn.log_softmax(logits, axis=-1)

        one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

        per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        loss = tf.reduce_mean(per_example_loss)

        return (loss, per_example_loss, logits, probabilities)
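The arithmetic inside the "loss" scope can be followed by hand on a tiny example. The sketch below redoes it in plain Python for a single example, without TensorFlow and without dropout; the function names and the toy numbers are made up, and the weight layout [num_labels][hidden_size] is chosen to mirror the `transpose_b=True` matmul.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled_output, output_weights, output_bias, label):
    """Linear layer + softmax + cross-entropy for one example."""
    logits = [
        sum(w * x for w, x in zip(row, pooled_output)) + b
        for row, b in zip(output_weights, output_bias)
    ]
    probabilities = softmax(logits)
    # Cross-entropy with a one-hot label reduces to -log p[label].
    loss = -math.log(probabilities[label])
    return loss, logits, probabilities


pooled = [0.5, -1.0, 2.0]                      # toy "sentence vector", hidden_size = 3
weights = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]  # num_labels = 2
bias = [0.0, 0.0]
loss, logits, probs = classify(pooled, weights, bias, label=0)
print(logits, probs, loss)
```

In the real model, `pooled` would be the output of `model.get_pooled_output()`, and the loss would be averaged over the batch.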

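BERT pre-training

The authors designed two tasks to pre-train the model.

The goal of pre-training is to build a language model. For the four-character sentence 我爱吃饭 ("I love eating"), the chain rule gives P(我爱吃饭) = P(我|爱吃饭) P(爱|吃饭) P(吃|饭) P(饭).

BERT uses a bidirectional Transformer. Why bidirectional? Because when the pre-trained model is applied to downstream tasks, it needs the context to the right of a word as well as the context to its left.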

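MLM: Masked Language Model

Randomly mask some of the input tokens, then predict the masked tokens from their context.

In actual training, 15% of the tokens in each sequence are randomly selected for masking, so only 15% of the tokens are predicted each time, rather than every word as in word2vec's CBOW.

Of the selected tokens, 80% are replaced with [MASK], 10% are left unchanged, and 10% are replaced with a randomly chosen token.

This task resembles the cloze (fill-in-the-blank) exercises used in human language learning.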
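The masking procedure described above can be sketched in a few lines. This is a simplified illustration, not the official data pipeline: `mask_tokens`, the toy vocabulary, and the choice to skip [CLS] and [SEP] and to predict at least one token are assumptions made for this example.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted_tokens, labels); labels maps position -> original token."""
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL]
    rng.shuffle(candidates)
    num_to_predict = max(1, int(round(len(tokens) * mask_prob)))
    chosen = sorted(candidates[:num_to_predict])

    corrupted = list(tokens)
    labels = {}
    for i in chosen:
        labels[i] = tokens[i]  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            pass                              # 10%: keep the original token
        else:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
    return corrupted, labels


vocab = ["the", "dog", "is", "hairy", "cat", "runs", "fast", "."]
tokens = ["[CLS]"] + ["the", "dog", "is", "hairy", "."] * 4 + ["[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab, random.Random(0))
print(corrupted)
print(labels)
```

Only the positions recorded in `labels` contribute to the MLM loss; every other position is passed through unchanged and is not predicted.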

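NSP: Next Sentence Prediction

Predict whether two sentences are consecutive.

The samples are built as follows:

1. Take two consecutive sentences from the training corpus as a positive sample.

2. Take one sentence each, at random, from two different documents as a negative sample.

Drawback: topic prediction and coherence prediction are merged into a single task.

This task resembles the paragraph-reordering exercises used in human language learning.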
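The sampling scheme can be sketched as follows. This is a toy illustration under assumptions: `make_nsp_examples` and the sample documents are invented here, and the 50/50 split between positive and negative pairs follows the paper rather than anything stated in the text above. The sketch needs at least two documents so that a negative can always be drawn from a different one.

```python
import random


def make_nsp_examples(documents, rng):
    """Build (sentence_a, sentence_b, is_next) triples.

    `documents` is a list of documents, each a list of sentences.
    """
    examples = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive: the sentence that actually follows in the same document.
                examples.append((doc[i], doc[i + 1], True))
            else:
                # Negative: a random sentence from a different document.
                other_indexes = [j for j in range(len(documents)) if j != doc_index]
                other = documents[rng.choice(other_indexes)]
                examples.append((doc[i], rng.choice(other), False))
    return examples


documents = [
    ["a1", "a2", "a3"],
    ["b1", "b2"],
    ["c1", "c2", "c3", "c4"],
]
examples = make_nsp_examples(documents, random.Random(0))
for sentence_a, sentence_b, is_next in examples:
    print(sentence_a, sentence_b, is_next)
```

Because negatives always come from another document, a model can often separate them by topic alone, which is the drawback noted above.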

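BERT is trained jointly on the Masked LM task and the NSP task, so that the vector it outputs for each character/word describes the overall information of the input text as fully and accurately as possible, giving later fine-tuning tasks better initial parameter values.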

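Limitations

1. In MLM training, BERT masks several tokens at once and treats them as independent of one another, but sometimes they are not. For example, if 我爱吃饭 ("I love eating") becomes 我爱 [MASK] [MASK], the two masked characters 吃 and 饭 are clearly related.

2. The special [MASK] token appears during BERT's pre-training but never during downstream fine-tuning, which creates a mismatch between the pre-training and fine-tuning stages.

Both problems are mitigated by training on a massive corpus, however, and have little impact on the model's overall performance.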

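Downstream-task fine-tuning

There are currently two main strategies for applying pre-trained language models to NLP tasks.

One is the feature-based approach, as in ELMo; the other is the fine-tuning approach, as in OpenAI GPT.

Each has its own strengths and weaknesses. BERT largely combines the strengths of both, which is why it achieves the best results on so many downstream tasks.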

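Multi-label classification
For example, given the input "a size-L cotton-padded jacket", the model outputs two labels: size: L, and type: winter wear.
When BERT is used for multi-label classification, its input is the same as for ordinary single-label classification. After obtaining the embedding (that is, the embedding from BERT's output layer),
attach one fully connected layer (also called a projection layer) per label, follow each with its own softmax classification layer, and finally sum all the losses.
This amounts to sharing the feature-extraction parameters across n classifiers to obtain a shared representation (its dimension can depend on the task; since several labels are predicted from it, the dimension can be made somewhat larger), on top of which the multi-label classification is performed.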
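A minimal sketch of this setup, in plain Python: one shared vector feeds several independent projection heads, and the per-head cross-entropy losses are summed. The head names, the class lists (S/M/L, winter/summer), and all numbers are hypothetical, and the shared vector merely stands in for the BERT output embedding.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_loss(shared, weights, bias, label):
    """One projection layer + softmax + cross-entropy on the shared vector."""
    logits = [sum(w * x for w, x in zip(row, shared)) + b
              for row, b in zip(weights, bias)]
    return -math.log(softmax(logits)[label])


def multi_head_loss(shared, heads, labels):
    """Sum the losses of all heads; `heads` maps head name -> (weights, bias)."""
    losses = {name: head_loss(shared, w, b, labels[name])
              for name, (w, b) in heads.items()}
    return sum(losses.values()), losses


shared = [0.2, -0.4, 1.0]  # stands in for the BERT output embedding
heads = {
    # size head: classes S / M / L
    "size": ([[0.1, 0.0, 0.5], [0.3, 0.2, -0.1], [0.0, 0.1, 0.1]], [0.0, 0.0, 0.0]),
    # type head: classes winter / summer
    "type": ([[0.2, 0.1, 0.4], [-0.2, 0.3, 0.0]], [0.0, 0.0]),
}
labels = {"size": 2, "type": 0}  # size: L, type: winter
total, per_head = multi_head_loss(shared, heads, labels)
print(per_head, total)
```

Since the total loss is a plain sum, gradients from every head flow back into the same shared encoder, which is what makes the feature-extraction parameters shared.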