大语言模型生成模型的源码结构复习-526互联

modeling_gpt2.py:1099

        if labels is not None:
            # move labels to correct device to enable model parallelism
            labels = labels.to(lm_logits.device)
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

这个是静电的生成模型. 看lm_logits[:,:-1,:], labels[:,1:]这个就是输入0,1,2,3,4
然后输出的特征shape 是 1, 5, 30000. 字典大小3w
labels也是0,1,2,3,4
所以网络输出的含义是: lm_logits[:,:-1,:]=1,2,3,4 所以跟labels[:,1:]=1,2,3,4 算交叉熵即可.
总结. 给gpt2或者所有的causallm模型喂入序列1,2,3,4那么他输出的shape大小是1,4,3w. 其中4的维度上第一个值表示的是1的后续token预测值在3w上的概率分布. 第二个值表示2的后续token在....,
用\(*\)表示估计值的语言来说就是 2,3,4,5.
所以你会经常看到代码取lm_logits[:,-1,:]再softmax就预测了下一个token是什么. 也就是从1,2,3,4预测得到了5.

参考代码:
D:\Users\admin\miniconda3\Lib\site-packages\transformers\generation\utils.py:2531行

next_token_logits = outputs.logits[:, -1, :]

            next_tokens = torch.argmax(next_tokens_scores, dim=-1)

            # finished sentences should have their next token be a padding token
            if eos_token_id is not None:
                if pad_token_id is None:
                    raise ValueError("If `eos_token_id` is defined, make sure that `pad_token_id` is defined.")
                next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)

            # update generated ids, model inputs, and length for next step
            input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)

从这里面代码我们也很清楚看到每生成一个token,他就torch.cat到之前的input_ids里面.再循环生成.

            if stopping_criteria(input_ids, scores):
                this_peer_finished = True

一直到这个结尾判定成功就停止生成了.

以上就是gpt2的生成代码的全部分析了. 需要一定掌握.非常非常重要.