Chinese-Text-Classification-PyTorch

Published: 2023-10-08 08:24:10  Author: emanlee

Chinese-Text-Classification 

GitHub repository:

https://github.com/JackHCC/Chinese-Text-Classification-PyTorch 

Author: JackHCC

Source: https://www.jianshu.com/p/9438fd0fea8c


 

Chinese text classification, based on PyTorch, works out of the box.

  • Neural network models: TextCNN, TextRNN, FastText, TextRCNN, BiLSTM_Attention, DPCNN, Transformer

  • Pretrained models: BERT, ERNIE

Model descriptions and the data flow through each model: see the reference.

Data is fed into the model character by character. The pretrained embeddings are the Sogou News Word+Character 300d vectors (download here).

 

 

Model descriptions

Model         Description
TextCNN       Kim (2014), the classic CNN for text classification
TextRNN       BiLSTM
TextRNN_Att   BiLSTM + attention
TextRCNN      BiLSTM + pooling
FastText      bow + bigram + trigram; surprisingly good results
DPCNN         Deep pyramid CNN
Transformer   Relatively poor results

 

 

Pretrained models

Model         Description (Note)
bert          The original BERT
ERNIE         ERNIE
bert_CNN      BERT as the embedding layer, followed by a CNN with three kernel sizes (bert + CNN)
bert_RNN      BERT as the embedding layer, followed by an LSTM (bert + RNN)
bert_RCNN     BERT as the embedding layer; its output is concatenated with the LSTM output and passed through a max-pooling layer (bert + RCNN)
bert_DPCNN    BERT as the embedding layer, followed by a region-embedding layer built from three different convolutional feature extractors (its output can be viewed as an embedding); two equal-length convolutions then widen the receptive field for the subsequent feature extraction (enriching the embedding); after that, a residual block with 1/2 pooling is applied repeatedly, so the 1/2 pooling keeps raising the semantic level of each position while the number of feature maps stays fixed; the residual connections are introduced to counter vanishing and exploding gradients during training (bert + DPCNN)
 
 

Environment

python 3.7
pytorch 1.1
tqdm
sklearn
tensorboardX
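
A possible way to install the Python dependencies listed above (install PyTorch separately with the build that matches your CUDA setup; the versions are only what the list above states):

pip install tqdm scikit-learn tensorboardX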
 
 

Chinese dataset

I extracted 200,000 news headlines from THUCNews (already uploaded to GitHub); each headline is 20 to 30 characters long. There are 10 classes with 20,000 samples each. The data is fed into the model character by character.

Classes: finance, realty, stocks, education, science, society, politics, sports, game, entertainment.

Dataset split:

Split            Size
Training set     180,000
Validation set   10,000
Test set         10,000

Using your own dataset

  • Format your own Chinese dataset the same way as the bundled THUCNews data.
  • For the (non-BERT) neural network models:
    • If you work at the character level, just follow the format of the provided dataset.
    • If you work at the word level, segment the text in advance and separate words with spaces: python run.py --model TextCNN --word True
    • To use pretrained word vectors: the main function of utils.py can extract the pretrained vectors that match your vocabulary (see the sketch below).
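
For reference, a minimal sketch of what that extraction does. The file names ('sgns.sogou.char', 'embedding_sogou.npz') and the toy vocabulary are assumptions; the real logic and paths live in utils.py.

import numpy as np

# Toy vocabulary; in the repo this is built from the training set by utils.py.
vocab = {'我': 0, '们': 1}
emb_dim = 300
# Random init as a fallback for tokens missing from the pretrained file.
embeddings = np.random.rand(len(vocab), emb_dim).astype('float32')

with open('sgns.sogou.char', 'r', encoding='utf-8') as f:   # Sogou News Word+Character 300d (assumed file name)
    for line in f:
        parts = line.rstrip().split(' ')
        token, values = parts[0], parts[1:]
        if token in vocab and len(values) == emb_dim:
            embeddings[vocab[token]] = np.asarray(values, dtype='float32')

np.savez_compressed('embedding_sogou.npz', embeddings=embeddings)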

Results

Hardware: one 2080 Ti; training time: about 30 minutes.

Model         Acc      Note
TextCNN       91.22%   Kim (2014), the classic CNN for text classification
TextRNN       91.12%   BiLSTM
TextRNN_Att   90.90%   BiLSTM + attention
TextRCNN      91.54%   BiLSTM + pooling
FastText      92.23%   bow + bigram + trigram; surprisingly good results
DPCNN         91.25%   Deep pyramid CNN
Transformer   89.91%   Relatively poor results
bert          94.83%   Plain BERT
ERNIE         94.61%   So much for "crushing BERT on Chinese"
bert_CNN      94.44%   bert + CNN
bert_RNN      94.57%   bert + RNN
bert_RCNN     94.51%   bert + RCNN
bert_DPCNN    94.47%   bert + DPCNN

Plain BERT already works very well; using BERT as an embedding layer feeding the other models actually lowers accuracy. A comparison on longer texts will be tried later.

Pretrained language models

 

Put the BERT model in the bert_pretain directory and the ERNIE model in the ERNIE_pretrain directory; each directory contains three files:

  • pytorch_model.bin (model weights)
  • bert_config.json (configuration)
  • vocab.txt (vocabulary)

Download links for the pretrained models:

bert_Chinese: model https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz
vocab https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt

From here

Backup Baidu Netdisk link for the model: https://pan.baidu.com/s/1qSAD5gwClq7xlgzl_4W3Pw

ERNIE_Chinese: https://pan.baidu.com/s/1lEPdDN1-YQJmKEd_g9rLgw

From here

After unpacking, place the files in the corresponding directories as described above and make sure the file names match (a small check is sketched below).
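
A small sanity check, written only from the description above (the directory names follow the text; adjust them if your checkout differs):

import os

for d in ('bert_pretain', 'ERNIE_pretrain'):
    for name in ('pytorch_model.bin', 'bert_config.json', 'vocab.txt'):
        path = os.path.join(d, name)
        print(path, 'found' if os.path.isfile(path) else 'MISSING')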

 
 

Neural-network approach
# Train and test (training + validation + test):
# TextCNN
python run.py --model TextCNN

# TextRNN
python run.py --model TextRNN

# TextRNN_Att
python run.py --model TextRNN_Att

# TextRCNN
python run.py --model TextRCNN

# FastText; the embedding layer is randomly initialized
python run.py --model FastText --embedding random

# DPCNN
python run.py --model DPCNN

# Transformer
python run.py --model Transformer

 

Pretrained-model approach
Once the pretrained model is downloaded (three files: weights, config, vocabulary), you can run:
# Train and test with a pretrained model:
# bert
python pretrain_run.py --model bert

# bert + other heads
python pretrain_run.py --model bert_CNN

# ERNIE
python pretrain_run.py --model ERNIE
Prediction

Pretrained models:
python pretrain_predict.py

Neural network models:
python predict.py


Parameters
All models live under the models directory; the hyperparameters are defined in the same file as the model itself (an illustrative sketch of such a config class follows below).
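
For orientation, a rough sketch of what one of these per-model hyperparameter classes might look like. The field names and values are illustrative (some, such as the kernel sizes, 256 filters, dropout 0.5 and 20 epochs, match the TextCNN run log later in this post); the authoritative definitions are in the files under models/.

class Config(object):
    """Illustrative TextCNN-style hyperparameter container; field names are assumptions."""
    def __init__(self):
        self.model_name = 'TextCNN'
        self.num_classes = 10            # 10 THUCNews categories
        self.embed = 300                 # embedding dimension (Sogou News 300d vectors)
        self.filter_sizes = (2, 3, 4)    # kernel heights, cf. the Conv2d layers in the run log
        self.num_filters = 256           # channels per kernel size
        self.dropout = 0.5
        self.num_epochs = 20
        self.batch_size = 128            # assumed; adjust to your GPU memory
        self.learning_rate = 1e-3        # assumed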

======================

Usage notes:

pip install tensorboardX  -i https://pypi.douban.com/simple/

pip list

run.py

# Before (requires --model on the command line):
parser.add_argument('--model', type=str, required=True, default="TextCNN", help='choose a model: TextCNN, TextRNN, FastText, TextRCNN, TextRNN_Att, DPCNN, Transformer')
# After (so --model can be omitted and defaults to TextCNN):
parser.add_argument('--model', type=str, default='TextCNN', help='choose a model: TextCNN, TextRNN, FastText, TextRCNN, TextRNN_Att, DPCNN, Transformer')

Run output:

E:\Eprogramfiles\Anaconda3\python.exe E:/other/lightvit/Chinese-Text-Classification-PyTorch/run.py
Microsoft Visual C++ Redistributable is not installed, this may lead to the DLL load failure.
                 It can be downloaded at https://aka.ms/vs/16/release/vc_redist.x64.exe
Loading data...
0it [00:00, ?it/s]Vocab size: 4762
26612it [00:00, 83118.42it/s]
10000it [00:00, 92071.41it/s]
10000it [00:00, 66401.82it/s]
Time usage: 0:00:01
<bound method Module.parameters of Model(
  (embedding): Embedding(4762, 300)
  (convs): ModuleList(
    (0): Conv2d(1, 256, kernel_size=(2, 300), stride=(1, 1))
    (1): Conv2d(1, 256, kernel_size=(3, 300), stride=(1, 1))
    (2): Conv2d(1, 256, kernel_size=(4, 300), stride=(1, 1))
  )
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=768, out_features=10, bias=True)
)>
Epoch [1/20]
Iter:      0,  Train Loss:   2.3,  Train Acc: 12.50%,  Val Loss:   2.6,  Val Acc: 11.05%,  Time: 0:00:06 *
Iter:    100,  Train Loss:  0.73,  Train Acc: 73.44%,  Val Loss:  0.69,  Val Acc: 78.57%,  Time: 0:00:32 *
Iter:    200,  Train Loss:  0.68,  Train Acc: 79.69%,  Val Loss:  0.55,  Val Acc: 83.12%,  Time: 0:00:59 *
Epoch [2/20]
Iter:    300,  Train Loss:  0.41,  Train Acc: 88.28%,  Val Loss:   0.5,  Val Acc: 84.09%,  Time: 0:01:27 *
Iter:    400,  Train Loss:  0.55,  Train Acc: 83.59%,  Val Loss:  0.47,  Val Acc: 85.14%,  Time: 0:01:54 *
Epoch [3/20]
Iter:    500,  Train Loss:  0.45,  Train Acc: 83.59%,  Val Loss:  0.45,  Val Acc: 86.11%,  Time: 0:02:22 *
Iter:    600,  Train Loss:  0.43,  Train Acc: 88.28%,  Val Loss:  0.45,  Val Acc: 86.16%,  Time: 0:02:50
Epoch [4/20]
Iter:    700,  Train Loss:  0.19,  Train Acc: 93.75%,  Val Loss:  0.46,  Val Acc: 85.80%,  Time: 0:03:17
Iter:    800,  Train Loss:   0.2,  Train Acc: 94.53%,  Val Loss:  0.44,  Val Acc: 86.71%,  Time: 0:03:53 *
Epoch [5/20]
Iter:    900,  Train Loss:  0.17,  Train Acc: 96.09%,  Val Loss:  0.47,  Val Acc: 86.17%,  Time: 0:04:25
Iter:   1000,  Train Loss:  0.17,  Train Acc: 96.09%,  Val Loss:  0.49,  Val Acc: 85.89%,  Time: 0:04:54
Epoch [6/20]
Iter:   1100,  Train Loss:  0.12,  Train Acc: 97.66%,  Val Loss:  0.47,  Val Acc: 86.29%,  Time: 0:05:22
Iter:   1200,  Train Loss: 0.092,  Train Acc: 97.66%,  Val Loss:  0.47,  Val Acc: 86.80%,  Time: 0:05:52
Epoch [7/20]
Iter:   1300,  Train Loss: 0.075,  Train Acc: 98.44%,  Val Loss:  0.47,  Val Acc: 86.43%,  Time: 0:06:21
Iter:   1400,  Train Loss: 0.051,  Train Acc: 99.22%,  Val Loss:   0.5,  Val Acc: 86.38%,  Time: 0:06:49
Epoch [8/20]
Iter:   1500,  Train Loss: 0.065,  Train Acc: 97.66%,  Val Loss:   0.5,  Val Acc: 86.47%,  Time: 0:07:19
Iter:   1600,  Train Loss: 0.043,  Train Acc: 99.22%,  Val Loss:  0.54,  Val Acc: 86.24%,  Time: 0:07:47
Epoch [9/20]
Iter:   1700,  Train Loss: 0.039,  Train Acc: 99.22%,  Val Loss:  0.52,  Val Acc: 86.78%,  Time: 0:08:17
Iter:   1800,  Train Loss:  0.04,  Train Acc: 99.22%,  Val Loss:  0.56,  Val Acc: 86.25%,  Time: 0:08:45
No optimization for a long time, auto-stopping...
Test Loss:  0.42,  Test Acc: 87.18%
Precision, Recall and F1-Score...
               precision    recall  f1-score   support

      finance     0.8907    0.8560    0.8730      1000
       realty     0.9063    0.8900    0.8981      1000
       stocks     0.8373    0.7720    0.8033      1000
    education     0.9414    0.9310    0.9361      1000
      science     0.7986    0.8090    0.8038      1000
      society     0.8718    0.8980    0.8847      1000
     politics     0.8253    0.8740    0.8490      1000
       sports     0.8861    0.9410    0.9127      1000
         game     0.9092    0.8510    0.8791      1000
entertainment     0.8566    0.8960    0.8759      1000

     accuracy                         0.8718     10000
    macro avg     0.8723    0.8718    0.8716     10000
 weighted avg     0.8723    0.8718    0.8716     10000

Confusion Matrix...
[[856  19  55   1  17  13  17  16   1   5]
 [ 13 890  20   1  16  16  14   8   3  19]
 [ 63  28 772   8  53   4  48   6  10   8]
 [  0   3   3 931  10  15  10  11   3  14]
 [  7   8  37   6 809  20  44  10  36  23]
 [  4  17   3  20   9 898  27   5   5  12]
 [ 12   9  21   7  16  36 874   7   4  14]
 [  2   1   1   2   6   8   7 941   3  29]
 [  1   1   9   4  66   9   8  25 851  26]
 [  3   6   1   9  11  11  10  33  20 896]]
Time usage: 0:00:06

Process finished with exit code 0
=======================================================
{
"attention_probs_dropout_prob": 0.1,  # dropout applied to the attention weights after softmax
"hidden_act": "gelu",  # activation function
"hidden_dropout_prob": 0.1,  # dropout probability for the hidden layers
"hidden_size": 768,  # number of hidden units
"initializer_range": 0.02,  # initializer range
"intermediate_size": 3072,  # feed-forward (up-projection) dimension
"max_position_embeddings": 512,  # must exceed seq_length; used to build the position embeddings

"num_attention_heads": 12,  # attention heads per hidden layer
"num_hidden_layers": 12,  # number of hidden layers
"type_vocab_size": 2,  # number of segment_ids values, [0, 1]
"vocab_size": 30522  # vocabulary size
}

=======================================================

vocab_size: vocabulary size
hidden_size: number of hidden units, i.e. d_model, the width of the first layer (the one right after the input) and the last layer (the output) of a single Transformer block; H in the paper
num_hidden_layers: number of Transformer layers; L in the paper
num_attention_heads: number of heads in multi-head attention; A in the paper
intermediate_size: number of neurons in the encoder's "intermediate" layer (the feed-forward layer); 4H in the paper
hidden_act: hidden-layer activation function
hidden_dropout_prob: dropout rate of the hidden layers
attention_probs_dropout_prob: dropout rate of the attention weights
max_position_embeddings: maximum position-embedding length
type_vocab_size: vocabulary size of token_type_ids
initializer_range: stdev of the truncated_normal_initializer used for weight initialization
One point worth noting: type_vocab_size can be confusing at first; it simply corresponds to Segment A and Segment B in the next-sentence-prediction task. The downloaded bert_config.json documents this as well, and the default value should be 2.
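
To inspect these fields programmatically, a minimal sketch using the transformers library (the config path is an assumption; adapt it to wherever bert_config.json was unpacked):

from transformers import BertConfig

config = BertConfig.from_json_file('bert_pretain/bert_config.json')
print(config.hidden_size)           # 768
print(config.num_hidden_layers)     # 12
print(config.num_attention_heads)   # 12
print(config.type_vocab_size)       # 2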

=======================================================

Pretrained models

Because training from scratch requires enormous compute, Google provides pretrained checkpoints, currently covering three groups of models: English, Chinese, and multilingual:

    BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
    BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
    BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
    BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
    BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
    BERT-Base, Multilingual Uncased (Orig, not recommended; use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
    BERT-Base, Chinese: Simplified and Traditional Chinese, 12-layer, 768-hidden, 12-heads, 110M parameters

Uncased means the text was lowercased during preprocessing; Cased preserves the original case.

With so many versions, how do you choose?

If the task is English-only, pick an English model (the large model performs better but has more parameters, trains more slowly, and needs more RAM/GPU memory). If the task is Chinese-only, use the Chinese model. For any other language, use the multilingual model.
Link: https://blog.csdn.net/jiaowoshouzi/article/details/89388794/

=======================================================

Code:

 pretrain_run.py

#parser.add_argument('--model', type=str, required=True, help='choose a model: Bert, ERNIE')
parser.add_argument('--model', type=str, default='bert', help='choose a model: bert, ERNIE')

Data:

Saved model:

Run output:

 E:\Eprogramfiles\Anaconda3\python.exe E:/other/lightvit/Chinese-Text-Classification-PyTorch/pretrain_run.py
Microsoft Visual C++ Redistributable is not installed, this may lead to the DLL load failure.
                 It can be downloaded at https://aka.ms/vs/16/release/vc_redist.x64.exe
2023-06-05 12:19:06
0it [00:00, ?it/s]Loading data...
26612it [00:03, 8351.77it/s]
10000it [00:01, 9050.94it/s]
10000it [00:01, 8944.73it/s]
Time usage: 0:00:05
Epoch [1/3]
E:\other\lightvit\Chinese-Text-Classification-PyTorch\pytorch_pretrained\optimization.py:275: UserWarning: This overload of add_ is deprecated:
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha) (Triggered internally at  C:\Users\builder\tkoch\workspace\pytorch\pytorch_1647970138273\work\torch\csrc\utils\python_arg_parser.cpp:1052.)
  next_m.mul_(beta1).add_(1 - beta1, grad)
Iter:      0,  Train Loss:   2.4,  Train Acc: 10.16%,  Val Loss:   2.4,  Val Acc:  9.08%,  Time: 0:07:23 *

Prediction?

 

=======================================================

save

torch.save(model.state_dict(), PATH)
load

model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()  # do not forget this when running inference

Link: https://blog.csdn.net/qq_46497008/article/details/111682674

=======================================================

 

 

  1. bert_config_file: required; the BERT config file describing the model structure
  2. input_file: required; the input file
  3. output_dir: required; the output directory
  4. init_checkpoint: checkpoint used for initialization, usually a pretrained BERT model (or none)
  5. max_seq_length: maximum sequence length
  6. max_predictions_per_seq: maximum number of predictions per sequence, i.e. the maximum number of masked positions per sequence
  7. do_train: whether to run training
  8. do_eval: whether to run evaluation
  9. train_batch_size: training batch size
  10. eval_batch_size: evaluation batch size
  11. learning_rate: learning rate
  12. num_train_steps: number of training steps
  13. num_warmup_steps: number of warm-up steps
  14. save_checkpoints_steps: how many steps between checkpoint saves
  15. max_eval_steps: maximum number of evaluation steps
  16. use_tpu: whether to use a TPU
  17. other TPU-related parameters

 https://zhuanlan.zhihu.com/p/161740210
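
For reference, a possible invocation of Google's TensorFlow run_pretraining.py assembled only from the flags listed above; the script name, paths and values are placeholders, not something this repo ships:

python run_pretraining.py \
  --bert_config_file=bert_config.json \
  --input_file=tf_examples.tfrecord \
  --output_dir=pretraining_output \
  --init_checkpoint=bert_model.ckpt \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --do_train=True \
  --do_eval=True \
  --train_batch_size=32 \
  --eval_batch_size=8 \
  --learning_rate=2e-5 \
  --num_train_steps=10000 \
  --num_warmup_steps=1000 \
  --save_checkpoints_steps=1000 \
  --max_eval_steps=100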

=======================================================

 E:\Eprogramfiles\Anaconda3\python.exe E:/other/lightvit/Chinese-Text-Classification-PyTorch/pretrain_run.py
Microsoft Visual C++ Redistributable is not installed, this may lead to the DLL load failure.
                 It can be downloaded at https://aka.ms/vs/16/release/vc_redist.x64.exe
2023-06-05 16:11:48
Loading data...
1471it [00:00, 8104.06it/s]
301it [00:00, 9145.20it/s]
291it [00:00, 7678.44it/s]
Time usage: 0:00:00
Epoch [1/3]
E:\other\lightvit\Chinese-Text-Classification-PyTorch\pytorch_pretrained\optimization.py:275: UserWarning: This overload of add_ is deprecated:
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha) (Triggered internally at  C:\Users\builder\tkoch\workspace\pytorch\pytorch_1647970138273\work\torch\csrc\utils\python_arg_parser.cpp:1052.)
  next_m.mul_(beta1).add_(1 - beta1, grad)
Iter:      0,  Train Loss:   2.4,  Train Acc: 10.16%,  Val Loss:   2.4,  Val Acc:  9.38%,  Time: 0:00:28 *
Epoch [2/3]
Epoch [3/3]
Traceback (most recent call last):
  File "E:/other/lightvit/Chinese-Text-Classification-PyTorch/pretrain_run.py", line 44, in <module>
    train(config, model, train_iter, dev_iter, test_iter)
  File "E:\other\lightvit\Chinese-Text-Classification-PyTorch\pretrain_eval.py", line 82, in train
    test(config, model, test_iter)
  File "E:\other\lightvit\Chinese-Text-Classification-PyTorch\pretrain_eval.py", line 90, in test
    test_acc, test_loss, test_report, test_confusion = evaluate(config, model, test_iter, test=True)
  File "E:\other\lightvit\Chinese-Text-Classification-PyTorch\pretrain_eval.py", line 118, in evaluate
    report = metrics.classification_report(labels_all, predict_all, target_names=config.class_list, digits=4)
  File "E:\Eprogramfiles\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
    return f(**kwargs)
  File "E:\Eprogramfiles\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1950, in classification_report
    raise ValueError(
ValueError: Number of classes, 7, does not match size of target_names, 10. Try specifying the labels parameter

Process finished with exit code 1
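
The traceback comes from sklearn's classification_report: this (smaller) test split only contains 7 of the 10 classes, while config.class_list still has 10 names. As the error message suggests, passing the labels parameter explicitly keeps the report aligned with target_names. A self-contained sketch with dummy data (in the repo, the corresponding edit would go into evaluate() in pretrain_eval.py):

import numpy as np
from sklearn import metrics

class_list = ['finance', 'realty', 'stocks', 'education', 'science',
              'society', 'politics', 'sports', 'game', 'entertainment']

# Dummy labels/predictions that cover only some of the 10 classes,
# mimicking the failure case above.
labels_all = np.array([0, 1, 2, 3, 3, 5, 6])
predict_all = np.array([0, 1, 2, 3, 5, 5, 6])

report = metrics.classification_report(
    labels_all, predict_all,
    labels=list(range(len(class_list))),   # enumerate all expected label ids 0..9
    target_names=class_list,
    digits=4,
    zero_division=0,                       # silence the ill-defined precision warning
)
print(report)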

=======================================================

 E:\Eprogramfiles\Anaconda3\python.exe E:/other/lightvit/Chinese-Text-Classification-PyTorch/pretrain_run.py
Microsoft Visual C++ Redistributable is not installed, this may lead to the DLL load failure.
                 It can be downloaded at https://aka.ms/vs/16/release/vc_redist.x64.exe
2023-06-05 16:36:36
Loading data...
1471it [00:00, 8254.76it/s]
301it [00:00, 5172.79it/s]
592it [00:00, 6570.30it/s]
Time usage: 0:00:00
Epoch [1/3]
E:\other\lightvit\Chinese-Text-Classification-PyTorch\pytorch_pretrained\optimization.py:275: UserWarning: This overload of add_ is deprecated:
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha) (Triggered internally at  C:\Users\builder\tkoch\workspace\pytorch\pytorch_1647970138273\work\torch\csrc\utils\python_arg_parser.cpp:1052.)
  next_m.mul_(beta1).add_(1 - beta1, grad)
Iter:      0,  Train Loss:   2.4,  Train Acc: 10.16%,  Val Loss:   2.4,  Val Acc:  9.38%,  Time: 0:00:27 *
Epoch [2/3]
Epoch [3/3]
E:\Eprogramfiles\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
Test Loss:   2.4,  Test Acc:  4.92%
Precision, Recall and F1-Score...
               precision    recall  f1-score   support

      finance     0.0142    0.0541    0.0225        37
       realty     0.0000    0.0000    0.0000        36
       stocks     0.0000    0.0000    0.0000        29
    education     0.1667    0.0031    0.0060       325
      science     0.0000    0.0000    0.0000        27
      society     0.0625    0.8966    0.1169        29
     politics     0.0000    0.0000    0.0000        29
       sports     0.0000    0.0000    0.0000        26
         game     0.0000    0.0000    0.0000        24
entertainment     0.0000    0.0000    0.0000        27

     accuracy                         0.0492       589
    macro avg     0.0243    0.0954    0.0145       589
 weighted avg     0.0959    0.0492    0.0105       589

Confusion Matrix...
[[  2   1   0   3   0  31   0   0   0   0]
 [  3   0   0   0   0  33   0   0   0   0]
 [  3   1   0   0   0  24   1   0   0   0]
 [110   1   0   1   0 197   1   0  13   2]
 [  4   2   0   0   0  19   1   0   0   1]
 [  3   0   0   0   0  26   0   0   0   0]
 [  8   0   0   1   0  20   0   0   0   0]
 [  4   0   0   0   0  22   0   0   0   0]
 [  4   0   0   1   0  19   0   0   0   0]
 [  0   1   0   0   1  25   0   0   0   0]]
Time usage: 0:00:23
run time: 625 seconds
2023-06-05 16:47:01

Process finished with exit code 0

=======================================================

 

 


 https://huggingface.co/bert-base-chinese/tree/main

 

=======================================================

Model inputs

The core parameters are introduced below; see the official documentation for full details:

    input_ids (torch.LongTensor of shape (batch_size, sequence_length)): indices of the input sequence tokens in the vocabulary
    attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional): mask over the input that handles padding; values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked
    token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): segment token indices marking the first and second portion of the input; values are selected in [0, 1]: 0 corresponds to a sentence-A token, 1 to a sentence-B token

 

{'input_ids': [101, 2769, 1762, 3844, 6407, 8815, 8716, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

['[CLS]', '我', '在', '测', '试', 'be', '##rt', '[SEP]']
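
The two printouts above can be reproduced with the Chinese BERT tokenizer; a minimal sketch (the input text '我在测试bert' is inferred from the token list shown above):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
encoded = tokenizer('我在测试bert')   # dict with input_ids / token_type_ids / attention_mask
print(encoded)
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# expected: ['[CLS]', '我', '在', '测', '试', 'be', '##rt', '[SEP]']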

Model outputs

 

( last_hidden_state: FloatTensor = None,
pooler_output: FloatTensor = None,
hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None,
past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None,
attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None,
cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

https://huggingface.co/docs/transformers/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions

 

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): the sequence of hidden states at the output of the model's last layer.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): a tuple of torch.FloatTensor of shape (batch_size, sequence_length, hidden_size) (one for the embedding output, if the model has an embedding layer, plus one for the output of each layer)
Copyright notice: the passage above is by CSDN blogger 「科皮子菊」, licensed under CC 4.0 BY-SA; include the original source link and this notice when reposting.
Original: https://blog.csdn.net/meiqi0538/article/details/124891560

=======================================================

The output is a tuple with four parts:

last_hidden_state: shape (batch_size, sequence_length, hidden_size) with hidden_size=768; the hidden states of the model's last layer.

pooler_output: shape (batch_size, hidden_size); the last-layer hidden state of the first token of the sequence ([CLS]), further processed by a linear layer and a Tanh activation. This output is not a particularly good summary of the semantic content of the input; averaging or pooling the sequence of hidden states over the whole input usually represents a sentence better.

hidden_states: an optional output; to get it, set config.output_hidden_states=True. It is a tuple with 13 elements: the first can be treated as the embedding output, and the remaining 12 are the outputs of the individual layers; each element has shape (batch_size, sequence_length, hidden_size).

attentions: also optional; to get it, set config.output_attentions=True. It is a tuple with 12 elements containing each layer's attention weights, which are used to compute the weighted average in the self-attention heads.
Copyright notice: the passage above is by CSDN blogger 「uan_cs」, licensed under CC 4.0 BY-SA; include the original source link and this notice when reposting.
Original: https://blog.csdn.net/qq_41971355/article/details/121124868

 

import torch
from torch import tensor
from transformers import BertConfig, BertTokenizer, BertModel
 
model_path = 'model/chinese-roberta-wwm-ext/'  # path to the downloaded pretrained model files
config = BertConfig.from_pretrained(model_path, output_hidden_states = True, output_attentions=True)
assert config.output_hidden_states == True
assert config.output_attentions == True
model = BertModel.from_pretrained(model_path, config = config)
tokenizer = BertTokenizer.from_pretrained(model_path)
 
text = '我热爱这个世界'
 
# input = tokenizer(text)
# {'input_ids': [101, 2769, 4178, 4263, 6821, 702, 686, 4518, 102], 
#'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 
#'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
 
# input = tokenizer.encode(text)
# [101, 2769, 4178, 4263, 6821, 702, 686, 4518, 102]
 
# input = tokenizer.encode_plus(text)
# {'input_ids': [101, 2769, 4178, 4263, 6821, 702, 686, 4518, 102], 
#'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 
#'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
 
input_ids = torch.tensor([tokenizer.encode(text)], dtype=torch.long)  # even a single input needs a batch dimension
print(input_ids.shape)
#torch.Size([1, 9])
 
model.eval()
output = model(input_ids)
print(len(output))
print(output[0].shape)  # last-layer hidden states (batch_size, sequence_length, hidden_size)
print(output[1].shape)  # last-layer hidden state of the first token ([CLS]) (batch_size, hidden_size)
print(len(output[2]))  # requires output_hidden_states=True; all hidden states, the first element is the embedding output, the rest are the per-layer outputs (batch_size, sequence_length, hidden_size)
print(len(output[3]))  # requires output_attentions=True; per-layer attention weights used to compute the weighted average in the self-attention heads (batch_size, num_heads, sequence_length, sequence_length)
# 4
# torch.Size([1, 9, 768])
# torch.Size([1, 768])
# 13
# 12
 
all_hidden_state = output[2]
print(all_hidden_state[0].shape)
print(all_hidden_state[1].shape)
print(all_hidden_state[2].shape)
# torch.Size([1, 9, 768])
# torch.Size([1, 9, 768])
# torch.Size([1, 9, 768])
 
attentions = output[3]
print(attentions[0].shape)
print(attentions[1].shape)
print(attentions[2].shape)
# torch.Size([1, 12, 9, 9])
# torch.Size([1, 12, 9, 9])
# torch.Size([1, 12, 9, 9])

 

 

=======================================================

 

Parameters


  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

     

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.


  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

     

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.


  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

     

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.


  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and optionally if config.is_encoder_decoder=True 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

     

    Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

Base class for model’s outputs that also contains a pooling of the last hidden states.

=======================================================

 

import torch
from transformers import BertTokenizer, BertModel
 
bertModel = BertModel.from_pretrained('bert-base-chinese', output_hidden_states=True, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
 
 
text = '让我们来看一下bert的输出都有哪些'
# Difference between encode and encode_plus:
# encode returns only input_ids
# encode_plus returns all of the encoding information, namely:
# input_ids: token indices in the vocabulary;
# token_type_ids: distinguishes the two sentences (all 0 for the first sentence, all 1 for the second);
# attention_mask: marks which tokens self-attention should attend to.
input_ids = torch.tensor([tokenizer.encode(text)]).long()
outputs = bertModel(input_ids)
print(len(outputs))
print(outputs.keys())
print(outputs['last_hidden_state'].shape)
print(outputs['pooler_output'].shape)
print(len(outputs['hidden_states']))
print(len(outputs['attentions']))

Output:

4
odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states', 'attentions'])
torch.Size([1, 18, 768])
torch.Size([1, 768])
13
12

As you can see, BERT's output consists of four parts:
last_hidden_state: shape (batch_size, sequence_length, hidden_size) with hidden_size=768; the hidden states output by the model's last layer (commonly used for named entity recognition).
pooler_output: shape (batch_size, hidden_size); the last-layer hidden state of the first token (the classification token), further processed by a linear layer and a Tanh activation (commonly used for sentence classification; whether to use this or to average/pool the whole sequence of hidden states depends on the task).
hidden_states: optional; requires config.output_hidden_states=True; a tuple whose first element is the embedding output and whose remaining elements are the per-layer outputs, each of shape (batch_size, sequence_length, hidden_size).
attentions: optional; requires config.output_attentions=True; a tuple whose elements are each layer's attention weights, used to compute the weighted average in the self-attention heads.

Reference:
https://www.cnblogs.com/xiximayou/p/15016604.html

=======================================================

 

Since each layer of BERT's output carries different information, this post tries doing classification with all 12 output layers plus the embedding layer together. But how should the different layers be weighted? The idea is to add an attention mechanism and learn a weight per layer. The PyTorch model code is below:

 

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertConfig, BertModel


class NeuralNet(nn.Module):
    def __init__(self, model_name_or_path, hidden_size=768, num_class=3):
        super(NeuralNet, self).__init__()

        self.config = BertConfig.from_pretrained(model_name_or_path, num_labels=3)
        self.config.output_hidden_states = True   # must be True so that all hidden states are returned
        self.config.return_dict = False           # keep tuple outputs on newer transformers versions
        self.bert = BertModel.from_pretrained(model_name_or_path, config=self.config)
        for param in self.bert.parameters():
            param.requires_grad = True
        self.weights = nn.Parameter(torch.rand(13, 1))   # one learnable weight per layer (embedding + 12 layers)
        self.dropouts = nn.ModuleList([
            nn.Dropout(0.5) for _ in range(5)
        ])
        self.fc = nn.Linear(hidden_size, num_class)

    def forward(self, input_ids, input_mask, segment_ids):
        last_hidden_states, pool, all_hidden_states = self.bert(input_ids, token_type_ids=segment_ids,
                                                                attention_mask=input_mask)
        batch_size = input_ids.shape[0]
        # Stack the [CLS] vector of all 13 hidden states: (13, batch_size, 1, 768)
        ht_cls = torch.cat(all_hidden_states)[:, :1, :].view(
            13, batch_size, 1, 768)
        # Attention over layers: score each layer, softmax, then take the weighted sum
        atten = torch.sum(ht_cls * self.weights.view(
            13, 1, 1, 1), dim=[1, 3])
        atten = F.softmax(atten.view(-1), dim=0)
        feature = torch.sum(ht_cls * atten.view(13, 1, 1, 1), dim=[0, 2])
        # Multi-sample dropout: average the classifier output over 5 dropout masks
        for i, dropout in enumerate(self.dropouts):
            if i == 0:
                h = self.fc(dropout(feature))
            else:
                h += self.fc(dropout(feature))
        h = h / len(self.dropouts)
        return h
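
A minimal usage sketch (assumes bert-base-chinese can be loaded; batch size, sequence length and token ids are arbitrary):

# Hypothetical smoke test for the layer-attention model above.
model = NeuralNet('bert-base-chinese')
input_ids = torch.randint(1, 1000, (2, 32))      # (batch, seq_len)
attention_mask = torch.ones_like(input_ids)
token_type_ids = torch.zeros_like(input_ids)
logits = model(input_ids, attention_mask, token_type_ids)
print(logits.shape)   # expected: torch.Size([2, 3])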

=======================================================