HuggingFace | Loading with the from_pretrained function in HuggingFace

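We use HuggingFace's from_pretrained() function to load models and tokenizers. Which files does it need in order to load them?

Loading the model

Test code: if loading succeeds, it prints 1.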

from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("./bert-base-chinese")

print(1)

Directory layout:

|- bert-base-chinese
|-- various checkpoint files
|- test.py

If the only checkpoint file is pytorch_model.bin:

OSError: ./bert-base-chinese does not appear to have a file named config.json. Checkout 'https://huggingface.co/./bert-base-chinese/None' for available files.

Next, if the checkpoint files are pytorch_model.bin and config.json:

Some weights of the model checkpoint at ./bert-base-chinese were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
1

Conclusion:

Loading a model with from_pretrained() requires the pytorch_model.bin and config.json files.
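
The check above can be turned into a small pre-flight helper. The following is a minimal sketch using only the standard library. `missing_model_files` is a hypothetical helper name, not part of transformers, and the required file list simply mirrors the experiment above. Newer transformers versions may also accept other weight formats such as model.safetensors, which this sketch does not cover.

```python
from pathlib import Path

# Files the experiment above found necessary for loading a model.
# Assumption: a PyTorch checkpoint stored as pytorch_model.bin.
REQUIRED_MODEL_FILES = ("config.json", "pytorch_model.bin")


def missing_model_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_MODEL_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("./bert-base-chinese")
    if missing:
        print("missing files:", ", ".join(missing))
    else:
        print("all required model files are present")
```

Running this before from_pretrained() gives a shorter message than the OSError shown earlier, at the cost of hard-coding the file names.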

Loading the tokenizer

Test code: if loading succeeds, it prints 1.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./bert-base-chinese")

print(1)

Directory layout:

|- bert-base-chinese
|-- various checkpoint files
|- test.py

If the only checkpoint file is tokenizer.json:

OSError: ./bert-base-chinese does not appear to have a file named config.json. Checkout 'https://huggingface.co/./bert-base-chinese/None' for available files.

Next, if the checkpoint files are tokenizer.json and config.json, the tokenizer loads successfully.

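Conclusion:

Loading a tokenizer with from_pretrained() requires the tokenizer.json and config.json files. We should still add the matching tokenizer_config.json and vocab.txt files as well, because they are used later on.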

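Project components

A complete transformer model mainly consists of three parts:

Config: controls the model name, the form of the final output, the width and depth of the hidden layers, the type of activation function, and so on. When exported, the Config class is written in JSON format, like this: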

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

Conversely, a Config instance can be created from config.json, so the two operations are inverses of each other.
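
To illustrate the round trip between a config object and config.json, here is a minimal sketch using only the standard library. `ToyConfig` and `round_trip` are hypothetical names, and `ToyConfig` holds only a few of the fields shown above. It is a stand-in for illustration, not the transformers config class, which provides its own save and load methods.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ToyConfig:
    """Hypothetical stand-in holding a few fields from the JSON above."""
    hidden_size: int = 768
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    hidden_act: str = "gelu"
    vocab_size: int = 30522

    def to_json_file(self, path):
        # Export: serialize the fields to a JSON file.
        Path(path).write_text(json.dumps(asdict(self), indent=2, sort_keys=True))

    @classmethod
    def from_json_file(cls, path):
        # Import: rebuild the object from the same JSON file.
        return cls(**json.loads(Path(path).read_text()))


def round_trip(config):
    """Write config to a temporary config.json and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "config.json"
        config.to_json_file(path)
        return ToyConfig.from_json_file(path)


if __name__ == "__main__":
    original = ToyConfig(hidden_size=256, num_hidden_layers=4)
    print(round_trip(original) == original)
```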

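Tokenizer: this converts plain text into encoded form. Note that the Tokenizer does not turn words into word vectors; it only splits the plain text into tokens, adds the [MASK], [SEP], and [CLS] markers, and converts the tokens to vocabulary indices. When exported, the Tokenizer is saved as three files:

vocab.txt

The vocabulary file; each line is a word or a piece of a word.

special_tokens_map.json: defines the special tokens.

{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

tokenizer_config.json: the configuration file, which mainly stores tokenizer-specific settings.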
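
The tokenizer's job (split the text, add special tokens, map tokens to vocabulary indices) can be sketched as follows. This is a simplified illustration, not the algorithm that transformers implements: it assumes a tiny made-up vocabulary, splits text into single characters, and skips WordPiece subword splitting. The names `load_vocab`, `encode`, and `TOY_VOCAB_LINES` are hypothetical.

```python
def load_vocab(lines):
    """Map each token to its line index, as in a vocab.txt file."""
    return {token.strip(): index for index, token in enumerate(lines)}


def encode(text, vocab):
    """Split text into characters, add [CLS]/[SEP], and look up ids."""
    tokens = ["[CLS]"] + [ch for ch in text if not ch.isspace()] + ["[SEP]"]
    unk_id = vocab["[UNK]"]
    # Characters missing from the vocabulary fall back to [UNK].
    return [vocab.get(token, unk_id) for token in tokens]


# Made-up vocabulary for illustration; a real vocab.txt has thousands of lines.
TOY_VOCAB_LINES = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "你", "好", "世", "界"]

if __name__ == "__main__":
    vocab = load_vocab(TOY_VOCAB_LINES)
    print(encode("你好世界", vocab))  # every character is in the toy vocabulary
    print(encode("你好吗", vocab))    # "吗" is not, so it maps to [UNK]
```

The result is a list of integer indices, not vectors; turning indices into embeddings happens inside the model.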

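Model: the various model classes. Besides base models such as Bert and GPT, there are also models defined for downstream tasks, such as BertForQuestionAnswering. Exporting a model produces config.json and the pytorch_model.bin parameter file. The former is the configuration file described under Config above, which matches intuition: config and model are two tightly coupled classes. The latter is essentially the same as a file written by torch.save(), because every Model inherits, directly or indirectly, from PyTorch's Module class. This shows that HuggingFace's implementation stays close to PyTorch's native API.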
