Instruct-GPT

Published: 2023-07-08 14:55:04 | Author: 戈壁与草原

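Data Collection Details

  Data collection is a key part of InstructGPT: it covers what types of data to collect, how to screen labelers, and so on. The data types correspond to the three training stages of InstructGPT, while labeler screening is meant to raise the quality of the collected data. The data collection details below show why labeler screening is needed.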

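Labeler Screening

  Labelers must be able to judge outputs for a broadly distributed dataset of natural-language prompts [Note 1], part of which is sensitive. The InstructGPT team ran a screening test to select labelers who are good at identifying sensitive content. The screening criteria are:

  • Ranking model outputs: given a set of prompts, each with outputs from several models, labelers rank the outputs by overall quality. Their rankings are then compared with the rankings made by the researchers.
  • Sensitive speech: the team created a dataset of prompts and completions (prompt, completion) in which some prompts or completions are sensitive (able to elicit strong negative feelings, e.g. toxic, sexual, violent, judgmental, or political content). The InstructGPT team labeled this data themselves and then compared the candidate labelers' labels against their own.
  • Self-assessed ability to identify sensitive speech aimed at different groups: the team wanted labelers who can identify a broad range of sensitive content but, for legal reasons, could not hire based on demographic criteria. Candidates were therefore asked questions such as "For which topics or cultural groups are you comfortable identifying sensitive speech?", and the answers were used as part of the screening.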
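
The ranking comparison in the first criterion can be made concrete with a small sketch. The post does not say which agreement measure the team used, so the pairwise agreement rate below is only an assumption chosen for illustration, and the function name and sample rankings are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(candidate_ranking, reference_ranking):
    """Fraction of output pairs that both rankings order the same way.

    Each ranking is a list of output ids, best first, and both lists
    must contain the same ids.
    """
    if set(candidate_ranking) != set(reference_ranking):
        raise ValueError("rankings must cover the same outputs")
    cand_pos = {out: i for i, out in enumerate(candidate_ranking)}
    ref_pos = {out: i for i, out in enumerate(reference_ranking)}
    pairs = list(combinations(reference_ranking, 2))
    if not pairs:
        return 1.0  # zero or one output: nothing to disagree on
    agree = sum(
        (cand_pos[a] < cand_pos[b]) == (ref_pos[a] < ref_pos[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical example: the candidate swaps outputs B and C.
researcher = ["A", "B", "C", "D"]
candidate = ["A", "C", "B", "D"]
score = pairwise_agreement(candidate, researcher)
print(f"agreement: {score:.3f}")
```

A candidate whose score stays high across many prompts would pass this part of the screening; the threshold itself is not given in the post.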

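Labeling Instructions

  When labeling training data, labelers are asked to treat helpfulness as the most important criterion, ranking it above truthfulness and harmlessness. In the final evaluations, however, labelers are asked to prioritize truthfulness and harmlessness. The authors are also exploring ways to make the model prioritize truthfulness and harmlessness over helpfulness during training, in particular by having it refuse certain instructions. This raises some challenges: different applications carry different levels of risk, so refusals should ideally be configurable at inference time. There is also a risk that the model over-generalizes and refuses harmless instructions, which is undesirable for most applications.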

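Labeler Demographic Data

  A voluntary, anonymous survey was sent to the labelers to learn about their demographics. This suggests that InstructGPT takes bias seriously and tries to reduce bias in the data labels at the labeling stage.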

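The labeling interface is shown in Figure 1:

  • Rate the model output for a given prompt/instruct on a scale of 1-7
  • Label different aspects: whether the output follows the instruction correctly; whether it is appropriate for a customer assistant; whether it contains sexual content; whether it contains violent content; whether it encourages or fails to discourage violence, abuse, terrorism, or self-harm; whether it denigrates a protected class; whether it gives harmful advice; whether it expresses moral judgment
  • Rank the outputs of different models for the same prompt/instruct by quality
Figure 1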
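
As a rough illustration of what one labeling record from this interface might hold, here is a minimal sketch. All class and field names are hypothetical; they simply mirror the three items in the list above and are not taken from any real InstructGPT tooling.

```python
from dataclasses import dataclass, field

# Hypothetical flag names, one per aspect listed above.
FLAG_NAMES = (
    "follows_instruction",
    "appropriate_for_customer_assistant",
    "contains_sexual_content",
    "contains_violent_content",
    "encourages_violence_abuse_terrorism_self_harm",
    "denigrates_protected_class",
    "gives_harmful_advice",
    "expresses_moral_judgment",
)


@dataclass
class OutputLabel:
    """Rating and yes/no flags for a single model output."""
    prompt: str
    output: str
    rating: int  # overall quality, 1 (worst) to 7 (best)
    flags: dict = field(default_factory=dict)

    def __post_init__(self):
        if not 1 <= self.rating <= 7:
            raise ValueError("rating must be between 1 and 7")
        unknown = set(self.flags) - set(FLAG_NAMES)
        if unknown:
            raise ValueError(f"unknown flags: {sorted(unknown)}")


@dataclass
class PromptRanking:
    """Outputs of different models for one prompt, ordered best first."""
    prompt: str
    ranked_outputs: list

    def __post_init__(self):
        if len(set(self.ranked_outputs)) != len(self.ranked_outputs):
            raise ValueError("each output may appear only once in a ranking")


label = OutputLabel(
    prompt="Translate this sentence to Spanish:",
    output="(model output)",
    rating=5,
    flags={"follows_instruction": True, "gives_harmful_advice": False},
)
ranking = PromptRanking(
    prompt="Translate this sentence to Spanish:",
    ranked_outputs=["output from model A", "output from model B"],
)
```

Keeping the per-output rating and the per-prompt ranking in separate records follows the interface, where they are two distinct labeling actions.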

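[Note 1]
About the instruct dataset: formally, each record has three parts: (instruction, input, output), or (instruct, input, output).
For example:

  • instruct: Write an essay of at least 800 characters on the following themes
  • input: helping others, acting bravely for a just cause
  • output: xxx

However, there is no sharp line between the instruction and the input: the input can be omitted or left empty.
For example:

  • instruct: Write an essay of at least 800 characters on the themes of helping others and acting bravely for a just cause
  • input: ""
  • output: xxx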
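
Because the input may be empty, code that turns a record into a prompt has to handle both shapes. The sketch below is one way to do that; the function name and the blank-line separator are assumptions made for illustration, not a format prescribed by InstructGPT.

```python
def build_prompt(instruct, input_text=""):
    """Join an instruction and its optional input into one prompt string."""
    instruct = instruct.strip()
    input_text = input_text.strip()
    if not input_text:
        # No separate input: the instruction alone is the prompt.
        return instruct
    # Assumed layout: instruction, blank line, then the input.
    return f"{instruct}\n\n{input_text}"


with_input = build_prompt(
    "Write an essay of at least 800 characters on the following themes",
    "helping others, acting bravely for a just cause",
)
without_input = build_prompt(
    "Write an essay of at least 800 characters on the themes of "
    "helping others and acting bravely for a just cause",
    "",
)
print(with_input)
print(without_input)
```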

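How is an instruct dataset obtained?
instruct: submitted by users to the API or written by labelers, both of which are human-generated; instructions can also be generated by a model, as in the method introduced in Self-Instruct
output: likewise a mix of model-generated and human-written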

Model

TODO

Model Evaluation

TODO

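InstructGPT Overview

  Making language models larger does not by itself make them better at following user intent: the model may produce output that is untruthful, harmful, biased, or unhelpful, and this does not improve as the model grows. The InstructGPT paper sets out to improve this. The ability of a model to understand user intent and to produce output that complies with the laws and ethical norms of today's society is called alignment with humans. The model acquires this ability in three steps, as shown in Figure 2.

  • Step1: Select instruction data: the instruct samples come from users' submissions to the API and from labelers who write them by hand. The corresponding input-output pairs are written by labelers. GPT-3 is then fine-tuned on this dataset.
  • Step2: Starting from a large number of API instructs, several different outputs are generated for each instruct by different models, and human labelers rank them. A reward model (RM) is trained on this dataset.
  • Step3: Use the RM above to score the outputs of the fine-tuned GPT-3 model, and optimize the model with reinforcement learning so that its outputs earn higher RM scores.
Figure 2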
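
To make Step 2 more tangible, the sketch below shows how one human ranking can be expanded into pairwise comparisons and scored with a pairwise logistic loss of the form -log(sigmoid(r_better - r_worse)), which is the kind of objective the InstructGPT paper describes for its reward model. This is a toy illustration in plain Python: the `reward` callable stands in for a neural reward model, and the function names and scores are hypothetical.

```python
import math
from itertools import combinations


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def comparison_pairs(ranked_outputs):
    """Expand a best-first ranking of K outputs into K*(K-1)/2 (better, worse) pairs."""
    return list(combinations(ranked_outputs, 2))


def pairwise_ranking_loss(reward, ranked_outputs):
    """Mean of -log(sigmoid(r_better - r_worse)) over all pairs of one ranking."""
    pairs = comparison_pairs(ranked_outputs)
    if not pairs:
        raise ValueError("need at least two ranked outputs")
    total = 0.0
    for better, worse in pairs:
        diff = reward(better) - reward(worse)
        # -log(sigmoid(diff)) equals softplus(-diff)
        total += softplus(-diff)
    return total / len(pairs)


# Hypothetical scores from a stand-in reward model.
scores = {"a": 2.0, "b": 1.0, "c": 0.0}
good_loss = pairwise_ranking_loss(scores.get, ["a", "b", "c"])  # ranking matches scores
bad_loss = pairwise_ranking_loss(scores.get, ["c", "b", "a"])   # ranking contradicts scores
print(good_loss, bad_loss)
```

The loss is small when the reward model already scores human-preferred outputs higher and large when it contradicts the human ranking, which is the signal used to train the RM before Step 3.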

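Three Kinds of Instruct Data Written by Labelers

  • Plain: labelers are simply asked to write arbitrary instruct examples, as long as the examples are sufficiently diverse
  • One-to-many (few-shot): labelers write an instruct example together with multiple (input, output) samples that correspond to it
  • API-based: starting from the instruct examples users submitted to the OpenAI API, labelers write, for each of them, examples that are similar or have the same meaning.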

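Figure 3 shows the instruct samples submitted to the API broken down by category. Most of them are generative rather than classification or question-answering tasks.

Figure 3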
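
A category breakdown like the one in Figure 3 can be computed by tallying the use-case label of each prompt. The sketch below assumes each prompt already carries such a label; the function name is hypothetical and the sample list is made up, so it does not reproduce the figure's numbers.

```python
from collections import Counter


def use_case_shares(use_cases):
    """Return {use_case: percentage of prompts}, most common first."""
    counts = Counter(use_cases)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.most_common()}


# Made-up labels for illustration only.
sample = ["generation", "generation", "open qa", "brainstorming", "generation"]
shares = use_case_shares(sample)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```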

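Table 1 lists example prompts submitted by users from the InstructGPT distribution. I also compared them with user-submitted prompt examples from the GPT-3 distribution but could not tell the difference.

Table 1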
Use Case Example
brainstorming List five ideas for how to regain enthusiasm for my career
brainstorming What are some key points I should know when studying Ancient Greece?
brainstorming What are 4 questions a user might have after reading the instruction manual for a trash compactor?

{user manual}

1.
brainstorming What are 10 science fiction books I should read next?
classification Take the following text and rate, on a scale from 1-10, how sarcastic the person is being (1 = not at all, 10 = extremely sarcastic). Also give an explanation

{text}
Rating:
classification This is a list of tweets and the sentiment categories they fall into.
Tweet: {tweet_content1}
Sentiment: {sentiment1}
Tweet: {tweet_content2}
Sentiment:
classification {java code}
What language is the code above written in?
classification You are a very serious professor, and you check papers to see if they contain missing citations. Given the text, say whether it is missing an important citation (YES/NO) and which sentence(s) require citing.
extract Extract all course titles from the table below:
extract Extract all place names from the article below:
extract Given the following list of movie titles, write down any names of cities in the titles.
generation Write a creative ad for the following product to run on Facebook aimed at parents:
Product:
generation Write a short story where a brown bear to the beach, makes friends with a seal, and then return home.
generation Here’s a message to me:

{email}

Here are some bullet points for a reply:

{message}

Write a detailed reply
generation This is an article about how to write a cover letter when applying for jobs:

It’s important to spend some time
generation write rap lyrics on the topics mentioned in this news article:
—-
{article}
—-
rewrite This is the summary of a Broadway play:
"""
{summary}
"""
This is the outline of the commercial for that play:
"""
rewrite Translate this sentence to Spanish:
rewrite Create turn-by-turn navigation given this text: Go west on {road1} unto you hit {road2}.Desination will be a red barn on the right then take it east to {road3}.
1.
rewrite Rewrite the following text to be more light-hearted:

{very formal text}
chat The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.
Human: Hello, who are you?
AI: I am an AI created by OpenAI. How can I help you today?
Human: I’d like to cancel my subscription.
AI:
chat Marv is a chatbot that reluctantly answers questions with sarcastic responses:
You: How many pounds are in a kilogram?
Marv: This again? There are 2.2 pounds in a kilogram. Please make a note of this.
You: What does HTML stand for?
Marv: Was Google too busy? Hypertext Markup Language. The T is for try to ask better questions in the future.
You: When did the first airplane fly?
Marv:
chat This is a conversation with an enlightened Buddha. Every response is full of wisdom and love.
Me: How can I achieve greater peace and equanimity?
Buddha:
closed qa Help me answer questions about the following short story:
{story}
What is the moral of the story?
closed qa Answer the following question:
What shape is the earth?
A) A circle
B) A sphere
C) An ellipse
D) A plane
closed qa Tell me how hydrogen and helium are different, using the following facts:
open qa I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with "Unknown".
Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.
Q: Who was president of the United States in 1955?
A:
open qa Who built the statue of liberty?
open qa How do you take the derivative of the sin function?
open qa who are the indiginous people of New Zealand?
summarization Summarize this for a second-grade student:
summarization {news article}
Tl;dr:
summarization {chat transcript}
Summarize the above conversation between a customer and customer assistant. Make sure to state any complaints that the customer has.
other start with where
other Look up "cowboy" on Google and give me the results.
other Johnathan Silver goes to the market every day, and brings back a