OpenAI Cookbook - Embeddings

Published 2023-04-27 13:59:28, by fxjwind

https://github.com/openai/openai-cookbook

 

I won't explain what an embedding is here.

An embedding produced by a large model carries more than the raw text itself, because it encodes a great deal of relatedness.

But how do you use embeddings? The basic logic is text similarity.

So semantic search is the simplest case: store the embeddings in a vector database and search against them.

Recommendations work in much the same way.

Question answering needs one extra step: feed the retrieved text to the large model as input, so that it can give a more accurate answer.

Semantic search

Embeddings can be used for search either by themselves or as a feature in a larger system.

The simplest way to use embeddings for search is as follows:

  • Before the search (precompute):
    • Split your text corpus into chunks smaller than the token limit (8,191 tokens for text-embedding-ada-002)
    • Embed each chunk of text
    • Store those embeddings in your own database or in a vector search provider like Pinecone, Weaviate, or Qdrant
  • At the time of the search (live compute):
    • Embed the search query
    • Find the closest embeddings in your database
    • Return the top results

An example of how to use embeddings for search is shown in Semantic_text_search_using_embeddings.ipynb.
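
To make the flow concrete, here is a minimal sketch (not from the cookbook notebook), assuming the pre-1.0 openai Python SDK and scipy, with a small in-memory list standing in for a vector database:

# A rough sketch of the precompute + live-compute flow above (not the notebook's code).
import openai
from scipy import spatial

EMBEDDING_MODEL = "text-embedding-ada-002"

def embed(text: str) -> list[float]:
    """Embed a single string with the OpenAI embeddings endpoint."""
    return openai.Embedding.create(model=EMBEDDING_MODEL, input=text)["data"][0]["embedding"]

# Before the search (precompute): embed each chunk and keep it next to its text.
chunks = [
    "Curling is a sport played on ice.",
    "Python is a popular programming language.",
]
chunk_embeddings = [(chunk, embed(chunk)) for chunk in chunks]

# At search time (live compute): embed the query and rank chunks by cosine similarity.
query_embedding = embed("Which sport is played on ice?")
ranked = sorted(
    chunk_embeddings,
    key=lambda pair: 1 - spatial.distance.cosine(query_embedding, pair[1]),
    reverse=True,
)
print(ranked[0][0])  # the curling sentence should rank first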

In more advanced search systems, the cosine similarity of embeddings can be used as one feature among many in ranking search results.

Question answering

The best way to get reliably honest answers from GPT-3 is to give it source documents in which it can locate correct answers. Using the semantic search procedure above, you can cheaply search a corpus of documents for relevant information and then give that information to GPT-3, via the prompt, to answer a question. We demonstrate this in Question_answering_using_embeddings.ipynb.

Recommendations

Recommendations are quite similar to search, except that instead of a free-form text query, the inputs are items in a set.

An example of how to use embeddings for recommendations is shown in Recommendation_using_embeddings.ipynb.

Similar to search, these cosine similarity scores can either be used on their own to rank items or as features in larger ranking algorithms.
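
As a minimal sketch of that idea (assumed, not the notebook's code): take an item's embedding and rank every other item by cosine similarity to it.

# Recommend the items whose embeddings are closest to a given item's embedding.
from scipy import spatial

def recommend(source_index: int, items: list[str], embeddings: list[list[float]], k: int = 3) -> list[str]:
    """Return the k items most similar to items[source_index], excluding itself."""
    source_embedding = embeddings[source_index]
    scored = [
        (item, 1 - spatial.distance.cosine(source_embedding, emb))
        for i, (item, emb) in enumerate(zip(items, embeddings))
        if i != source_index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in scored[:k]]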

Customizing Embeddings

Although OpenAI's embedding model weights cannot be fine-tuned, you can nevertheless use training data to customize embeddings to your application.

In Customizing_embeddings.ipynb, we provide an example method for customizing your embeddings using training data. The idea of the method is to train a custom matrix to multiply embedding vectors by in order to get new customized embeddings. With good training data, this custom matrix will help emphasize the features relevant to your training labels. You can equivalently consider the matrix multiplication as (a) a modification of the embeddings or (b) a modification of the distance function used to measure the distances between embeddings.
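
A minimal sketch of what the matrix multiplication looks like once the matrix has been trained (the training itself is what the notebook covers; the matrix below is only a random placeholder):

import numpy as np

embedding_dim = 1536  # dimension of text-embedding-ada-002 embeddings
custom_dim = 512      # dimension of the customized embeddings (a free choice)
custom_matrix = np.random.randn(embedding_dim, custom_dim)  # stand-in for a trained matrix

def customize(embedding: list[float]) -> np.ndarray:
    """Project a raw embedding into the customized space."""
    return np.asarray(embedding) @ custom_matrix

def customized_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity measured in the customized space, i.e. a modified distance function."""
    ca, cb = customize(a), customize(b)
    return float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))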

 

Getting an embedding is simple: just pass the input text and the model name.

import openai

# Embed a single string with the pre-1.0 openai SDK; the vector is returned
# under data[0].embedding in the response.
embedding = openai.Embedding.create(
    input="Your text goes here", model="text-embedding-ada-002"
)["data"][0]["embedding"]
len(embedding)  # 1536 dimensions for text-embedding-ada-002

 

Let's look more closely at the QA scenario: what do you do when you want GPT to answer questions it doesn't know the answers to?

What should you do if you want GPT to answer questions about unfamiliar topics? E.g.,

  • Recent events after Sep 2021
  • Your non-public documents
  • Information from past conversations
  • etc.

This notebook demonstrates a two-step Search-Ask method for enabling GPT to answer questions using a library of reference text.

  1. Search: search your library of text for relevant text sections
  2. Ask: insert the retrieved text sections into a message to GPT and ask it the question

 

This passage explains clearly why the embedding approach beats fine-tuning.

Fine-tuning is better suited to teaching a task or a style, i.e. a pattern; for knowledge, my understanding is that the fine-tuning data is far too small to meaningfully shift what the model already stores.

So feeding knowledge in via the input is better. The catch is that the model's input is limited: gpt-3.5 only has 4,096 tokens. How do we deal with that?

Why search is better than fine-tuning

GPT can learn knowledge in two ways:

  • Via model weights (i.e., fine-tune the model on a training set)
  • Via model inputs (i.e., insert the knowledge into an input message)

Although fine-tuning can feel like the more natural option—training on data is how GPT learned all of its other knowledge, after all—we generally do not recommend it as a way to teach the model knowledge. Fine-tuning is better suited to teaching specialized tasks or styles, and is less reliable for factual recall.

As an analogy, model weights are like long-term memory. When you fine-tune a model, it's like studying for an exam a week away. When the exam arrives, the model may forget details, or misremember facts it never read.

In contrast, message inputs are like short-term memory. When you insert knowledge into a message, it's like taking an exam with open notes. With notes in hand, the model is more likely to arrive at correct answers.

One downside of text search relative to fine-tuning is that each model is limited by a maximum amount of text it can read at once:

Model           Maximum text length
gpt-3.5-turbo   4,096 tokens (~5 pages)
gpt-4           8,192 tokens (~10 pages)
gpt-4-32k       32,768 tokens (~40 pages)

Continuing the analogy, you can think of the model like a student who can only look at a few pages of notes at a time, despite potentially having shelves of textbooks to draw upon.

Therefore, to build a system capable of drawing upon large quantities of text to answer questions, we recommend using a Search-Ask approach.

 

The answer given here is to filter down to the relevant text via search.

Text can be searched in many ways. E.g.,

  • Lexical-based search
  • Graph-based search
  • Embedding-based search

This example notebook uses embedding-based search. Embeddings are simple to implement and work especially well with questions, as questions often don't lexically overlap with their answers.

Consider embeddings-only search as a starting point for your own system.
Better search systems might combine multiple search methods, along with features like popularity, recency, user history, redundancy with prior search results, click rate data, etc.
Q&A retrieval performance may also be improved with techniques like HyDE, in which questions are first transformed into hypothetical answers before being embedded.
Similarly, GPT can also potentially improve search results by automatically transforming questions into sets of keywords or search terms.
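
As a minimal sketch of the HyDE idea (assumed, not from this notebook), again with the pre-1.0 openai SDK: have the model draft a hypothetical answer first, then embed that answer instead of the raw question.

import openai

def hyde_embedding(question: str) -> list[float]:
    """Embed a hypothetical answer to the question, for use as the retrieval query."""
    hypothetical_answer = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Write a short passage that plausibly answers the question."},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )["choices"][0]["message"]["content"]
    # Answers tend to be lexically and semantically closer to the passages we want to retrieve.
    return openai.Embedding.create(
        model="text-embedding-ada-002",
        input=hypothetical_answer,
    )["data"][0]["embedding"]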

 

So the full QA procedure looks like this:

Full procedure

Specifically, this notebook demonstrates the following procedure:

  1. Prepare search data (once)
    1. Collect: We'll download a few hundred Wikipedia articles about the 2022 Olympics
    2. Chunk: Documents are split into short, mostly self-contained sections to be embedded
    3. Embed: Each section is embedded with the OpenAI API
    4. Store: Embeddings are saved (for large datasets, use a vector database)
  2. Search (once per query)
    1. Given a user question, generate an embedding for the query from the OpenAI API
    2. Using the embeddings, rank the text sections by relevance to the query
  3. Ask (once per query)
    1. Insert the question and the most relevant sections into a message to GPT
    2. Return GPT's answer
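
Step 1 in code, as a rough sketch (the notebook itself downloads a precomputed CSV of embedded Wikipedia sections; the section strings here are placeholders), producing the DataFrame df that the search code further down expects:

import openai
import pandas as pd

EMBEDDING_MODEL = "text-embedding-ada-002"

sections = [
    "Curling at the 2022 Winter Olympics -- Men's tournament ...",
    "Curling at the 2022 Winter Olympics -- Women's tournament ...",
]
embeddings = [
    openai.Embedding.create(model=EMBEDDING_MODEL, input=section)["data"][0]["embedding"]
    for section in sections
]
# "text" and "embedding" are the two columns the search function below expects.
df = pd.DataFrame({"text": sections, "embedding": embeddings})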

Let's look at the example.

First, a quick experiment: ask GPT directly,

Which athletes won the gold medal in curling at the 2022 Winter Olympics?

It doesn't know the answer.

The right approach is to put the relevant context into the prompt; then GPT can answer:

query = f"""Use the below article on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found, write "I don't know."

Article:
\"\"\"
{wikipedia_article_on_curling}
\"\"\"

Question: Which athletes won the gold medal in curling at the 2022 Winter Olympics?"""
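
Sending that prompt to the chat model looks roughly like this (a sketch; the system message is an assumption, and the notebook's call is along the same lines):

import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": query},
    ],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])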

Now for the code that does the searching and asking.

Note that the search here doesn't use a vector database; it computes cosine distances in memory with spatial.distance.cosine, mainly to keep the example simple.

Also note that when passing the retrieved material to the OpenAI API, it's best to tokenize it and check it against the token budget first.

The other steps are straightforward.

import openai
import pandas as pd
import tiktoken
from scipy import spatial

EMBEDDING_MODEL = "text-embedding-ada-002"  # as defined earlier in the notebook
GPT_MODEL = "gpt-3.5-turbo"                 # as defined earlier in the notebook


# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]


def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question
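
The last piece is to send the assembled message to the chat model. A rough sketch of that ask step (the notebook defines a similar function; treat this as an approximation, not its exact code):

def ask(query: str, df: pd.DataFrame, model: str = GPT_MODEL, token_budget: int = 4096 - 500) -> str:
    """Answer a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
            {"role": "user", "content": message},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]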

 

As for using a vector database,

https://github.com/openai/openai-cookbook/blob/main/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb

gives an example for each provider.

 

What about generating embeddings for text that is too long?

Two approaches are given. Truncation is the obvious one.

The other is chunking:

split the text into several chunks and generate an embedding for each.

You then have two options: use the chunk embeddings separately, which seems fine for search,

or merge them into a single embedding; the example does this by averaging.

https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
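
A minimal sketch of the chunk-and-average approach (assumed; the notebook's implementation differs in detail):

import numpy as np
import openai
import tiktoken

EMBEDDING_MODEL = "text-embedding-ada-002"
EMBEDDING_CTX_LENGTH = 8191  # max input tokens for text-embedding-ada-002

def long_text_embedding(text: str) -> list[float]:
    """Embed a text of arbitrary length by chunking, embedding, and averaging."""
    encoding = tiktoken.encoding_for_model(EMBEDDING_MODEL)
    tokens = encoding.encode(text)
    # Split the token sequence into chunks that fit within the model's limit.
    chunks = [tokens[i:i + EMBEDDING_CTX_LENGTH] for i in range(0, len(tokens), EMBEDDING_CTX_LENGTH)]
    embeddings, weights = [], []
    for chunk in chunks:
        response = openai.Embedding.create(model=EMBEDDING_MODEL, input=encoding.decode(chunk))
        embeddings.append(response["data"][0]["embedding"])
        weights.append(len(chunk))
    # Length-weighted average of the chunk embeddings, re-normalized to unit length.
    avg = np.average(embeddings, axis=0, weights=weights)
    avg = avg / np.linalg.norm(avg)
    return avg.tolist()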