AI: Vector Storage and Retrieval

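Step 1: Document

LangChain implements a Document abstraction, intended to represent a unit of text and its associated metadata. It has two attributes:

  • page_content: a string holding the content;
  • metadata: a dictionary holding arbitrary metadata.

The metadata attribute can capture the document's source, its relationship to other documents, and other information. A single Document object usually represents one chunk of a larger document.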

    from langchain_core.documents import Document
    documents = [
        Document(
            page_content="Dogs are great companions, known for their loyalty and friendliness.",
            metadata={"source": "mammal-pets-doc"},
        ),
        Document(
            page_content="Cats are independent pets that often enjoy their own space.",
            metadata={"source": "mammal-pets-doc"},
        ),
        Document(
            page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
            metadata={"source": "fish-pets-doc"},
        ),
        Document(
            page_content="Parrots are intelligent birds capable of mimicking human speech.",
            metadata={"source": "bird-pets-doc"},
        ),
        Document(
            page_content="Rabbits are social animals that need plenty of space to hop around.",
            metadata={"source": "mammal-pets-doc"},
        ),
    ]
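To show why the metadata dictionary is useful, here is a minimal sketch that filters documents by their source. It does not use LangChain: `TinyDocument` and `by_source` are hypothetical names introduced only for this illustration, and the texts are shortened versions of the ones above.

```python
from dataclasses import dataclass, field


@dataclass
class TinyDocument:
    """Hypothetical stand-in with the same two attributes as Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


docs = [
    TinyDocument("Dogs are great companions.", {"source": "mammal-pets-doc"}),
    TinyDocument("Goldfish are popular pets for beginners.", {"source": "fish-pets-doc"}),
    TinyDocument("Rabbits are social animals.", {"source": "mammal-pets-doc"}),
]


def by_source(items, source):
    """Keep only the documents whose metadata names the given source."""
    return [d for d in items if d.metadata.get("source") == source]


mammals = by_source(docs, "mammal-pets-doc")
print([d.page_content for d in mammals])
```

The same idea carries over to real Document objects, since they expose `page_content` and `metadata` in the same way.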
    

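Step 2: Vector search

Vector search is a common way to store and retrieve unstructured data. The main idea is to store a numeric vector for each piece of text; given a query, we encode the query into a vector of the same dimension and then use a similarity measure to find the related data.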
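The idea can be sketched in plain Python. This is a toy illustration, not what an embedding model does: the `embed` function here is a hypothetical bag-of-words encoder over a tiny vocabulary, and `search` ranks the stored texts by cosine similarity to the query.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Toy embedding: word counts over a fixed vocabulary, scaled to unit length."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a, b):
    """Dot product; equals cosine similarity because the vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


texts = [
    "cats enjoy their own space",
    "dogs are loyal companions",
    "goldfish need simple care",
]
vocab = sorted({w for t in texts for w in t.lower().split()})

# "Store": keep each text together with its vector
index = [(t, embed(t, vocab)) for t in texts]


def search(query, k=1):
    """Encode the query in the same space and return the k most similar texts."""
    q = embed(query, vocab)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [t for t, _ in ranked[:k]]


top = search("loyal dogs")
print(top)
```

A word-count encoder only matches exact words, so a query like "cat" would not find "cats" here. A learned embedding model places related meanings close together, which is why the real examples use one.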

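LangChain VectorStore objects contain methods for adding text and Document objects to the store and for querying them with various similarity metrics. They are usually initialized with an embedding model, which determines how text data is converted into numeric vectors.

Below I use the bce-embedding model as the encoding model (download the model first).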

    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import Chroma

    # Placeholder: replace with the local path of the downloaded bce-embedding model
    EMBEDDING_PATH = "/path/to/bce-embedding-model"

    # Initialize the embedding model
    model_kwargs = {'device': 'cuda'}  # use 'cpu' if no GPU is available
    encode_kwargs = {'batch_size': 64, 'normalize_embeddings': True}
    embed_model = HuggingFaceEmbeddings(
        model_name=EMBEDDING_PATH,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
    )

    # Embed the documents and index them in a Chroma vector store
    vetorstore = Chroma.from_documents(
        documents,
        embedding=embed_model,
    )
    vetorstore.similarity_search("cat")
    

Output:

    [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
     Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'}),
     Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
     Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Search with similarity scores

    vetorstore.similarity_search_with_score("cat")
    

    [(Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}), 0.9107884),
     (Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'}), 1.3231826),
     (Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}), 1.4060305),
     (Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'}), 1.4284585),
     (Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}), 1.4566814)]

The score returned above is a distance: the smaller the value, the closer the document is to the query.
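Why a smaller score means a closer match can be checked with a small sketch. It assumes the score is a squared L2 distance between unit-length vectors. As far as I know that is Chroma's default metric, and the embeddings above are normalized, but treat it as an assumption. The three vectors are made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Made-up vectors: "near" points in almost the same direction as the query
query = normalize([1.0, 0.2, 0.0])
near = normalize([0.9, 0.3, 0.1])
far = normalize([0.1, 0.2, 1.0])

for name, vec in [("near", near), ("far", far)]:
    d = squared_l2(query, vec)
    c = cosine(query, vec)
    # For unit vectors the two measures are tied together: d = 2 - 2 * c
    assert math.isclose(d, 2 - 2 * c, abs_tol=1e-9)
    print(name, "distance:", round(d, 4), "cosine:", round(c, 4))
```

Under that assumption, a distance of 0.9107884 would correspond to a cosine similarity of about 0.54, and larger distances to lower similarities.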

Search by vector

    embedding = embed_model.embed_query("cat")
    vetorstore.similarity_search_by_vector(embedding)
    

Output:

    [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
     Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'}),
     Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
     Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

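Step 3: Retrievers

LangChain VectorStore objects do not subclass Runnable, so they cannot be plugged directly into LangChain Expression Language (LCEL) chains.

LangChain Retrievers are Runnables, so they implement a standard set of methods (for example, synchronous and asynchronous invoke and batch operations) and are designed to be incorporated into LCEL chains.

We can create a simple version of this ourselves without subclassing Retriever. Once we choose which method to use for retrieving documents, we can easily create a runnable. Below we build one around the similarity_search method: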

    from langchain_core.runnables import RunnableLambda

    # Wrap similarity_search in a Runnable and fix k=1 so each query returns one document
    retriever = RunnableLambda(vetorstore.similarity_search).bind(k=1)
    print(retriever.invoke("cat"))          # single query
    print(retriever.batch(["cat", "dog"]))  # several queries at once
    

Output:

    [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})]

    [[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
     [Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'})]]
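To make the roles of bind, invoke and batch concrete, here is a plain-Python analogy. It is not LangChain's implementation: `TinyRunnable` and `fake_search` are hypothetical names, and the real Runnable classes do much more (async variants, configuration, concurrency).

```python
class TinyRunnable:
    """Hypothetical, simplified stand-in for a runnable wrapper around a function."""

    def __init__(self, func, **bound):
        self.func = func
        self.bound = bound  # keyword arguments fixed in advance

    def bind(self, **kwargs):
        """Return a new wrapper with extra keyword arguments fixed."""
        return TinyRunnable(self.func, **{**self.bound, **kwargs})

    def invoke(self, value):
        """Run the function on one input, adding the bound arguments."""
        return self.func(value, **self.bound)

    def batch(self, values):
        """Run the function on each input and collect the results in order."""
        return [self.invoke(v) for v in values]


def fake_search(query, k=4):
    """Stub search: the first k corpus entries that contain the query string."""
    corpus = ["cats like space", "dogs are loyal", "cats purr", "dogs bark"]
    return [t for t in corpus if query in t][:k]


tiny_retriever = TinyRunnable(fake_search).bind(k=1)
single = tiny_retriever.invoke("cat")
many = tiny_retriever.batch(["cat", "dog"])
print(single)
print(many)
```

The shape of the results mirrors the output above: invoke returns one list of hits, batch returns one such list per query.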

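Vector stores implement an as_retriever method that generates a Retriever, specifically a VectorStoreRetriever. These retrievers have search_type and search_kwargs attributes that identify which method of the underlying vector store to call and how to parameterize it.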

    retriever = vetorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 1},
    )
    retriever.batch(["cat", "shark"])
    

Output:

    [[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
     [Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]

Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications:

    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    chat = ChatOpenAI()
    message = """
    Answer this question using the provided context only.
    {question}
    Context:
    {context}
    """
    retriever = vetorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 1},
    )
    prompt = ChatPromptTemplate.from_messages(
        [
            ("human", message),
        ]
    )
    rag_chat = {"context": retriever, "question": RunnablePassthrough()} | prompt | chat
    response = rag_chat.invoke("tell me about cats")
    print(response.content)
    

Output:

    Cats are independent pets that often enjoy their own space.
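The data flow of the chain can be traced in plain Python. This is a sketch of the two steps before the model call, not LangChain code: `stub_retriever` and `build_prompt` are hypothetical names, the retriever returns a fixed passage, and joining the passages with newlines is my own choice for readability.

```python
TEMPLATE = """Answer this question using the provided context only.
{question}
Context:
{context}
"""


def stub_retriever(question):
    """Hypothetical retriever: returns one fixed passage whatever the question is."""
    return ["Cats are independent pets that often enjoy their own space."]


def build_prompt(question):
    """Mimic the first two chain steps: gather the inputs, then fill the template."""
    # Step 1: the input question fans out into the two template variables
    inputs = {"context": stub_retriever(question), "question": question}
    # Step 2: fill the template; retrieved passages are joined into one text
    context_text = "\n".join(inputs["context"])
    return TEMPLATE.format(question=inputs["question"], context=context_text)


prompt_text = build_prompt("tell me about cats")
print(prompt_text)
```

The printed text is what a chat model would receive in this simplified picture. Because the prompt restricts the model to the context, the answer stays close to the retrieved passage.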