RUCQA - Based on LlamaIndex & LangChain

0.Task Description

Goal:

  • Build a Knowledge-based Question-Answering System - RUCQA

Requirements:

  • No corpus is provided; it must be crawled from the archive of articles published on the People's University News website over the past year (June 2022 to May 2023), which is also the evaluation range.
  • An LLM interface must be invoked.

Evaluation:

  • 100 fill-in-the-blank questions, checked for the correctness of the answers:
      • 60 simple questions: directly extracted from the corpus (similar to reading-comprehension tasks).
      • 40 difficult questions: require summarizing and reasoning over the corpus.
  • 20 open questions, assessed on the correctness and fluency of the generated answers. There are no standard answers; generation is flexible (up to 512 characters). The basis for the generated answer must be given (up to 5 specific, relevant news pieces). This part is evaluated manually.

Examples:

For the convenience of evaluation, answers of a numeric type (e.g., counting or calculation questions) should be output directly as Arabic numerals. Other types of answers will be rewritten by the assistant according to the rules during evaluation.

1.Document Crawler

In accordance with the requirements, a total of 4,533 documents were crawled from the website: RUC NEWS

2.Overall Structure

(Figure: overall structure of RUCQA)

The overall strategy uses two routes: one based on vector retrieval and one based on keyword retrieval. The vector-based retrieval is implemented with LlamaIndex (and performs particularly well on questions that require global information). The keyword-based route uses LangChain to break the overall task down into a series of consecutive subtasks (and performs particularly well on simple questions).

Finally, the answers from the two approaches are merged to produce the final answer.

In addition, for each question category, a corresponding prompt has been designed in the keyword-based retrieval pathway to strengthen the system's ability on tasks of that particular category.

3.VRM (Vector Store Index based Retrieval Module) - Powered by LlamaIndex

3.1 Load in Documents

This phase involves loading the data. We use the SimpleDirectoryReader class and its load_data function to obtain Document objects.
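As a minimal sketch (assuming the crawled articles are stored as plain-text files under a local data/ directory, and the 0.6-era llama_index API), this step looks roughly like:

```python
from llama_index import SimpleDirectoryReader

# Read every crawled news article from a local folder into Document objects.
# The "data" path is an assumption; point it at wherever the crawler saved the corpus.
documents = SimpleDirectoryReader("data").load_data()
print(f"Loaded {len(documents)} documents")
```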

3.2 Index Construction

At this stage, we construct an index over these Document objects, specifically with GPTVectorStoreIndex. By default, OpenAI's text-davinci-003 model is employed, with its max_tokens set at 128.

Notably, we retain the default prompt templates during index construction and querying and do not add customized embeddings to our model: the defaults are adequate. For the embedding method, according to the m3e-base evaluation, the OpenAI ada-002 model (which we employ) is sufficiently effective on a range of Chinese tasks, so text2vec is not necessary. We may explore the potential of m3e-base in the future.
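A sketch of the index-construction step under the same llama_index 0.6-era API assumption, where the LLM is configured through LLMPredictor and ServiceContext; the model name and max_tokens follow the description above, and everything else (embeddings, prompt templates) is left at its defaults:

```python
from langchain.llms import OpenAI
from llama_index import GPTVectorStoreIndex, LLMPredictor, ServiceContext

# text-davinci-003 with max_tokens=128, as described above; embeddings stay at the
# default OpenAI ada-002 model and no custom prompt templates are registered.
llm_predictor = LLMPredictor(llm=OpenAI(model_name="text-davinci-003", max_tokens=128))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
```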

3.3 Query the index

Given our use of VectorStoreIndex, the VectorIndexRetriever should be used to construct our QueryEngine. At this point, we can feed the model a query to find an answer!

A useful approach here is to append the string '\nPlease note specifically: You need to answer in Chinese, and you can only refer to the materials I provide. It is not allowed to view external materials.' to the query to restrict the model.
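A sketch of the querying step under the same API assumptions; similarity_top_k=2 matches the top-k setting mentioned in Section 3.4, and the appended instruction is the string quoted above:

```python
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

retriever = VectorIndexRetriever(index=index, similarity_top_k=2)
query_engine = RetrieverQueryEngine.from_args(retriever, service_context=service_context)

restriction = ("\nPlease note specifically: You need to answer in Chinese, and you can only "
               "refer to the materials I provide. It is not allowed to view external materials.")
question = "..."  # a question from the evaluation set
response = query_engine.query(question + restriction)
print(response)
```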

3.4 More information about Vector Store Index

The Vector Store Index holds each Node along with a corresponding embedding within a Vector Store.


Querying a Vector Store Index involves retrieving the top-k most similar Nodes (in this case, we set k to 2) and passing these to our Response Synthesis module. You can find more details about this on ReadTheDocs.


4.KRM (Keywords based Retrieval Module) - Powered by LangChain

Since the provided questions are relatively straightforward, we also need a more robust retrieval method, namely one based on the keywords found in the queries.

Our pipeline for this process includes:

<0> Determine the classification of the query

<1> Extract keywords from the query

<2> Search for the most related chunks based on these keywords

<3> Find the sentence related to the answer within these chunks (which serve as reference materials here)

<4> Determine the answer based on the answer-related sentence

<5> Integrate the answer from this process with the answer from VRM into a single response

<6> Simplify the integrated answer into the required format

As is evident, we break the whole process down into manageable subtasks. Each of these subtasks is delegated to the LLM via LangChain (except for step <2>), as sketched below.
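As a hypothetical illustration of how individual steps are handed to the LLM via LangChain (the prompt wording, the search_corpus helper, and the model choice below are illustrative assumptions, not the repository's actual prompts):

```python
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = OpenAI(model_name="text-davinci-003", temperature=0)

# Step <1>: extract keywords from the query.
keyword_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["query"],
        template="Extract the key search terms from the question, separated by commas.\n"
                 "Question: {query}\nKeywords:",
    ),
)

# Steps <3>-<4>: locate the answer-related sentence in the chunks and answer from it.
answer_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["chunks", "query"],
        template="Using only the reference material below, find the sentence relevant to the "
                 "question and give the answer.\nReference: {chunks}\nQuestion: {query}\nAnswer:",
    ),
)

query = "..."                                           # a question from the evaluation set
keywords = keyword_chain.run(query=query)               # step <1>
chunks = search_corpus(keywords)                        # step <2>: plain keyword search, no LLM (hypothetical helper)
answer = answer_chain.run(chunks=chunks, query=query)   # steps <3>-<4>
```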

Specifically, for the different query categories identified by the classifier in step <0>, the concrete implementation of each step (i.e., the prompt design) may vary, but the overall idea is as described above.

The prompts used in this process play a crucial role. We have designed them with care, primarily employing In-Context Learning (ICL), and the entire procedure can, to some degree, be seen as a Chain of Thought (CoT).
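For illustration only (not the actual prompts used in the repository), an in-context-learning prompt for the final formatting step <6> might embed a couple of made-up demonstrations so the model learns to output numeric answers as Arabic numerals:

```python
from langchain.prompts import PromptTemplate

# Hypothetical few-shot (ICL) template for step <6>; the two demonstrations are
# invented placeholders showing the required output format, not real evaluation items.
simplify_prompt = PromptTemplate(
    input_variables=["question", "draft_answer"],
    template=(
        "Rewrite the draft answer in the required short format. "
        "Numeric answers must be given as Arabic numerals.\n"
        "Question: How many events were held? Draft: A total of three events were held. Answer: 3\n"
        "Question: Who gave the speech? Draft: The speech was given by the dean. Answer: the dean\n"
        "Question: {question} Draft: {draft_answer} Answer:"
    ),
)
```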
