Enable IRIS to use FAQs #187

Merged
merged 24 commits into from
Jan 30, 2025
Changes from 6 commits
Commits
2fe490b
Initial setup of faq ingestion and deletion
cremertim Dec 19, 2024
19c0d81
Initial setup of faq retrieval
cremertim Dec 25, 2024
933c045
Further faq retrieval.
cremertim Dec 26, 2024
54f5167
Working FAQ retrival
cremertim Dec 27, 2024
3b12cd6
Removed logging
cremertim Jan 6, 2025
1017477
Removed logging, added Links for FAQ answer, updated prompts
cremertim Jan 8, 2025
e8f611c
Added language
cremertim Jan 13, 2025
cfb1240
Merge branch 'main' into feature/faq/basic-faq-pipe
cremertim Jan 13, 2025
edfbfa2
Increased faq limit
cremertim Jan 13, 2025
ce2904a
Merge remote-tracking branch 'origin/feature/faq/basic-faq-pipe' into…
cremertim Jan 13, 2025
27abf69
Reformat
cremertim Jan 18, 2025
82a55c3
Fix coderabit
cremertim Jan 24, 2025
f1621ae
Fix docs
cremertim Jan 24, 2025
dcb3e15
Fixed the linter checks
cremertim Jan 27, 2025
bf52342
Fixed the linter checks
cremertim Jan 27, 2025
6163e6f
Refactored FAQ retrival pipeline to reduce code duplication
cremertim Jan 29, 2025
2fcdf4e
Refactored FAQ retrival pipeline to reduce code duplication
cremertim Jan 30, 2025
430b777
Refactored FAQ retrival pipeline to reduce code duplication
cremertim Jan 30, 2025
1242ef0
Merge branch 'main' into feature/faq/basic-faq-pipe
cremertim Jan 30, 2025
cf2feac
Remove unused import
cremertim Jan 30, 2025
dd49dc0
fix typo
cremertim Jan 30, 2025
a36dfcc
Merge branch 'main' into feature/faq/basic-faq-pipe
cremertim Jan 30, 2025
0921e1a
linter
cremertim Jan 30, 2025
1ca23bd
Merge remote-tracking branch 'origin/feature/faq/basic-faq-pipe' into…
cremertim Jan 30, 2025
2 changes: 2 additions & 0 deletions app/common/PipelineEnum.py
Original file line number Diff line number Diff line change
@@ -14,4 +14,6 @@ class PipelineEnum(str, Enum):
IRIS_SUMMARY_PIPELINE = "IRIS_SUMMARY_PIPELINE"
IRIS_LECTURE_RETRIEVAL_PIPELINE = "IRIS_LECTURE_RETRIEVAL_PIPELINE"
IRIS_LECTURE_INGESTION = "IRIS_LECTURE_INGESTION"
IRIS_FAQ_INGESTION = "IRIS_FAQ_INGESTION"
IRIS_FAQ_RETRIEVAL_PIPELINE = "IRIS_FAQ_RETRIEVAL_PIPELINE"
NOT_SET = "NOT_SET"
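Because `PipelineEnum` mixes in `str`, the two new members compare equal to their plain string values and can be looked up by value. A minimal sketch, using a trimmed copy of the enum with only the members visible in this hunk; `parse_pipeline` and its fallback to `NOT_SET` are hypothetical and not part of this PR:

```python
from enum import Enum


class PipelineEnum(str, Enum):
    # Trimmed copy for illustration; the real enum has more members.
    IRIS_FAQ_INGESTION = "IRIS_FAQ_INGESTION"
    IRIS_FAQ_RETRIEVAL_PIPELINE = "IRIS_FAQ_RETRIEVAL_PIPELINE"
    NOT_SET = "NOT_SET"


def parse_pipeline(name: str) -> PipelineEnum:
    """Map a raw string to a member; falling back to NOT_SET is an assumption."""
    try:
        return PipelineEnum(name)
    except ValueError:
        return PipelineEnum.NOT_SET


# Lookup by value returns the member itself, and the member equals its string.
assert parse_pipeline("IRIS_FAQ_INGESTION") is PipelineEnum.IRIS_FAQ_INGESTION
assert PipelineEnum.IRIS_FAQ_RETRIEVAL_PIPELINE == "IRIS_FAQ_RETRIEVAL_PIPELINE"
```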
13 changes: 13 additions & 0 deletions app/domain/data/faq_dto.py
@@ -0,0 +1,13 @@
from pydantic import BaseModel, Field


class FaqDTO(BaseModel):
    faq_id: int = Field(alias="faqId")
    course_id: int = Field(alias="courseId")
    question_title: str = Field(alias="questionTitle")
    question_answer: str = Field(alias="questionAnswer")
    course_name: str = Field(default="", alias="courseName")
    course_description: str = Field(default="", alias="courseDescription")



8 changes: 8 additions & 0 deletions app/domain/ingestion/deletionPipelineExecutionDto.py
@@ -3,6 +3,7 @@
from pydantic import Field

from app.domain import PipelineExecutionDTO, PipelineExecutionSettingsDTO
from app.domain.data.faq_dto import FaqDTO
from app.domain.data.lecture_unit_dto import LectureUnitDTO
from app.domain.status.stage_dto import StageDTO

@@ -13,3 +14,10 @@ class LecturesDeletionExecutionDto(PipelineExecutionDTO):
initial_stages: Optional[List[StageDTO]] = Field(
default=None, alias="initialStages"
)

class FaqDeletionExecutionDto(PipelineExecutionDTO):
    faq: FaqDTO = Field(..., alias="pyrisFaqWebhookDTO")
    settings: Optional[PipelineExecutionSettingsDTO]
    initial_stages: Optional[List[StageDTO]] = Field(
        default=None, alias="initialStages"
    )
9 changes: 9 additions & 0 deletions app/domain/ingestion/ingestion_pipeline_execution_dto.py
@@ -3,6 +3,7 @@
from pydantic import Field

from app.domain import PipelineExecutionDTO, PipelineExecutionSettingsDTO
from app.domain.data.faq_dto import FaqDTO
from app.domain.data.lecture_unit_dto import LectureUnitDTO
from app.domain.status.stage_dto import StageDTO

@@ -13,3 +14,11 @@ class IngestionPipelineExecutionDto(PipelineExecutionDTO):
initial_stages: Optional[List[StageDTO]] = Field(
default=None, alias="initialStages"
)

class FaqIngestionPipelineExecutionDto(PipelineExecutionDTO):
    faq: FaqDTO = Field(..., alias="pyrisFaqWebhookDTO")
    settings: Optional[PipelineExecutionSettingsDTO]
    initial_stages: Optional[List[StageDTO]] = Field(
        default=None, alias="initialStages"
    )

74 changes: 69 additions & 5 deletions app/pipeline/chat/course_chat_pipeline.py
@@ -10,7 +10,7 @@
from langchain_core.messages import SystemMessage
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import (
ChatPromptTemplate,
ChatPromptTemplate, SystemMessagePromptTemplate,
)
from langchain_core.runnables import Runnable
from langsmith import traceable
@@ -20,7 +20,7 @@
InteractionSuggestionPipeline,
)
from .lecture_chat_pipeline import LectureChatPipeline
from ..shared.citation_pipeline import CitationPipeline
from ..shared.citation_pipeline import CitationPipeline, InformationType
from ..shared.utils import generate_structured_tools_from_functions
from ...common.message_converters import convert_iris_message_to_langchain_message
from ...common.pyris_message import PyrisMessage
@@ -42,8 +42,10 @@
)
from ...domain import CourseChatPipelineExecutionDTO
from app.common.PipelineEnum import PipelineEnum
from ...retrieval.faq_retrieval import FaqRetrieval
from ...retrieval.lecture_retrieval import LectureRetrieval
from ...vector_database.database import VectorDatabase
from ...vector_database.faq_schema import FaqSchema
from ...vector_database.lecture_schema import LectureSchema
from ...web.status.status_update import (
CourseChatStatusCallback,
@@ -80,6 +82,7 @@ class CourseChatPipeline(Pipeline):
variant: str
event: str | None
retrieved_paragraphs: List[dict] = None
retrieved_faqs: List[dict] = None

def __init__(
self,
@@ -105,7 +108,8 @@ def __init__(
self.callback = callback

self.db = VectorDatabase()
self.retriever = LectureRetrieval(self.db.client)
self.lecture_retriever = LectureRetrieval(self.db.client)
self.faq_retriever = FaqRetrieval(self.db.client)
self.suggestion_pipeline = InteractionSuggestionPipeline(variant="course")
self.citation_pipeline = CitationPipeline()

@@ -273,7 +277,7 @@ def lecture_content_retrieval() -> str:
Only use this once.
"""
self.callback.in_progress("Retrieving lecture content ...")
self.retrieved_paragraphs = self.retriever(
self.retrieved_paragraphs = self.lecture_retriever(
chat_history=history,
student_query=query.contents[0].text_content,
result_limit=5,
@@ -293,6 +297,37 @@ def lecture_content_retrieval() -> str:
result += lct
return result

        def faq_content_retrieval() -> str:
            """
            Use this tool to retrieve information from indexed FAQs.
            It is suitable when no other tool fits, when the question is common or frequently asked,
            or when the question could be answered effectively by an FAQ. Also use it if the question is
            explicitly organizational and course-related, for example "What is the course structure?",
            "How do I enroll?", or exam-related questions such as "When is the exam?".
            The tool performs a RAG retrieval based on the chat history to find the most relevant FAQs.
            Each FAQ follows this format: FAQ ID, FAQ Question, FAQ Answer.
            Respond to the query concisely and solely using the answer from the relevant FAQs.
            This tool should only be used once per query.
            """
            self.callback.in_progress("Retrieving faq content ...")
            self.retrieved_faqs = self.faq_retriever(
                chat_history=history,
                student_query=query.contents[0].text_content,
                result_limit=5,
                course_name=dto.course.name,
                course_id=dto.course.id,
                base_url=dto.settings.artemis_base_url,
            )

            # Concatenate one bracketed record per retrieved FAQ.
            result = ""
            for faq in self.retrieved_faqs:
                res = "[FAQ ID: {}, FAQ Question: {}, FAQ Answer: {}]".format(
                    faq.get(FaqSchema.FAQ_ID.value),
                    faq.get(FaqSchema.QUESTION_TITLE.value),
                    faq.get(FaqSchema.QUESTION_ANSWER.value),
                )
                result += res
            return result
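
The string that `faq_content_retrieval` hands back to the agent can be sketched in isolation. This is only an illustration: the property names below are assumptions (the real ones come from `FaqSchema`, which is not shown in this diff), and `format_faqs` is a hypothetical helper rather than code from the PR.

```python
from typing import Dict, List

# Assumed property names; the real values live in FaqSchema (not shown here).
FAQ_ID = "faq_id"
QUESTION_TITLE = "question_title"
QUESTION_ANSWER = "question_answer"


def format_faqs(faqs: List[Dict[str, object]]) -> str:
    """Render each retrieved FAQ as one bracketed record and concatenate them."""
    return "".join(
        "[FAQ ID: {}, FAQ Question: {}, FAQ Answer: {}]".format(
            faq.get(FAQ_ID), faq.get(QUESTION_TITLE), faq.get(QUESTION_ANSWER)
        )
        for faq in faqs
    )


sample = [
    {FAQ_ID: 1, QUESTION_TITLE: "When is the exam?", QUESTION_ANSWER: "See the course page."},
    {FAQ_ID: 2, QUESTION_TITLE: "How do I enroll?"},  # a missing answer renders as None
]
print(format_faqs(sample))
```

Because the lookup uses `dict.get`, a missing property shows up as the literal text `None` in the tool output instead of raising, which may or may not be what the agent prompt expects.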

if dto.user.id % 3 < 2:
iris_initial_system_prompt = tell_iris_initial_system_prompt
begin_agent_prompt = tell_begin_agent_prompt
@@ -391,6 +426,11 @@ def lecture_content_retrieval() -> str:
if self.should_allow_lecture_tool(dto.course.id):
tool_list.append(lecture_content_retrieval)

if self.should_allow_faq_tool(dto.course.id):
tool_list.append(faq_content_retrieval)



tools = generate_structured_tools_from_functions(tool_list)
# No idea why we need this extra contrary to exercise chat agent in this case, but solves the issue.
params.update({"tools": tools})
@@ -411,9 +451,12 @@ def lecture_content_retrieval() -> str:

if self.retrieved_paragraphs:
self.callback.in_progress("Augmenting response ...")
out = self.citation_pipeline(self.retrieved_paragraphs, out)
out = self.citation_pipeline(self.retrieved_paragraphs, out, InformationType.PARAGRAPHS)
self.tokens.extend(self.citation_pipeline.tokens)

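if self.retrieved_faqs:
    self.callback.in_progress("Augmenting response ...")
    out = self.citation_pipeline(
        self.retrieved_faqs,
        out,
        InformationType.FAQS,
        base_url=dto.settings.artemis_base_url,
    )
    # Assumption: FAQ citation tokens should be tracked like the lecture citation tokens above.
    self.tokens.extend(self.citation_pipeline.tokens)
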
self.callback.done("Response created", final_result=out, tokens=self.tokens)

# try:
@@ -465,9 +508,30 @@ def should_allow_lecture_tool(self, course_id: int) -> bool:
return len(result.objects) > 0
return False

    def should_allow_faq_tool(self, course_id: int) -> bool:
        """
        Checks if there are indexed faqs for the given course

        :param course_id: The course ID
        :return: True if there are indexed faqs for the course, False otherwise
        """
        if course_id:
            # Fetch at most one faq that matches the course ID; a single hit is enough
            result = self.db.faqs.query.fetch_objects(
                filters=Filter.by_property(FaqSchema.COURSE_ID.value).equal(
                    course_id
                ),
                limit=1,
                return_properties=[FaqSchema.COURSE_NAME.value],
            )
            return len(result.objects) > 0
        return False
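
The gating logic (offer a retrieval tool only when the course has at least one indexed item) can be sketched without a vector database. Here plain dictionaries stand in for the Weaviate collections, and `has_indexed_items` and `build_tool_list` are hypothetical names, so treat this as an illustration of the pattern rather than the pipeline's code.

```python
from typing import Callable, Dict, List


def has_indexed_items(store: Dict[int, List[dict]], course_id: int) -> bool:
    """Stand-in for the limit=1 existence query: True if the course has any item."""
    if not course_id:  # mirrors `if course_id:` above, so 0 and None are rejected
        return False
    return len(store.get(course_id, [])) > 0


def build_tool_list(
    course_id: int,
    lecture_store: Dict[int, List[dict]],
    faq_store: Dict[int, List[dict]],
    lecture_tool: Callable[[], str],
    faq_tool: Callable[[], str],
) -> List[Callable[[], str]]:
    """Only expose a retrieval tool when there is something to retrieve."""
    tools: List[Callable[[], str]] = []
    if has_indexed_items(lecture_store, course_id):
        tools.append(lecture_tool)
    if has_indexed_items(faq_store, course_id):
        tools.append(faq_tool)
    return tools


def lecture_tool() -> str:
    return "lecture content"


def faq_tool() -> str:
    return "faq content"


# Course 7 has FAQs but no lectures, so only the FAQ tool is offered.
tools = build_tool_list(7, {}, {7: [{"faq_id": 1}]}, lecture_tool, faq_tool)
```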


def datetime_to_string(dt: Optional[datetime]) -> str:
if dt is None:
return "No date provided"
else:
return dt.strftime("%Y-%m-%d %H:%M:%S")


146 changes: 146 additions & 0 deletions app/pipeline/faq_ingestion_pipeline.py
@@ -0,0 +1,146 @@
import logging
import threading
from typing import Optional, List, Dict
from langchain_core.output_parsers import StrOutputParser
from weaviate import WeaviateClient
from weaviate.classes.query import Filter
from . import Pipeline
from ..domain.data.faq_dto import FaqDTO

from app.domain.ingestion.ingestion_pipeline_execution_dto import (
    FaqIngestionPipelineExecutionDto,
)
from ..llm.langchain import IrisLangchainChatModel
from ..vector_database.faq_schema import FaqSchema, init_faq_schema
from ..ingestion.abstract_ingestion import AbstractIngestion
from ..llm import (
    BasicRequestHandler,
    CompletionArguments,
    CapabilityRequestHandler,
    RequirementList,
)
from ..web.status.faq_ingestion_status_callback import FaqIngestionStatus

# Module-level logger for this pipeline.
logger = logging.getLogger(__name__)

# Serializes batch writes: only one thread may run a batch update at a time.
batch_update_lock = threading.Lock()

class FaqIngestionPipeline(AbstractIngestion, Pipeline):

    def __init__(
        self,
        client: WeaviateClient,
        dto: Optional[FaqIngestionPipelineExecutionDto],
        callback: FaqIngestionStatus,
    ):
        super().__init__()
        self.client = client
        self.collection = init_faq_schema(client)
        self.dto = dto
        self.llm_embedding = BasicRequestHandler("embedding-small")
        self.callback = callback
        request_handler = CapabilityRequestHandler(
            requirements=RequirementList(
                gpt_version_equivalent=4.25,
                context_length=16385,
                privacy_compliance=True,
            )
        )
        completion_args = CompletionArguments(temperature=0.2, max_tokens=2000)
        self.llm = IrisLangchainChatModel(
            request_handler=request_handler, completion_args=completion_args
        )
        self.pipeline = self.llm | StrOutputParser()
        self.tokens = []

    def __call__(self) -> bool:
        try:
            self.callback.in_progress("Deleting old faq from database...")
            self.delete_faq(
                self.dto.faq.faq_id,
                self.dto.faq.course_id,
            )
            self.callback.done("Old faq removed")
            self.callback.in_progress("Ingesting faqs into database...")
            self.batch_update(self.dto.faq)
            self.callback.done("Faq Ingestion Finished", tokens=self.tokens)
            logger.info(
                f"Faq ingestion pipeline finished successfully for faq: {self.dto.faq.faq_id}"
            )
            return True
        except Exception as e:
            logger.error(f"Error updating faq: {e}")
            self.callback.error(
                f"Failed to ingest faq into the database: {e}",
                exception=e,
                tokens=self.tokens,
            )
            return False

    def batch_update(self, faq: FaqDTO):
        """
        Batch update the faq into the database
        This method is thread-safe and can only be executed by one thread at a time.
        Weaviate limitation.
        """
        with batch_update_lock:
            with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
                try:
                    embed_chunk = self.llm_embedding.embed(
                        f"{faq.question_title} : {faq.question_answer}"
                    )
                    faq_dict = faq.model_dump()

                    batch.add_object(properties=faq_dict, vector=embed_chunk)
                except Exception as e:
                    logger.error(f"Error updating faq: {e}")
                    # Re-raise so that __call__ reports the error through the
                    # callback instead of going on to callback.done().
                    raise

    def delete_old_faqs(self, faqs: list[FaqDTO]):
        """
        Delete the given faqs from the database
        """
        try:
            for faq in faqs:
                if self.delete_faq(faq.faq_id, faq.course_id):
                    logger.info("Faq deleted successfully")
                else:
                    logger.error("Failed to delete faq")
            self.callback.done("Old faqs removed")
            return True
        except Exception as e:
            logger.error(f"Error deleting faqs: {e}")
            self.callback.error("Error while removing old faqs")
            return False

    def delete_faq(self, faq_id, course_id):
        """
        Delete the faq from the database
        """
        try:
            # Match on both the faq ID and the course ID.
            self.collection.data.delete_many(
                where=Filter.by_property(FaqSchema.FAQ_ID.value).equal(faq_id)
                & Filter.by_property(FaqSchema.COURSE_ID.value).equal(course_id)
            )
            logger.info(f"successfully deleted faq with id {faq_id}")
            return True
        except Exception as e:
            logger.error(f"Error deleting faq: {e}", exc_info=True)
            return False


    def chunk_data(self, path: str) -> List[Dict[str, str]]:
        """
        Faqs are so small that they do not need to be chunked into smaller parts
        """
        return []
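
The ingestion flow above amounts to "delete any old copy, then insert the new one, with writes serialized by a module-level lock". A rough sketch of that pattern follows, with an in-memory list standing in for the Weaviate collection. `InMemoryFaqStore` and `ingest` are hypothetical names, and the dictionary keys are assumed to match the `FaqDTO` field names.

```python
import threading
from typing import List

# One module-level lock, mirroring batch_update_lock in the diff.
_write_lock = threading.Lock()


class InMemoryFaqStore:
    """Hypothetical stand-in for the FAQ collection."""

    def __init__(self) -> None:
        self.objects: List[dict] = []

    def delete(self, faq_id: int, course_id: int) -> None:
        # Drop every object that matches both IDs, like the delete_many filter.
        self.objects = [
            o
            for o in self.objects
            if not (o["faq_id"] == faq_id and o["course_id"] == course_id)
        ]

    def add(self, faq: dict) -> None:
        self.objects.append(faq)


def ingest(store: InMemoryFaqStore, faq: dict) -> bool:
    """Delete the old copy, then insert the new one; report failure as False."""
    try:
        store.delete(faq["faq_id"], faq["course_id"])
        with _write_lock:  # only one writer at a time
            store.add(faq)
        return True
    except Exception:
        return False


store = InMemoryFaqStore()
ingest(store, {"faq_id": 1, "course_id": 7, "question_answer": "old"})
ingest(store, {"faq_id": 1, "course_id": 7, "question_answer": "new"})
# Re-ingesting the same FAQ replaces it rather than duplicating it.
```

Deleting before inserting is what makes re-ingestion idempotent here: without the delete, the append-only store would accumulate one copy per update.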

32 changes: 32 additions & 0 deletions app/pipeline/prompts/faq_citation_prompt.txt
@@ -0,0 +1,32 @@
In the paragraphs below you are provided with an answer to a question. Underneath the answer you will find the faqs that the answer was based on.
Add citations of the faqs to the answer. Cite the faqs in brackets after the sentence where the information is used in the answer.
At the end of the answer, list each source with its corresponding number and provide the FAQ Question title, and a clickable link in this format: [1] <a href="URL">"FAQ Question title"</a>.
Do not include the actual faqs, only the citations at the end.
Please do not use the FAQ ID as the citation number, instead, use the order of the citations in the answer.
Only include the citations of the faqs that are relevant to the answer.
If the answer actually does not contain any information from the faqs, please do not include any citations and return '!NONE!'.
But if the answer contains information from the faqs, ALWAYS include citations.

Here is an example of how to rewrite the answer with citations (ONLY ADD CITATION IF THE PROVIDED FAQS ARE RELEVANT TO THE ANSWER):
"
Lorem ipsum dolor sit amet, consectetur adipiscing elit [1]. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua [2].

[1] <a href="http://localhost:9000/courses/1/faq?faqId=1">FAQ question title 1</a>.
[2] <a href="http://localhost:9000/courses/1/faq?faqId=2">FAQ question title 2</a>.
"

Note: If there is no link available, please do not include the link in the citation. For example, if citation 1 does not have a link, it should look like this:
[1] "FAQ question title"
but if citation 2 has a link, it should look like this:
[2] <a href="URL">"FAQ question title"</a>

Here are the answer and the faqs:

Answer without citations:
{Answer}

Faqs with their FAQ ID, CourseId, FAQ Question title and FAQ Question Answer and the Link to the FAQ:
{Paragraphs}

Answer with citations (ensure empty line between the message and the citations):
If the answer actually does not contain any information from the faqs, please do not include any citations and return '!NONE!'.
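
How this template is consumed is not shown in this part of the diff, so the following is only a sketch under stated assumptions: the template is filled through its `{Answer}` and `{Paragraphs}` placeholders, the FAQ link follows the shape of the example citations above, and a `!NONE!` reply means the original answer is kept. All three helper names are hypothetical.

```python
# Shortened stand-in for faq_citation_prompt.txt; only the placeholders matter here.
TEMPLATE = "Answer without citations:\n{Answer}\n\nFaqs:\n{Paragraphs}\n"


def faq_link(base_url: str, course_id: int, faq_id: int) -> str:
    """Build a link shaped like the example citations in the prompt."""
    return f"{base_url.rstrip('/')}/courses/{course_id}/faq?faqId={faq_id}"


def build_prompt(answer: str, faqs_text: str) -> str:
    """Fill the two placeholders of the citation prompt."""
    return TEMPLATE.format(Answer=answer, Paragraphs=faqs_text)


def apply_citation_result(original: str, llm_output: str) -> str:
    """Keep the original answer when the model signals that no FAQ was used."""
    if "!NONE!" in llm_output:
        return original
    return llm_output


link = faq_link("http://localhost:9000", 1, 2)
prompt = build_prompt("The exam is in week 12.", f"FAQ ID: 2, Link: {link}")
```

Note that `str.format` only works on the real template as long as it contains no other curly braces; the version in this diff satisfies that.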