From b9fd0f542f82125cf107e3b15a60b61cfe287441 Mon Sep 17 00:00:00 2001
From: Michael Shteyn <5659756+mshteyn@users.noreply.github.com>
Date: Wed, 5 Jun 2024 16:41:44 -0400
Subject: [PATCH 1/4] Update README.md

---
 README.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 72aa64e..08c3dbd 100644
--- a/README.md
+++ b/README.md
@@ -55,16 +55,20 @@ Our final database consisted of 190,000 restaurant reviews from over 20,000 inde
 ## Exploratory Data Analysis

image

-A visualization of restaurants in the Philadelphia area sorted by how many times they have been reviewed by users through the Google Local API.
+Visualization of restaurants in the Philadelphia area sorted by how many times they have been reviewed by users through the Google Local API.

image

-A histogram depicting the lenghts of reviews stored in our vector database.
+Histogram depicts the lenghts of reviews stored in our vector database.
 ## Modeling Approach
 Text reviews were embedded as 1024 dimensional vectors using Alibaba's sentence transformer model (GTE-Large v1.5) and stored in a Pinecone vector database. User queries were embedded at runtime and compared to stored embeddings with cosine similarity. The top 5 closest reviews to the user query were retreived from the database and provided the context with which Llama 2 (13B Instruction-tuned) was prompted before generating a response to the user query.
+

image

+
+Worflow.
+
 ## Model Evaluation

image


From 96cec5aefe00dbe9dd53c8fe0515da3bc20b0dd7 Mon Sep 17 00:00:00 2001
From: Michael Shteyn <5659756+mshteyn@users.noreply.github.com>
Date: Wed, 5 Jun 2024 16:42:18 -0400
Subject: [PATCH 2/4] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 08c3dbd..f4a8346 100644
--- a/README.md
+++ b/README.md
@@ -53,7 +53,7 @@ identify and filter reviews containing specific mention of food items.
 Our final database consisted of 190,000 restaurant reviews from over 20,000 independent restaurants.
 ## Exploratory Data Analysis
-

image

+

image

 Visualization of restaurants in the Philadelphia area sorted by how many times they have been reviewed by users through the Google Local API.

From 455d6409ced936af935a4ac5592586630ff9dd1e Mon Sep 17 00:00:00 2001
From: Michael Shteyn <5659756+mshteyn@users.noreply.github.com>
Date: Wed, 5 Jun 2024 16:42:52 -0400
Subject: [PATCH 3/4] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index f4a8346..17af797 100644
--- a/README.md
+++ b/README.md
@@ -53,7 +53,7 @@ identify and filter reviews containing specific mention of food items.
 Our final database consisted of 190,000 restaurant reviews from over 20,000 independent restaurants.
 ## Exploratory Data Analysis
-

image

+

image

 Visualization of restaurants in the Philadelphia area sorted by how many times they have been reviewed by users through the Google Local API.
@@ -65,7 +65,7 @@ Histogram depicts the lenghts of reviews stored in our vector database.
 Text reviews were embedded as 1024 dimensional vectors using Alibaba's sentence transformer model (GTE-Large v1.5) and stored in a Pinecone vector database. User queries were embedded at runtime and compared to stored embeddings with cosine similarity. The top 5 closest reviews to the user query were retreived from the database and provided the context with which Llama 2 (13B Instruction-tuned) was prompted before generating a response to the user query.
-

image

+

image

 Worflow.

From d50e04a72d1684ea12d6c613cbacd5143e31c212 Mon Sep 17 00:00:00 2001
From: Michael Shteyn <5659756+mshteyn@users.noreply.github.com>
Date: Wed, 5 Jun 2024 16:43:45 -0400
Subject: [PATCH 4/4] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 17af797..d3a14c2 100644
--- a/README.md
+++ b/README.md
@@ -88,7 +88,7 @@ Flavor-Finder achieved an average score of 3.1 out of 4, outperforming the origi
 ## Challenges
-Significant GPU resources are required to load the necessary components of the model.
+GPU resources are required to perform inference efficient.
 Updating the vector database requires access to subscription-based Google API keys which were beyond our budget. We've developed a tool that enables live scrapping of the Google Local API to perform database updates which we have updated within the limits of free use. As a result, our vector database is necessarily dated by the age of the dataset we had access to, containing reviews through 2020.