Snowflake Data Pipeline for Semantic Search and Topic Modelling (BERTopic)

This project implements an end-to-end data pipeline in Snowflake to enable semantic search and topic modelling on documents. It lets users efficiently retrieve and analyze documents by keyword, and uncover hidden themes and topics within them.

Overview

The project consists of three main components:

  1. Data Ingestion and Transformation: Implemented using Snowflake's SQL and Snowpark worksheets for Extract, Load, and Transform (ELT) processes.
  2. Semantic Search: Utilizes the pre-trained language model all-MiniLM-L6-v2 for text embedding and for computing cosine similarity scores (a minimal sketch follows this list).
  3. Topic Modelling: Utilizes the BERTopic framework to identify and extract hidden themes and topics from the document corpus.
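
To make the semantic-search component concrete, here is a minimal, self-contained sketch of the same idea outside Snowflake. The model name matches the one used by the pipeline; the texts, variable names, and ranking logic are illustrative placeholders, not the repo's actual code.

    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    # Load the same pre-trained model the pipeline uses.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Placeholder chunks and query; the real pipeline reads these from
    # Snowflake tables.
    chunks = ["Snowflake stores and transforms the raw documents.",
              "BERTopic clusters chunk embeddings into topics."]
    query = "topic clustering"

    # Embed both sides and rank chunks by cosine similarity to the query.
    chunk_vecs = model.encode(chunks)
    query_vec = model.encode([query])
    scores = cosine_similarity(query_vec, chunk_vecs)[0]
    print(chunks[scores.argmax()], scores.max())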

Architecture

Pipelines at a glance

Stored Procedures

The Stored_Procedures directory contains the stored procedures used in the Snowflake environment.

LOAD_AND_REFRESH_DATABASE_sub-procedures

This directory contains stored procedures related to loading and refreshing the database.

  • SP1.1_load_to_table_from_stage.txt: Script to load data into a Snowflake table from the staging area.
  • SP1.2_snowpark_transformations.txt: Script for Snowpark transformations.
  • SP1.3_chunking_text.txt: Script for chunking text (a hypothetical sketch of this step follows the list).
  • SP1.4_embedding_chunks.txt: Script for embedding chunks.
  • SP1.5_tokenizing_chunks.txt: Script for tokenizing chunks.
  • SP1.6_embedding_tokens.txt: Script for embedding tokens.
  • SP1.7_dimension_reduction.txt: Script for reducing vector dimensions.
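
Chunking can be done in many ways; the function below is a hypothetical sketch, assuming paragraph-based splitting with a character cap, and is not taken from SP1.3 itself.

    def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
        # Split on blank lines into paragraphs, then cap each chunk's
        # length so it stays within the embedding model's input budget.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks = []
        for p in paragraphs:
            while len(p) > max_chars:
                chunks.append(p[:max_chars])
                p = p[max_chars:]
            chunks.append(p)
        return chunks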

RUN_SIMILARITY_SEARCH_sub-procedures

This directory contains stored procedures related to running similarity searches.

  • SP2.1_INSERT_QUERY.txt: Script to insert queries into a Snowflake table.
  • SP2.2_GET_SIMILARITY.txt: Script to compute the similarity score between the query and every chunk.
  • SP2.3_GET_TOKEN_SIMILARITY.txt: Script to compute the token similarity score between the query and every token.
  • SP2.4_GET_TOP_5_TOKENS.txt: Script to get the top 5 tokens for each chunk.
  • SP2.5_VIEW_CREATION.txt: Script to create a temporary table for the subsequent topic modelling step.
  • SP2.6_ENTITY_EXTRACTION.txt: Script to extract only the organization entities from the NER field (see the sketch after this list).
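
The entity-extraction idea in SP2.6 can be pictured with a short spaCy sketch. The model name and function below are assumptions for illustration; only the "keep ORG entities" filter reflects what the procedure is described as doing.

    import spacy

    # Assumed English pipeline; any installed spaCy model with NER works.
    nlp = spacy.load("en_core_web_sm")

    def extract_orgs(text: str) -> list[str]:
        # Run NER and keep only entities labelled as organizations.
        return [ent.text for ent in nlp(text).ents if ent.label_ == "ORG"]

    print(extract_orgs("Snowflake Inc. announced a partnership with NVIDIA."))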

Main Scripts/Procedures

  • main_LOAD_AND_REFRESH_DATABASE.txt: Main script that runs all seven load-and-refresh sub-procedures sequentially to load and refresh the database.
  • main_RUN_SIMILARITY_SEARCH.txt: Main script that runs all six similarity-search sub-procedures.
  • main_BERTopic_Modelling.txt: Main script for performing topic modelling using the BERTopic framework (a minimal sketch follows this list).
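
For readers unfamiliar with BERTopic, this is a minimal sketch of the framework's core API, assuming the document chunks have already been pulled into a Python list; the actual script runs inside Snowflake and its exact parameters are not shown here.

    from bertopic import BERTopic

    # Placeholder corpus; BERTopic needs a reasonably large list of
    # documents (hundreds of chunks) to fit meaningful topics.
    docs = ["chunk text 1", "chunk text 2"]  # many more in practice

    # Reuse the same sentence-transformer for embeddings, then fit topics.
    topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
    topics, probs = topic_model.fit_transform(docs)
    print(topic_model.get_topic_info())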

User-Defined Functions

The User_Defined_Functions directory contains user-defined functions used in the Snowflake environment.

  • udf_CHUNK_TEXT.txt: Function to chunk text into smaller paragraphs or sentences.
  • udf_MiniLM_L6_v2_embedding.txt: Function to embed text into vectors using all-MiniLM-L6-v2.
  • udf_py_spacy.txt: Function using Python's spaCy library for text processing and tokenizing.
  • udf_sklearn_cosine_similarity.txt: Function to compute cosine similarity scores using scikit-learn (a registration sketch follows this list).
  • udf_py_spacy_NER.txt: Function using Python's spaCy library for Named Entity Recognition.
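
Snowpark UDFs of this kind are typically registered from Python. The sketch below shows one plausible registration for a cosine-similarity UDF; the connection parameters, UDF name, and types are illustrative assumptions, not the repo's actual definitions.

    from snowflake.snowpark import Session
    from snowflake.snowpark.functions import udf
    from snowflake.snowpark.types import ArrayType, FloatType

    # Fill in real credentials; these values are placeholders.
    connection_parameters = {"account": "<ACCOUNT>", "user": "<USER>",
                             "password": "<PASSWORD>"}
    session = Session.builder.configs(connection_parameters).create()

    @udf(name="SKLEARN_COSINE_SIMILARITY", replace=True, session=session,
         packages=["scikit-learn"],
         input_types=[ArrayType(), ArrayType()],  # arrays of floats
         return_type=FloatType())
    def cosine_sim(a: list, b: list) -> float:
        # Import inside the UDF so the package resolves on the server side.
        from sklearn.metrics.pairwise import cosine_similarity
        return float(cosine_similarity([a], [b])[0][0])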

Usage

  1. Set up Environment: Before executing the scripts, ensure you have access to a Snowflake environment with the necessary permissions.

  2. Execution Steps:

    • Open a Snowflake SQL or Snowpark worksheet.
    • Copy the contents of each script and execute them sequentially to create the sub-procedures, functions, and main procedures.
    • To run the main procedures, use the CALL command (a Snowpark alternative is sketched below):
      • CALL LOAD_AND_REFRESH_DATABASE()
      • CALL RUN_SIMILARITY_SEARCH('QUERY')
      • CALL TOPIC_MODEL('limit')
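
The same procedures can also be invoked from Snowpark Python, assuming a session like the one sketched earlier; the query string and limit below are placeholders.

    # Equivalent calls through an existing Snowpark session.
    session.sql("CALL LOAD_AND_REFRESH_DATABASE()").collect()
    result = session.call("RUN_SIMILARITY_SEARCH", "renewable energy funding")
    session.call("TOPIC_MODEL", "10")  # placeholder limit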