Research and set up vector database system #3

Nyumat · 2024-10-22T05:41:01Z

Vector databases are at the core of how ML applications do things like retrieve relevant information. I could talk about it a lot more in-depth, but I recommend you read the linked article from Pinecone, one of the leading vector DB solutions, if you'd like to know more.

As for our options, there are many, many tools we can use here, so I'll list a few:

Pinecone
Supabase
Dewy

I could go on and on, but I found this nice comparison on Reddit of the popular options.

Some others include Astra and Momento.

cshafizadeh · 2024-10-25T23:29:17Z

Options:

pgvector (PostgreSQL + pgvector extension): SQL is popular and easy to learn, so it should be widely accessible. Postgres also doesn't have a db size limitation; however, the db will have to be self-hosted. https://github.com/pgvector/pgvector
Pinecone: Cloud-based and provides documentation and steps for setting up a vector db (easy to learn). The free plan does have a 2 GB database size limit. Includes built-in distance metrics (Euclidean, cosine, dot product).
ChromaDB: Open-source vector DB similar to pgvector but more specialized for vector db use. Has metadata support, allowing context to be added to embeddings.

| Feature | pgvector (PostgreSQL + pgvector) | Pinecone | ChromaDB |
| --- | --- | --- | --- |
| Purpose | General-purpose SQL DB with vector support | Dedicated vector database | Vector-focused database |
| Querying | SQL + vector similarity (cosine, inner product, L2) | Vector similarity search with optimized ANN | Vector similarity, metadata filtering |
| Deployment Complexity | Moderate; requires PostgreSQL setup and tuning | Easy setup on managed cloud platform | Lightweight, minimal setup |
| Scalability | Highly scalable, enterprise-ready | High scalability, serverless | Good for small to mid-scale applications |
| Performance | Slower for high-frequency vector-only searches | Optimized for high-performance vector search | Fast for vector-focused workloads |
| Storage Options | Fully durable, transactional SQL storage | Managed cloud storage | In-memory, file-based (DuckDB + Parquet) |
| Metadata Support | Yes, through PostgreSQL’s JSONB fields | Yes, with filtering options | Yes, with filtering options |
| Best Use Case | Mixed data needs (SQL + vectors), enterprise workloads | Large-scale, high-performance vector workloads | Vector search with lightweight persistence |
| Open Source | Yes, with PostgreSQL (pgvector extension) | No, proprietary | Yes |
| Cost | Depends on PostgreSQL deployment | Paid, with free tier for limited use | Free and open-source |
| Dimension Limits | Handles high dimensions but may need optimization | High-dimensional support | High-dimensional support |
| Support for Filtering | Advanced filtering, SQL-based queries | Supports metadata filtering | Supports metadata filtering |
| Management Interface | SQL client, pgAdmin, managed SQL services | Pinecone Console | CLI and simple API |
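
For anyone unfamiliar with the three similarity measures named in the Querying row (cosine, inner product, L2), here is a minimal plain-Python sketch of what each one computes. The function names and the toy 2-dimensional vectors are made up for illustration; a real vector DB computes these internally over much higher-dimensional embeddings.

```python
import math


def dot(a, b):
    """Inner (dot) product: larger means more similar for comparable-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: ignores vector length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, purely illustrative.
query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # points the same way, but is twice as long
orthogonal = [0.0, 3.0]       # unrelated direction

print(cosine_similarity(query, same_direction))  # 1.0 (length is ignored)
print(l2_distance(query, same_direction))        # 1.0 (length matters)
print(dot(query, orthogonal))                    # 0.0
```

The practical takeaway is that the metric should match how the chosen embeddings model was trained, which is one more reason the model choice in the note below matters.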

Note: An embeddings model will need to be chosen to go with the vector database.

Nyumat · 2024-10-26T22:06:00Z

Nice, thanks for listing this out.

What do you think of using pgvector with Supabase, so we get the best of both worlds: a customizable system for storing vectors, and a managed instance of the database for easy setup?

Check out this

It seems like all new Supabase dbs support HNSW vector indexes now too. Idk, what's your take? Pinecone and Chroma would be nice as well.

What we really want is a system that lets us store the most vectors with the least effort.

cshafizadeh · 2024-10-27T02:44:15Z

That's a pretty good idea! I think Supabase combined with pgvector would be a really powerful combo. It would allow us to use SQL, reducing the learning curve of the db, and Supabase would let us host it in the cloud and have one central db instead of individual ones. HOWEVER, the database size on the free plan of Supabase is only 500 MB (https://supabase.com/pricing). If the goal is a system to store the most vectors, then we would need to know roughly how many documents we plan on uploading and whether that would stay under 500 MB. For reference, uploading a chunk of 2000 words would take up ~10 KB. We could possibly filter out stopwords from the chunks to make them smaller, but we would need to test whether that affects the context given to the LLM at runtime. Supabase does have really good integration with pgvector though, and I like the support for the new HNSW indexes.

I think the Supabase + pgvector combo is the best pick for the project as long as we don't think we are going to go over 500 MB of chunks.
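
To make the 500 MB question concrete, here is a rough back-of-the-envelope sketch. The ~10 KB per chunk figure is the one from this comment; everything else is an assumption for illustration: a 1536-dimension float32 embedding per chunk, decimal megabytes, and no index or row overhead (so the real capacity will be lower). `max_chunks` is a hypothetical helper, not part of any SDK.

```python
def max_chunks(limit_mb, text_kb_per_chunk=10.0, embedding_dims=1536, bytes_per_float=4):
    """Estimate how many chunks fit under a storage limit.

    Counts the chunk text plus one embedding vector per chunk and
    ignores index, metadata, and row overhead (assumption).
    """
    limit_bytes = limit_mb * 1000 * 1000
    text_bytes = text_kb_per_chunk * 1000
    vector_bytes = embedding_dims * bytes_per_float
    return int(limit_bytes // (text_bytes + vector_bytes))


# Text only (no vectors stored), using the ~10 KB per chunk figure.
print(max_chunks(500, embedding_dims=0))  # 50000

# Text plus an assumed 1536-dim float32 vector (6144 bytes) per chunk.
print(max_chunks(500))  # 30971
```

Under these assumptions the vector is more than half the size of the chunk text, so stripping stopwords from the text alone would recover only part of the space.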

Nyumat · 2024-10-28T06:34:54Z

Oh my goodness, thank you, Cyrus, for doing the research and finding that. I really appreciate it! Especially because that 500 MB limit is definitely something!!! 😅

We DON'T want chunk storage to be a bottleneck, especially if people want to upload lecture videos or other multimodal content in the future. I think Pinecone, or one of the other options mentioned, would be a better fit since there are fewer scale concerns, and we can easily add metadata filters for each course to handle everything more efficiently.
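
As a rough illustration of the per-course metadata filter idea, here is an in-memory sketch in plain Python. It does not use the Pinecone SDK; the record layout, the `course` metadata key, the `search` helper, and the record IDs are all hypothetical, and the score is a plain dot product over toy 2-dimensional vectors.

```python
def dot(a, b):
    """Dot-product score: higher means a closer match."""
    return sum(x * y for x, y in zip(a, b))


def search(records, query_vector, course, top_k=2):
    """Filter by course metadata first, then rank what is left by similarity."""
    candidates = [r for r in records if r["metadata"]["course"] == course]
    ranked = sorted(candidates, key=lambda r: dot(query_vector, r["vector"]), reverse=True)
    return ranked[:top_k]


# Hypothetical records; real vectors would come from an embeddings model.
records = [
    {"id": "cs161-syllabus", "vector": [0.9, 0.1], "metadata": {"course": "CS161"}},
    {"id": "cs374-lecture-1", "vector": [0.8, 0.2], "metadata": {"course": "CS374"}},
    {"id": "cs374-lecture-2", "vector": [0.1, 0.9], "metadata": {"course": "CS374"}},
]

results = search(records, [1.0, 0.0], course="CS374")
ids = [r["id"] for r in results]
print(ids)  # ['cs374-lecture-1', 'cs374-lecture-2']
```

Note that the CS161 record scores highest on raw similarity but never shows up, because the filter runs before ranking. That is the behavior we would want so one course's material doesn't leak into another course's answers.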

Let's chat about this at our next meeting. Lots of interesting stuff here.

s2xon · 2024-10-28T20:34:17Z

@Nyumat Are we looking to use JS for the entire project? Would something like Go be considered at all?

Nyumat · 2024-10-29T02:47:18Z

> Are we looking to use JS for the entire project?

Yeah, ideally. Most of these AI SDKs have Python and TS/JS wrappers so that's why it's relevant here.

> Would something like Go be considered at all?

I'm not opposed, but what would we gain from it that, say, Python wouldn't provide? And where would we look to add it?

I've seen stuff like Gofast out and about, but it seems like the benefits we'd get from Go (speed, better gRPC libs, multi-threading) are out of scope for this project, at least.

Curious to hear what you think.

s2xon · 2024-10-29T16:33:55Z

Yea, makes sense. I mention Go just because it would be something different, and JS is always used for everything. The JS ecosystem has a new framework, runtime, library, etc. coming out every second 😂.

Nyumat · 2024-10-29T20:40:10Z

Lol, yeah agreed. Well, I'll tell you this.

When we did BeavsAI in the past, we had tons of memory inefficiencies that became a bottleneck once we tried to productionize the application.

I did some research and found there's a great LangChain alternative for Go for building these AI applications. If you can get a running proof-of-concept (I can assist if need be, or we can do it live during our Wednesday meeting), I'd be down to consider it, especially given all the performance we'd gain from using it compared to Python and JS.

cshafizadeh · 2024-11-01T20:45:22Z

Hey @Nyumat we didn't get a chance to talk at the meeting Wednesday. I think that the best step forward is Pinecone. We get 2GB of storage with the free tier and it looks pretty simple to set up. I'm gonna get to work on implementing the vector db this week, let me know if you have any questions or concerns.

Nyumat · 2024-11-02T01:11:31Z

Huge, I'll be on the lookout! And yeah, my bad, I couldn't make the meeting in person.

Nyumat · 2024-11-10T08:53:53Z

@s2xon what do you think of assisting me in building a Discord bot (in Go, as I've been enjoying using it again recently) to extend our current application's knowledge base?

Ideally, we'd create a Discord bot that first retrieves the existing messages in the CS Discord Server (~~formerly, church of evan~~) and from then on listens for new ones, using the guild's messages to create embeddings that could be matched against queries.

User — "Who's the best professor to take for CS374? My options are gambord, guyer, and brewster"

Bot — "Based on previous conversations in this server, here's some feedback about your options:

Gambord: Known for clear explanations and detailed lectures, but exams can be challenging.

Guyer: Students often mention that he is approachable and provides good support outside class, though his lectures may sometimes be fast-paced.

Brewster: Generally well-liked for practical assignments and real-world examples, but can be strict on deadlines.
Sources: msg-link, msg-link"
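
The "retrieve first, then listen" flow could look roughly like the sketch below. It is written in Python only to keep it short (the bot itself could be in Go), and nothing here is a real Discord or embeddings API: `Message`, `MessageIndex`, and `fake_embed` are all hypothetical stand-ins. The point is the shape: a one-time backfill and a live handler share one `add` path, messages are de-duplicated by ID, and each stored record keeps its link so answers can cite sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    id: str
    author: str
    content: str
    link: str  # kept so the bot can list "Sources" in its answer


class MessageIndex:
    def __init__(self, embed):
        self._embed = embed    # function: text -> vector (assumed to be supplied)
        self._records = {}     # message id -> stored record

    def add(self, message):
        """Embed and store a message once; return True if it was newly added."""
        if message.id in self._records or not message.content.strip():
            return False       # skip duplicates and empty messages
        self._records[message.id] = {
            "vector": self._embed(message.content),
            "content": message.content,
            "link": message.link,
        }
        return True

    def backfill(self, history):
        """One-time pass over existing messages; returns how many were added."""
        return sum(self.add(m) for m in history)

    def __len__(self):
        return len(self._records)


def fake_embed(text):
    # Placeholder only: NOT a real embedding, just enough to run the sketch.
    return [float(len(text)), float(text.count(" "))]


index = MessageIndex(fake_embed)
history = [
    Message("1", "user-a", "Guyer is approachable", "msg-link-1"),
    Message("2", "user-b", "   ", "msg-link-2"),                     # empty, skipped
    Message("1", "user-a", "Guyer is approachable", "msg-link-1"),   # duplicate, skipped
]
added = index.backfill(history)

# After the backfill, every new message event would go through the same add() call.
was_new = index.add(Message("3", "user-c", "Brewster is strict on deadlines", "msg-link-3"))
```

Sharing one `add` path means a message seen during both the backfill and the live listener is only embedded once.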

Nyumat · 2024-11-18T20:57:37Z

Sadge ghosted

Nyumat added this to BeavsAI v2 Oct 22, 2024

Nyumat converted this from a draft issue Oct 22, 2024

Nyumat added the blocking (Requires completion of another task) and advanced (Great for experienced developers) labels Oct 22, 2024

Nyumat assigned cshafizadeh Nov 4, 2024

Nyumat linked a pull request Nov 9, 2024 that will close this issue

feat: full-stack RAG pipeline for PDF ingestion and contextual chat capabilities #28

Merged

owenkrause closed this as completed in #28 Nov 18, 2024

github-project-automation bot moved this from In Progress to Done in BeavsAI v2 Nov 18, 2024
