Research and set up vector database system #3

Closed
Nyumat opened this issue Oct 22, 2024 · 12 comments · Fixed by #28
Assignees
Labels
advanced Great for experienced developers blocking Requires completion of another task

Comments

@Nyumat
Member

Nyumat commented Oct 22, 2024

Vector databases are at the core of how ML applications retrieve relevant information. I could talk about it a lot more in-depth, but I recommend you read the linked article from Pinecone, the leading vector DB solution, if you'd like to know more.

As for our options, there are many, many tools we can use here, so I'll list them out:

Pinecone
Supabase
Dewy

I could go on and on, but I found this nice comparison on reddit of the popular options.

[Image: comparison of popular vector database options, from Reddit]

Some others include Astra and Momento.

@Nyumat Nyumat converted this from a draft issue Oct 22, 2024
@Nyumat Nyumat added blocking Requires completion of another task advanced Great for experienced developers labels Oct 22, 2024
@cshafizadeh
Contributor

Options:

  1. pgvector (PostgreSQL + pgvector extension): SQL is popular and easy to learn, so it should be widely accessible. Postgres also has no DB size limit; however, the DB would have to be self-hosted. https://github.com/pgvector/pgvector
  2. Pinecone: Cloud-based, with documentation and setup steps for a vector DB (easy to learn). The free plan has a 2GB database size limit. Includes built-in distance metrics (Euclidean, cosine, dot product).
  3. ChromaDB: Open-source vector DB, similar to pgvector but more specialized for vector workloads. Has metadata support, allowing context to be attached to embeddings.

| Feature | pgvector (PostgreSQL + pgvector) | Pinecone | ChromaDB |
| --- | --- | --- | --- |
| Purpose | General-purpose SQL DB with vector support | Dedicated vector database | Vector-focused database |
| Querying | SQL + vector similarity (cosine, inner product, L2) | Vector similarity search with optimized ANN | Vector similarity, metadata filtering |
| Deployment Complexity | Moderate; requires PostgreSQL setup and tuning | Easy setup on managed cloud platform | Lightweight, minimal setup |
| Scalability | Highly scalable, enterprise-ready | High scalability, serverless | Good for small to mid-scale applications |
| Performance | Slower for high-frequency vector-only searches | Optimized for high-performance vector search | Fast for vector-focused workloads |
| Storage Options | Fully durable, transactional SQL storage | Managed cloud storage | In-memory, file-based (DuckDB + Parquet) |
| Metadata Support | Yes, through PostgreSQL’s JSONB fields | Yes, with filtering options | Yes, with filtering options |
| Best Use Case | Mixed data needs (SQL + vectors), enterprise workloads | Large-scale, high-performance vector workloads | Vector search with lightweight persistence |
| Open Source | Yes, with PostgreSQL (pgvector extension) | No, proprietary | Yes |
| Cost | Depends on PostgreSQL deployment | Paid, with free tier for limited use | Free and open-source |
| Dimension Limits | Handles high dimensions but may need optimization | High-dimensional support | High-dimensional support |
| Support for Filtering | Advanced filtering, SQL-based queries | Supports metadata filtering | Supports metadata filtering |
| Management Interface | SQL client, pgAdmin, managed SQL services | Pinecone Console | CLI and simple API |

Note: An embeddings model will need to be chosen to go with the vector database.
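
For reference, here is a minimal sketch of the three distance metrics mentioned above (Euclidean/L2, cosine, dot product), written in plain Python. The function names and the toy 2-dimensional vectors are made up for illustration; a real vector DB computes these internally over much higher-dimensional embeddings.

```python
import math


def dot(a, b):
    """Dot (inner) product: larger means more similar, sensitive to vector length."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """L2 distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: ignores vector length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (illustrative only).
query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # points the same way as the query, but longer
orthogonal = [0.0, 3.0]       # unrelated direction

print(cosine_similarity(query, same_direction))  # direction matches exactly
print(euclidean_distance(query, same_direction)) # still nonzero: lengths differ
print(cosine_similarity(query, orthogonal))      # no directional overlap
```

Which metric to use usually depends on the embeddings model, so it makes sense to pick the metric together with the model.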

@Nyumat
Member Author

Nyumat commented Oct 26, 2024

Nice, thanks for listing this out.

What do you think of using pgvector and supabase? So we get the best of both worlds? A customizable system for storing vectors, and a managed instance of the database for easy setup?

Check out this

It seems like all new Supabase DBs support HNSW vector indexes now too. Idk, what's your take? Pinecone and Chroma would be nice as well.

What we really want is a system that lets us store the most vectors with the least effort.
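
To make the pgvector + HNSW idea concrete, here is a rough sketch of the SQL such a setup might use, wrapped in small Python helpers so it can be pasted into a script. The table name `chunks`, its columns, and the `%s` query-vector placeholder (psycopg-style) are assumptions for illustration; the `vector` type, the `hnsw` index method, `vector_cosine_ops`, and the `<=>` cosine-distance operator come from pgvector itself.

```python
def pgvector_setup_sql(table: str, dims: int) -> list[str]:
    """Return the statements to enable pgvector and create an HNSW-indexed table.

    `table` and the column names are hypothetical; `dims` must match the
    output size of whichever embeddings model gets chosen.
    """
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"CREATE TABLE {table} ("
        f"id bigserial PRIMARY KEY, content text, embedding vector({dims}));",
        # HNSW index using cosine distance.
        f"CREATE INDEX ON {table} USING hnsw (embedding vector_cosine_ops);",
    ]


def nearest_chunks_sql(table: str, limit: int) -> str:
    """Return a nearest-neighbour query; <=> is pgvector's cosine distance."""
    # %s is a placeholder for the query embedding, to be bound by the DB driver.
    return (
        f"SELECT id, content FROM {table} "
        f"ORDER BY embedding <=> %s LIMIT {limit};"
    )


for statement in pgvector_setup_sql("chunks", 1536):
    print(statement)
print(nearest_chunks_sql("chunks", 5))
```

The 1536 here is only an example dimension, not a decision about the model.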

@cshafizadeh
Contributor

That's a pretty good idea! I think that Supabase combined with pgvector would be a really powerful combo. It would allow us to use SQL, reducing the learning curve of the DB, and Supabase would allow us to host it in the cloud and have a central DB instead of individual ones. HOWEVER, the database size on the free plan of Supabase is only 500MB (https://supabase.com/pricing). If the goal is a system to store the most vectors, then we would need to know roughly how many documents we plan on uploading and whether that would be under 500MB. For reference, uploading a chunk of 2000 words would take up ~10 KB. We could possibly filter out stopwords from the chunks to make them smaller, but we would need to test whether that affects the context given to the LLM at runtime. Supabase does have really good integration with pgvector though, and I like the support for the new HNSW indexes.

I think the supabase + pgvector combo is the best pick for the project as long as we don't think we are going to go over 500mb of chunks.
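
A quick back-of-the-envelope sketch of what those limits mean in chunks, using the ~10 KB per 2000-word chunk figure above. The helper name, the decimal MB/KB conversion, and the optional embedding size (float32 values per dimension) are assumptions; it ignores index and row overhead, so real capacity would be lower.

```python
def chunks_that_fit(limit_mb, chunk_kb=10.0, embedding_dims=0, bytes_per_float=4):
    """Estimate how many chunks fit under a storage limit.

    limit_mb:       plan limit in (decimal) megabytes, e.g. 500 for Supabase free
    chunk_kb:       text size per chunk, ~10 KB for 2000 words per the estimate above
    embedding_dims: optional; adds the raw size of one embedding vector per chunk
    """
    limit_bytes = limit_mb * 1_000_000
    per_chunk_bytes = chunk_kb * 1_000 + embedding_dims * bytes_per_float
    return int(limit_bytes // per_chunk_bytes)


# Text only, at the two free-tier limits discussed in this thread.
print("500 MB:", chunks_that_fit(500))
print("2 GB:  ", chunks_that_fit(2000))

# Same limit, also counting a hypothetical 1536-dimension float32 embedding per chunk.
print("500 MB with embeddings:", chunks_that_fit(500, embedding_dims=1536))
```

Text-only, 500 MB works out to roughly 50,000 chunks of this size, so the limit mostly matters once embeddings, indexes, or larger media enter the picture.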

@Nyumat
Member Author

Nyumat commented Oct 28, 2024

> That's a pretty good idea! I think that Supabase combined with pgvector would be a really powerful combo. It would allow us to use SQL, reducing the learning curve of the DB, and Supabase would allow us to host it in the cloud and have a central DB instead of individual ones. HOWEVER, the database size on the free plan of Supabase is only 500MB (https://supabase.com/pricing). If the goal is a system to store the most vectors, then we would need to know roughly how many documents we plan on uploading and whether that would be under 500MB. For reference, uploading a chunk of 2000 words would take up ~10 KB. We could possibly filter out stopwords from the chunks to make them smaller, but we would need to test whether that affects the context given to the LLM at runtime. Supabase does have really good integration with pgvector though, and I like the support for the new HNSW indexes.
>
> I think the supabase + pgvector combo is the best pick for the project as long as we don't think we are going to go over 500mb of chunks.

Oh my goodness, thank you, Cyrus, for doing the research and finding that. I really appreciate it! Especially because that 500MB limit is definitely something!!! 😅

We DON'T want chunk storage to be a bottleneck, especially if people want to upload lecture videos or other multimodal content in the future. I think Pinecone, or one of the other options mentioned, would be a better fit since there are fewer scale concerns, and we can easily add metadata filters for each course to handle everything more efficiently.
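
As a rough illustration of the per-course metadata filter idea, here is an in-memory sketch: filter records by a `course` metadata field first, then rank what is left by cosine similarity. The record layout, the `search` helper, the course IDs, and the toy vectors are all hypothetical; a managed vector DB would do the equivalent server-side through its own filter syntax.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(records, query_vector, course, top_k=2):
    """Return ids of the top_k records for one course, most similar first."""
    # Step 1: metadata filter, so other courses never compete for the top spots.
    candidates = [r for r in records if r["metadata"]["course"] == course]
    # Step 2: rank the remaining records by similarity to the query.
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["vector"], query_vector),
        reverse=True,
    )
    return [r["id"] for r in ranked[:top_k]]


# Hypothetical records with toy 2-dimensional embeddings.
records = [
    {"id": "cs161-syllabus", "vector": [1.0, 0.0], "metadata": {"course": "CS161"}},
    {"id": "cs161-lecture-1", "vector": [0.6, 0.8], "metadata": {"course": "CS161"}},
    {"id": "cs374-lecture-1", "vector": [1.0, 0.0], "metadata": {"course": "CS374"}},
]

print(search(records, [1.0, 0.0], course="CS161"))
```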

Let's chat about this at our next meeting. Lots of interesting stuff here.

@s2xon

s2xon commented Oct 28, 2024

@Nyumat Are we looking to use JS for the entire project? Would something like Go be considered at all?

@Nyumat
Member Author

Nyumat commented Oct 29, 2024

> @Nyumat Are we looking to use JS for the entire project? Would something like Go be considered at all?

> Are we looking to use JS for the entire project?

Yeah, ideally. Most of these AI SDKs have Python and TS/JS wrappers so that's why it's relevant here.

> Would something like Go be considered at all?

I'm not opposed, but what would we gain from it that, say, Python wouldn't provide? And where would we look to add it?

I've seen stuff like Gofast out and about, but it seems like the benefits we'd get from Go (speed, better gRPC libs, multi-threading) are out of scope for this project, at least.

Curious to hear what you think.

@s2xon

s2xon commented Oct 29, 2024

> > @Nyumat Are we looking to use JS for the entire project? Would something like Go be considered at all?
>
> > Are we looking to use JS for the entire project?
>
> Yeah, ideally. Most of these AI SDKs have Python and TS/JS wrappers so that's why it's relevant here.
>
> > Would something like Go be considered at all?
>
> I'm not opposed, but what would we gain from it that, say, Python wouldn't provide? And where would we look to add it?
>
> I've seen stuff like Gofast out and about, but it seems like the benefits we'd get from Go (speed, better gRPC libs, multi-threading) are out of scope for this project, at least.
>
> Curious to hear what you think.

Yea, makes sense. I mentioned Go just because it would be different, and JS is always used for everything. The JS ecosystem has a new framework, runtime, library, etc. come out every second 😂.

@Nyumat
Member Author

Nyumat commented Oct 29, 2024

Lol, yeah agreed. Well, I'll tell you this.

When we did BeavsAI in the past, we had tons of memory inefficiencies that became a bottleneck once we tried to productionize the application.

I did some research and found there's a great LangChain alternative for Go, used to build these AI applications. If you can get a running proof-of-concept (I can assist if need be as well, or we can do it live during our Wednesday meeting), I'd be down to consider it, especially given all the performance we'd gain from using it compared to Python and JS.

@cshafizadeh
Contributor

Hey @Nyumat we didn't get a chance to talk at the meeting Wednesday. I think that the best step forward is Pinecone. We get 2GB of storage with the free tier and it looks pretty simple to set up. I'm gonna get to work on implementing the vector db this week, let me know if you have any questions or concerns.

@Nyumat
Member Author

Nyumat commented Nov 2, 2024

Huge, I'll be on the lookout! And yeah, my bad, I couldn't make the meeting in person.

@Nyumat
Member Author

Nyumat commented Nov 10, 2024

@s2xon what do you think of assisting me in building a Discord bot (in Go, as I've been enjoying using it again recently) to extend our current application's knowledge base?

Ideally, we'd create a Discord bot that first retrieves existing messages and from then on listens for new ones in the CS Discord Server (formerly church of evan), using the guild's messages to create embeddings that could be used as query matches.

User: "Who's the best professor to take for CS374? My options are gambord, guyer, and brewster"

Bot: "Based on previous conversations in this server, here's some feedback about your options:

  • Gambord: Known for clear explanations and detailed lectures, but exams can be challenging.
  • Guyer: Students often mention that he is approachable and provides good support outside class, though his lectures may sometimes be fast-paced.
  • Brewster: Generally well-liked for practical assignments and real-world examples, but can be strict on deadlines.

    Sources: msg-link, msg-link"
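
The retrieval half of that bot could be sketched roughly as below: embed stored messages, score them against the question, and return the best matches along with their links as sources. Everything here is a stand-in. The word-count `embed` function replaces a real embeddings model, the message dicts and `msg-link-N` values are invented, and nothing talks to Discord or an LLM (the sketch is in Python for brevity, not Go).

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(messages, question, top_k=2):
    """Return up to top_k messages that overlap with the question, best first."""
    q = embed(question)
    scored = [(cosine_similarity(embed(m["text"]), q), m) for m in messages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop messages with no overlap so unrelated chatter is never cited.
    return [m for score, m in scored[:top_k] if score > 0]


# Hypothetical stored guild messages.
messages = [
    {"text": "the cs374 exams were hard for me", "link": "msg-link-1"},
    {"text": "anyone selling a bike", "link": "msg-link-2"},
    {"text": "cs374 had practical assignments", "link": "msg-link-3"},
]

matches = retrieve(messages, "who should i take for cs374")
print("Sources:", ", ".join(m["link"] for m in matches))
```

In a full version, the matched messages would be passed to the LLM as context and the links appended to its answer as the "Sources" line.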

@github-project-automation github-project-automation bot moved this from In Progress to Done in BeavsAI v2 Nov 18, 2024
@Nyumat
Member Author

Nyumat commented Nov 18, 2024

Sadge ghosted

Projects
Status: Done
3 participants