Optimize pgvector test for semi-recent enhancements #319
/assign @alwayslove2013
@XuanYang-cn @alwayslove2013 Please let us know if this PR requires additional work. There are some other changes we'd like to include for testing other configurations of pgvector, but we'd like to baseline it against the flat implementation first. Thanks!
from abc import abstractmethod
from typing import Any, Mapping, Optional, Sequence, TypedDict

from psycopg import sql
I recommend not adding client-specific dependencies to config.py. Users only install the corresponding toolkit when they are conducting tests.
However, when the default results page is opened solely to display results, VDBBench loads the standardized result (json). Serializing this data requires the config.py file from all clients.
Currently, if a user hasn't installed psycopg, they won't be able to open the results page.
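One common way to address this kind of concern is to defer the driver import until a test actually needs it. The sketch below is illustrative only: `PgVectorConfigSketch`, `to_dict`, and `table_identifier` are hypothetical names, not the code in this PR or in VectorDBBench. It assumes the config only needs plain data for serialization and needs `psycopg.sql` only when composing a statement.

```python
from typing import Any


class PgVectorConfigSketch:
    """Hypothetical config class; not the actual VectorDBBench code."""

    def __init__(self, table_name: str = "items") -> None:
        self.table_name = table_name

    def to_dict(self) -> dict[str, Any]:
        # Plain data only, so a results page can load this without psycopg.
        return {"table_name": self.table_name}

    def table_identifier(self) -> Any:
        # Deferred import: psycopg is required only when a test runs,
        # not when this module is imported.
        from psycopg import sql

        return sql.Identifier(self.table_name)
```

With this layout, importing the module and calling `to_dict()` works on a machine without psycopg installed; only `table_identifier()` would raise `ImportError` there.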
This is fixed in ea29f47
@alwayslove2013 Thanks for the feedback! This is resolved in the latest push. Overall, I would suggest moving to psycopg3 ( |
@jkatz Thank you so much for your contribution! We greatly appreciate it and are thrilled to receive your pull request. We look forward to collaborating with you and driving the project forward together!
This commit adds several changes to the pgvector test to create a more representative test environment based on recent and older changes to pgvector. Notable changes include allowing for testing of parallel index building parameters, loading data with the recommended binary loading method, and other changes to better emulate what a typical user of pgvector would do.
This commit also includes some general cleanups.
Co-authored-by: Mark Greenhalgh <greenhal@users.noreply.github.com>
Co-authored-by: Tyler House <tahouse@users.noreply.github.com>
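As a rough illustration of the two techniques the commit message names, the sketch below builds the session settings that govern a parallel index build and loads rows with a binary `COPY` through psycopg 3. It is not the code from this PR. The table name `items`, the column names, and the default values are assumptions chosen for the example, and `binary_copy` has not been run against a live database here.

```python
from typing import Iterable, Sequence


def index_build_statements(
    table: str = "items",            # assumed table name
    column: str = "embedding",       # assumed column name
    parallel_workers: int = 7,       # assumed value, tune per machine
    maintenance_work_mem: str = "8GB",
    m: int = 16,
    ef_construction: int = 64,
) -> list[str]:
    # Plain string formatting is for illustration with trusted values only;
    # real code should compose identifiers with psycopg.sql.
    return [
        f"SET max_parallel_maintenance_workers = {parallel_workers}",
        f"SET maintenance_work_mem = '{maintenance_work_mem}'",
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_l2_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction})",
    ]


def binary_copy(conn, rows: Iterable[tuple[int, Sequence[float]]]) -> None:
    """Load (id, embedding) rows with COPY ... FORMAT BINARY.

    `conn` is assumed to be an open psycopg 3 connection to a database
    with the vector extension and an `items (id, embedding)` table.
    """
    import numpy as np
    from pgvector.psycopg import register_vector

    register_vector(conn)  # teaches psycopg how to dump vector values
    with conn.cursor() as cur:
        copy_sql = "COPY items (id, embedding) FROM STDIN WITH (FORMAT BINARY)"
        with cur.copy(copy_sql) as copy:
            # Binary COPY needs explicit column types.
            copy.set_types(["int8", "vector"])
            for row_id, embedding in rows:
                copy.write_row((row_id, np.asarray(embedding, dtype=np.float32)))
```

The statements returned by `index_build_statements()` would be executed in order on the same session, since both `SET` commands apply only to the connection that issues them.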
@alwayslove2013 Likewise. I personally appreciate the approach VectorDBBench takes to testing concurrency, which resembles how users interact with databases. I've pushed a fix to the latest patch to handle the merge conflict that remained (I'm still baffled how that got in, but I'll triple check next time).
/approve
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: alwayslove2013, jkatz The full list of commands accepted by this bot can be found here.
@jkatz I would like to express my sincere gratitude for your support. Our primary goal has been to ensure that the test data reflects the performance characteristics of the real-world usage scenarios as accurately as possible.
If you have any suggestions or innovative ideas for VDBBench, we would be more than happy to discuss them with you. Your input is crucial for us to enhance the functionality and user experience of the tool.
@jkatz I'm trying to see how we can optimize the pgvector and pgvecto.rs client further and one thought I had was to use |
This benchmark is pretty simple: run the query, get the result, compare the result. To improve performance with stored procedures, you would have to either reduce network round trips or reduce the data transmitted, and neither is much of an overhead in this benchmark. HammerDB does both by using stored procedures, because a TPC-C benchmark can involve a lot of network traffic. Take new order transactions, the main TPC-C performance measurement: with all of the logic on the client, creating one order could take up to 6 round trips. With stored procedures it takes 1, or even less, because HammerDB can send a single request that says "create 100 orders". With that in mind, there could be 2 ways to improve pgvector's performance, or any engine's performance, both of which could be implemented with stored procedures.
Both would require a pretty significant change to VectorDBBench, would be engine specific, and would not represent a real-world use case.
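The round-trip argument above can be made concrete with a little arithmetic. The sketch below only counts network round trips, using the figures from the comment (up to 6 per order on the client, 1 per stored procedure call, 100 orders per batched request). The function names are hypothetical and nothing here measures a real database.

```python
import math


def client_side_round_trips(orders: int, trips_per_order: int = 6) -> int:
    # All transaction logic on the client: every step is its own round trip.
    return orders * trips_per_order


def stored_procedure_round_trips(orders: int, batch_size: int = 1) -> int:
    # One call per batch; batch_size=1 means one call per order.
    return math.ceil(orders / batch_size)


orders = 100
print(client_side_round_trips(orders))                       # 600
print(stored_procedure_round_trips(orders))                  # 100
print(stored_procedure_round_trips(orders, batch_size=100))  # 1
```

A vector search in this benchmark is already a single query per search, so by the same count there is little left for a stored procedure to save.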