Filter duplicate vectors when pruning vectors #5397
Comments
I just ran into this issue today when I tried to prune a previously pruned vocabulary. I wonder if the fix could consist of forming a mask over the indices/keys that picks out the first n_row unique rows, and then using the inverted mask to pick out the rest. Not deeply tested, just a rough idea. It could be implemented like the following.
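As a rough illustration of the masking idea (a sketch under stated assumptions, not spaCy's implementation), the following uses NumPy to mark the first occurrence of each of the first `n_rows` distinct row indices and then inverts that mask to get the rest. The function name `split_unique_first` and the sample array are hypothetical; `indices` is assumed to hold the row index of each key, already sorted by key priority.

```python
import numpy as np

def split_unique_first(indices, n_rows):
    """Return (keep_mask, rest_mask) over `indices` (hypothetical helper).

    keep_mask marks the first occurrence of each of the first `n_rows`
    distinct row indices, in priority order; rest_mask is its inverse.
    """
    indices = np.asarray(indices)
    # np.unique reports the position of the first occurrence of each value.
    _, first_pos = np.unique(indices, return_index=True)
    # Restore priority order, then keep only the first n_rows distinct rows.
    first_pos = np.sort(first_pos)[:n_rows]
    keep_mask = np.zeros(len(indices), dtype=bool)
    keep_mask[first_pos] = True
    return keep_mask, ~keep_mask

# Assumed sample data: rows 0 and 2 are each shared by more than one key.
indices = np.array([0, 0, 2, 1, 2, 3])
keep, rest = split_unique_first(indices, n_rows=3)
kept_rows = indices[keep]   # distinct rows to keep
remaining = indices[rest]   # entries whose keys would need remapping
print(kept_rows.tolist())   # [0, 2, 1]
print(remaining.tolist())   # [0, 2, 3]
```

Note that the inverted mask still contains entries pointing at kept rows (here 0 and 2); those keys already have a surviving vector and would only need their mapping preserved, not a nearest-neighbour lookup.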
Sorry this didn't work as expected, and thanks for the suggestion! This issue is kind of low priority on our end right now, but we'll try to come back to it when we find a bit of time.
How to reproduce the behaviour
When prioritizing vectors to keep, `Vocab.prune_vectors` doesn't handle existing duplicates from `key2row` well. By sorting/prioritizing by values from `key2row`, which may contain duplicate values, `prune_vectors` may keep multiple copies of the same vector.

Fix:
- […] `indices` and adjust `keys` accordingly
- […] `key2row` keys (which is not compatible with the keys truncation from "Fix most_similar for vectors with unused rows" (#5348), which is overly simple; I think you have to re-add the duplicate rows to the vectors after initialization with `Vectors.add(row=)`)
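To make the problem concrete, here is a small pure-Python sketch that does not call spaCy. The `key2row` dictionary below is a hypothetical stand-in for the real mapping, and the keys are assumed to be already ordered by priority. It contrasts taking the first `n_rows` row values directly with taking the first `n_rows` distinct row values.

```python
# Hypothetical key -> row mapping; "cat" and "kitty" share row 0,
# as can happen after an earlier prune.
key2row = {"cat": 0, "kitty": 0, "dog": 1, "fish": 2}
priority = ["cat", "kitty", "dog", "fish"]  # assumed already sorted by priority
n_rows = 2

# Naive selection: take the rows of the first n_rows keys, duplicates included.
naive_rows = [key2row[k] for k in priority][:n_rows]

# Deduplicated selection: keep each row once, in priority order.
seen = set()
unique_rows = []
for key in priority:
    row = key2row[key]
    if row not in seen:
        seen.add(row)
        unique_rows.append(row)
unique_rows = unique_rows[:n_rows]

print(naive_rows)   # [0, 0]  -> the same vector is kept twice
print(unique_rows)  # [0, 1]  -> two distinct vectors are kept
```

In the naive case both retained slots hold the same vector, so the pruned table effectively contains one fewer usable row than requested.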