-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hub integration #58
base: main
Are you sure you want to change the base?
Add hub integration #58
Conversation
- Updated README.md to clarify that items can be strings or other serializable objects. - Modified the Vicinity class to accept a broader range of item types by changing type hints from `str` to `Any` in several methods. - Enhanced the insert and delete methods to handle non-string tokens appropriately, ensuring that items can be checked and managed regardless of their type.
… and evaluating backends
- Simplified the logic for checking and appending tokens in the insert method, ensuring that duplicate tokens are properly managed.
- Updated the `items` fixture to return a mix of dictionaries and strings based on index parity. - Modified `test_vicinity_insert_duplicate` to use the updated `items` fixture for inserting items. - Adjusted `test_vicinity_delete_and_query` to reference items by their indices instead of hardcoded values. - Enhanced the Vicinity class to streamline token management, ensuring proper handling of duplicates and improving error messaging for token deletions.
Co-authored-by: Stephan Tulkens <stephantul@gmail.com>
…ling - Replaced the nested loop for checking duplicates with a single extend operation for tokens. - Improved efficiency by directly appending tokens to the items list, ensuring proper management of duplicates.
…ling - Replaced the nested loop for token matching with a more efficient list comprehension. - Enhanced error messaging to specify which tokens were not found in the vector space.
- Added a try-except block around the JSON serialization process to catch JSONEncodeError.
- Introduced a new pytest fixture `non_serializable_items` that generates a list of non-serializable objects for testing. - Added a test case `test_vicinity_save_and_load_non_serializable_items` to verify that saving a Vicinity instance with non-serializable items raises a JSONEncodeError. - Updated the Vicinity class documentation to specify that JSONEncodeError may be raised if items are not serializable.
- Introduced HuggingFaceMixin to enable saving and loading Vicinity instances to/from Hugging Face Hub - Added optional import of HuggingFaceMixin based on huggingface_hub and datasets library availability - Implemented methods for pushing Vicinity instances to the Hub, including dataset and metadata upload - Created a method to load Vicinity instances from Hugging Face repositories
Codecov ReportAttention: Patch coverage is
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice integration, would be awesome to have this in Vicinity 🎉 Left some minor feedback
) | ||
|
||
# DatasetCard | ||
DatasetCard( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would move this to a file dataset_card_template.md, bit cleaner imo
""" | ||
Save the Vicinity instance to the Hugging Face Hub. | ||
|
||
Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion but the docstrings are inconsistent (load_from_hub
uses the docstring format that we usually use).
|
||
:param repo_id: The repository ID on the Hugging Face Hub. | ||
:param token: Optional authentication token for private repositories. | ||
:param kwargs: Additional arguments passed to load_dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:param kwargs: Additional arguments passed to load_dataset. | |
:param **kwargs: Additional arguments passed to load_dataset. |
Otherwise pre-commit will complain
@@ -0,0 +1,162 @@ | |||
import json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add from __future__ import annotations
so | is supported for typing (python 3.9 support)
|
||
|
||
class HuggingFaceMixin: | ||
def save_to_hub( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just calls push_to_hub right? Is there a specific reason for not directly calling push_to_hub, or having the logic live in this function?
@@ -19,8 +20,13 @@ | |||
|
|||
logger = logging.getLogger(__name__) | |||
|
|||
if importlib.util.find_spec("huggingface_hub") is not None and importlib.util.find_spec("datasets") is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be an optional dependency group imo (can be huggingface
, or integrations
, or both and integration
installs all integrations). Right now, I think if someone has not installed these libs and tries to run vicinity.save_to_hub()
, it will just crash, it would be nicer if there's a specific error thrown to show that the optional dependency is not installed. Similar to this: https://github.com/MinishLab/model2vec/blob/main/model2vec/distill/__init__.py#L5
Also feel free to add an example to the main README (e.g. in the quickstart there's a saving/loading part, it can be added under there) |
I added a basic integration for the Hub.
Resulting in https://huggingface.co/datasets/davidberenstein1957/my-vicinity-repo
You can also find datasets here: https://huggingface.co/datasets?other=vicinity