Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hub integration #58

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

davidberenstein1957
Copy link
Contributor

@davidberenstein1957 davidberenstein1957 commented Jan 28, 2025

I added a basic integration for the Hub.

import numpy as np

from vicinity import Backend, Metric, Vicinity

# Create some dummy data
items = [
    {"name": "triforce", "id": 0},
    {"name": "master sword", "id": 1},
    {"name": "hylian shield", "id": 2},
    {"name": "boomerang", "id": 3},
    {"name": "hookshot", "id": 4},
]
vectors = np.random.rand(len(items), 128)

# Initialize the Vicinity instance (using the basic backend and cosine metric)
vicinity = Vicinity.from_vectors_and_items(vectors=vectors, items=items)
vicinity.push_to_hub("davidberenstein1957/my-vicinity-repo")
vicinity = Vicinity.load_from_hub("davidberenstein1957/my-vicinity-repo")

Resulting in https://huggingface.co/datasets/davidberenstein1957/my-vicinity-repo
You can also find datasets here: https://huggingface.co/datasets?other=vicinity

davidberenstein1957 and others added 11 commits January 20, 2025 17:22
- Updated README.md to clarify that items can be strings or other serializable objects.
- Modified the Vicinity class to accept a broader range of item types by changing type hints from `str` to `Any` in several methods.
- Enhanced the insert and delete methods to handle non-string tokens appropriately, ensuring that items can be checked and managed regardless of their type.
- Simplified the logic for checking and appending tokens in the insert method, ensuring that duplicate tokens are properly managed.
- Updated the `items` fixture to return a mix of dictionaries and strings based on index parity.
- Modified `test_vicinity_insert_duplicate` to use the updated `items` fixture for inserting items.
- Adjusted `test_vicinity_delete_and_query` to reference items by their indices instead of hardcoded values.
- Enhanced the Vicinity class to streamline token management, ensuring proper handling of duplicates and improving error messaging for token deletions.
Co-authored-by: Stephan Tulkens <stephantul@gmail.com>
…ling

- Replaced the nested loop for checking duplicates with a single extend operation for tokens.
- Improved efficiency by directly appending tokens to the items list, ensuring proper management of duplicates.
…ling

- Replaced the nested loop for token matching with a more efficient list comprehension.
- Enhanced error messaging to specify which tokens were not found in the vector space.
- Added a try-except block around the JSON serialization process to catch JSONEncodeError.
- Introduced a new pytest fixture `non_serializable_items` that generates a list of non-serializable objects for testing.
- Added a test case `test_vicinity_save_and_load_non_serializable_items` to verify that saving a Vicinity instance with non-serializable items raises a JSONEncodeError.
- Updated the Vicinity class documentation to specify that JSONEncodeError may be raised if items are not serializable.
- Introduced HuggingFaceMixin to enable saving and loading Vicinity instances to/from Hugging Face Hub
- Added optional import of HuggingFaceMixin based on huggingface_hub and datasets library availability
- Implemented methods for pushing Vicinity instances to the Hub, including dataset and metadata upload
- Created a method to load Vicinity instances from Hugging Face repositories
@codecov-commenter
Copy link

codecov-commenter commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
vicinity/vicinity.py 80.00% 1 Missing ⚠️
Files with missing lines Coverage Δ
vicinity/vicinity.py 97.98% <80.00%> (+0.31%) ⬆️

@davidberenstein1957 davidberenstein1957 marked this pull request as draft January 28, 2025 21:20
@Pringled Pringled self-requested a review February 15, 2025 08:17
Copy link
Member

@Pringled Pringled left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice integration, would be awesome to have this in Vicinity 🎉 Left some minor feedback

)

# DatasetCard
DatasetCard(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would move this to a file dataset_card_template.md, bit cleaner imo

"""
Save the Vicinity instance to the Hugging Face Hub.

Args:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion but the docstrings are inconsistent (load_from_hub uses the docstring format that we usually use).


:param repo_id: The repository ID on the Hugging Face Hub.
:param token: Optional authentication token for private repositories.
:param kwargs: Additional arguments passed to load_dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:param kwargs: Additional arguments passed to load_dataset.
:param **kwargs: Additional arguments passed to load_dataset.

Otherwise pre-commit will complain

@@ -0,0 +1,162 @@
import json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add from __future__ import annotations so | is supported for typing (python 3.9 support)



class HuggingFaceMixin:
def save_to_hub(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just calls push_to_hub right? Is there a specific reason for not directly calling push_to_hub, or having the logic live in this function?

@@ -19,8 +20,13 @@

logger = logging.getLogger(__name__)

if importlib.util.find_spec("huggingface_hub") is not None and importlib.util.find_spec("datasets") is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an optional dependency group imo (can be huggingface, or integrations, or both and integration installs all integrations). Right now, I think if someone has not installed these libs and tries to run vicinity.save_to_hub(), it will just crash, it would be nicer if there's a specific error thrown to show that the optional dependency is not installed. Similar to this: https://github.com/MinishLab/model2vec/blob/main/model2vec/distill/__init__.py#L5

@Pringled
Copy link
Member

Also feel free to add an example to the main README (e.g. in the quickstart there's a saving/loading part, it can be added under there)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants