diff --git a/dataset_preparation/amazon_products/readme.md b/dataset_preparation/amazon_products/readme.md
new file mode 100644
index 000000000..8e9a6e5a9
--- /dev/null
+++ b/dataset_preparation/amazon_products/readme.md
@@ -0,0 +1,5 @@
+This dataset contains around 2M vectors for Amazon products.
+The embeddings are generated using the Cohere embed-english-light-v3.0 model (https://huggingface.co/Cohere/Cohere-embed-english-light-v3.0).
+The base text used for generating each embedding is the product's title + description.
+The queries are derived from products sampled at random from the base set: after sampling, we prompt GPT-3.5 to output a simple query phrase for which the product is a suitable result, and embed that phrase using the same Cohere model.
+We also choose brands from the query's category and provide them as OR filters. The price of the sampled item is used as the reference point for a PRICE range filter.
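
The filter semantics described in the README can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Product` fields, the helper names, the 50% tolerance used to turn the sampled item's price into a range, and the choice to combine the brand and price filters with AND. The README specifies none of these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    # Hypothetical record layout; the real dataset schema may differ.
    title: str
    brand: str
    price: float


def price_range_around(price: float, tolerance: float = 0.5) -> tuple[float, float]:
    """Build a price range centered on the sampled item's price.

    The tolerance is an assumed parameter; the README only says the
    sampled price is "indicative" for the range.
    """
    return price * (1 - tolerance), price * (1 + tolerance)


def passes_filters(product: Product, brands: set[str], low: float, high: float) -> bool:
    """Check one product against the brand OR filter and the price range."""
    brand_ok = product.brand in brands       # OR filter: any listed brand matches
    price_ok = low <= product.price <= high  # inclusive range filter
    return brand_ok and price_ok             # assumed: the two filters are ANDed


# Tiny made-up catalog to show the filters in action.
catalog = [
    Product("Wireless mouse", "Acme", 18.0),
    Product("Gaming mouse", "Bolt", 25.0),
    Product("Ergonomic mouse", "Acme", 60.0),
    Product("Travel mouse", "Zed", 20.0),
]

low, high = price_range_around(20.0)
matches = [p.title for p in catalog if passes_filters(p, {"Acme", "Bolt"}, low, high)]
```

In an actual filtered search the predicate would restrict which vectors are eligible as nearest-neighbor results, rather than being applied to a Python list.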