Dataset | Object Num | Feature Vector Dim | Query Num | Type | Download (Vector) |
---|---|---|---|---|---|
SIFT1M | 1,000,000 | 128 | 10,000 | Image | sift.tar.gz (161MB) |
GIST1M | 1,000,000 | 960 | 1,000 | Image | gist.tar.gz (2.6GB) |
GloVe | 1,183,514 | 100 | 10,000 | Text | glove-100.tar.gz (424MB) |
Crawl | 1,989,995 | 300 | 10,000 | Text | crawl.tar.gz (1.7GB) |
Audio | 53,387 | 192 | 200 | Audio | audio.tar.gz (26MB) |
Msong | 992,272 | 420 | 200 | Audio | msong.tar.gz (1.4GB) |
Enron | 94,987 | 1369 | 200 | Text | enron.tar.gz (51MB) |
UQ-V | 1,000,000 | 256 | 10,000 | Video | uqv.tar.gz (800MB) |
Paper | 2,029,997 | 200 | 10,000 | Text | paper.tar.gz (1.41GB) |
BIGANN100M | 100,000,000 | 128 | 10,000 | Image | bigann100m.tar.gz (9.2GB) |
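NHQ adds attributes to every object in the base datasets, for example a date, location, and size for each image in SIFT1M, so that each object consists of a feature vector plus a set of attributes. The ground truth for the queries is then computed by brute force according to Definition 4 in the paper.
Links to the existing label files and the corresponding ground-truth files are given below.
All base objects and query objects are converted to fvecs format, and the ground-truth data is converted to ivecs format.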
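
For readers unfamiliar with the fvecs and ivecs layouts, the following is a minimal sketch of a reader and writer, not code from this repository. It assumes the usual layout in which every vector is stored as a little-endian 32-bit integer giving its dimension, followed by that many 4-byte components (float32 for fvecs, int32 for ivecs). The helper names `read_vecs` and `write_vecs` are hypothetical.

```python
import numpy as np


def read_vecs(path, dtype):
    """Read an .fvecs (dtype=np.float32) or .ivecs (dtype=np.int32) file.

    Assumed layout per vector: one int32 dimension, then `dim` 4-byte values.
    """
    raw = np.fromfile(path, dtype=np.int32)
    if raw.size == 0:
        return np.empty((0, 0), dtype=dtype)
    dim = int(raw[0])
    rows = raw.reshape(-1, dim + 1)
    # Every record must repeat the same dimension in its first slot.
    if not np.all(rows[:, 0] == dim):
        raise ValueError("inconsistent vector dimension in " + str(path))
    # Drop the dimension column and reinterpret the payload bytes as `dtype`.
    return rows[:, 1:].copy().view(dtype)


def write_vecs(path, array):
    """Write a 2-D float32 or int32 array in fvecs / ivecs layout."""
    array = np.ascontiguousarray(array)
    if array.ndim != 2 or array.dtype.itemsize != 4:
        raise ValueError("expected a 2-D array with a 4-byte dtype")
    n, dim = array.shape
    out = np.empty((n, dim + 1), dtype=np.int32)
    out[:, 0] = dim
    out[:, 1:] = array.view(np.int32)  # copy raw bytes, no numeric conversion
    out.tofile(path)
```

Under these assumptions, a base file would be loaded with `read_vecs(path, np.float32)` and a ground-truth file with `read_vecs(path, np.int32)`.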
Dataset | NHQ Attributes Download | NHQ Ground Truth Download |
---|---|---|
SIFT1M | sift_attribute.tar.gz | label_sift_groundtruth.ivecs |
GIST1M | gist_attribute.tar.gz | label_gist_groundtruth.ivecs |
GloVe | glove-100_attribute.tar.gz | label_glove_groundtruth.ivecs |
Crawl | crawl_attribute.tar.gz | label_crawl_groundtruth.ivecs |
Audio | audio_attribute.tar.gz | label_audio_groundtruth.ivecs |
Msong | msong_attribute.tar.gz | label_msong_groundtruth.ivecs |
Enron | enron_attribute.tar.gz | label_enron_groundtruth.ivecs |
UQ-V | uqv_attribute.tar.gz | label_uqv_groundtruth.ivecs |
Paper | paper_attribute.tar.gz | label_paper_groundtruth.ivecs |
BIGANN100M | bigann100m_attribute.tar.gz | label_bigann100m_groundtruth.ivecs |
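The attribute labels used in queries are single-label: each query carries exactly one label.
The ground truth is generated by `compute_groundtruth_for_filters` in the code.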
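
To illustrate what a single-label filtered ground truth contains, here is a minimal brute-force sketch. It is not the `compute_groundtruth_for_filters` implementation. It assumes squared L2 distance, one label per base vector, and padding with `-1` when fewer than `k` base vectors match. The function name `filtered_groundtruth` is hypothetical.

```python
import numpy as np


def filtered_groundtruth(base, base_labels, queries, query_labels, k):
    """Brute-force k nearest neighbours restricted to base vectors whose
    label equals the query's label (squared L2 distance, assumed metric).

    Returns an (n_queries, k) int64 array of base ids, padded with -1.
    """
    base = np.asarray(base, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    base_labels = np.asarray(base_labels)
    gt = np.full((len(queries), k), -1, dtype=np.int64)
    for qi, (q, label) in enumerate(zip(queries, query_labels)):
        # Ids of base vectors that pass the filter for this query.
        candidates = np.flatnonzero(base_labels == label)
        if candidates.size == 0:
            continue
        dists = np.sum((base[candidates] - q) ** 2, axis=1)
        order = np.argsort(dists, kind="stable")[:k]
        gt[qi, : order.size] = candidates[order]
    return gt
```

The loop is quadratic in the worst case, so it is only practical for checking small subsets; it is meant to show what the ground-truth file encodes, not to replace the tool.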
Dataset | Filtered-DiskANN Attributes Download |
---|---|
SIFT1M | sift_attribute.tar.gz |
GIST1M | gist_attribute.tar.gz |
GloVe | glove-100_attribute.tar.gz |
Audio | audio_attribute.tar.gz |
Msong | msong_attribute.tar.gz |
Paper | paper_attribute.tar.gz |
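These datasets carry no attribute labels; instead, the base data and the query data are of different kinds (for example, text queries against an image base in the text-to-image dataset).
1. text-to-image dataset: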
base.10M.fbin: https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/base.10M.fbin
query.train.10M.fbin: https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/query.learn.50M.fbin
query.10k.fbin: https://storage.yandexcloud.net/yandex-research/ann-datasets/T2I/query.public.100K.fbin
gt.10k.ibin: https://zenodo.org/records/11090378/files/t2i.gt.10k.ibin
2. clip-webvid-2.5M dataset:
base.2.5M.fbin: https://zenodo.org/records/11090378/files/clip.webvid.base.2.5M.fbin
query.train.2.5M.fbin: https://zenodo.org/records/11090378/files/webvid.query.train.2.5M.fbin
query.10k.fbin: https://zenodo.org/records/11090378/files/webvid.query.10k.fbin
gt.10k.ibin: https://zenodo.org/records/11090378/files/webvid.gt.10k.ibin
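
The `.fbin` files listed above can be read with a few lines of NumPy. The sketch below assumes the common layout of two unsigned 32-bit integers (number of vectors, dimension) followed by the row-major float32 payload. The function name `read_bin` and the `max_rows` parameter, which reads only a prefix of a large file, are hypothetical conveniences. Passing `dtype=np.int32` for `.ibin` files is likewise an assumption: some ground-truth files append a block of distances after the ids, which this reader simply ignores.

```python
import numpy as np


def read_bin(path, dtype=np.float32, max_rows=None):
    """Read an .fbin (float32) or .ibin (int32) file into a 2-D array.

    Assumed layout: uint32 n, uint32 dim, then n * dim values of `dtype`.
    """
    with open(path, "rb") as f:
        n, dim = np.fromfile(f, dtype=np.uint32, count=2)
        n, dim = int(n), int(dim)
        if max_rows is not None:
            n = min(n, max_rows)  # read only the first `max_rows` vectors
        data = np.fromfile(f, dtype=dtype, count=n * dim)
    # reshape raises if the file is shorter than the header claims
    return data.reshape(n, dim)
```

For example, `read_bin("base.10M.fbin", max_rows=100_000)` would load only the first 100,000 base vectors, which helps when testing a pipeline before committing to the full file.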
Dataset | Download |
---|---|
SIFT1M | sift1m.tar.gz |
Paper | paper.tar.gz |
HQANN adds attributes to the base datasets (SIFT, GIST, and so on), using the attribute files from NHQ.
Dataset | Filtered-DiskANN Attributes Download |
---|---|
SIFT1M | sift_attribute.tar.gz |
GIST1M | gist_attribute.tar.gz |
GloVe | glove-100_attribute.tar.gz |
The VBASE datasets are closed source, so open-source datasets will be used for the experiments instead.