Commit
add some risk warnings for custom dataset
- limit the number of test query vectors.

Signed-off-by: min.tian <min.tian.cn@gmail.com>
alwayslove2013 authored and XuanYang-cn committed Jan 20, 2025
1 parent 811564a commit 4f21fcf
Showing 2 changed files with 19 additions and 1 deletion.
7 changes: 7 additions & 0 deletions README.md
@@ -319,6 +319,13 @@ We have strict requirements for the data set format, please follow them.
- `Folder Path` - The path to the folder containing all the files. Please ensure that all files in the folder are in the `Parquet` format.
- Vectors data files: The file must be named `train.parquet` and should have two columns: `id` as an incrementing `int` and `emb` as an array of `float32`.
- Query test vectors: The file must be named `test.parquet` and should have two columns: `id` as an incrementing `int` and `emb` as an array of `float32`.
- We recommend limiting the number of test query vectors to roughly 1,000.
When running concurrent query tests, VectorDBBench spawns a large number of processes.
To minimize communication overhead during testing,
each process is given its own complete copy of the test queries so it can run independently.
As the number of concurrent processes grows,
the total number of copied query vectors grows with it,
which can put substantial pressure on memory resources.
- Ground truth file: The file must be named `neighbors.parquet` and should have two columns: `id` corresponding to query vectors and `neighbors_id` as an array of `int`.

- `Train File Count` - If the vector file is too large, you can consider splitting it into multiple files. The naming format for the split files should be `train-[index]-of-[file_count].parquet`. For example, `train-01-of-10.parquet` represents the second file (0-indexed) among 10 split files.
13 changes: 12 additions & 1 deletion vectordb_bench/frontend/components/custom/displaypPrams.py
@@ -3,11 +3,22 @@ def displayParams(st):
"""
- `Folder Path` - The path to the folder containing all the files. Please ensure that all files in the folder are in the `Parquet` format.
- Vectors data files: The file must be named `train.parquet` and should have two columns: `id` as an incrementing `int` and `emb` as an array of `float32`.
- Query test vectors: The file must be named `test.parquet` and should have two columns: `id` as an incrementing `int` and `emb` as an array of `float32`.
- Query test vectors: The file must be named `test.parquet` and should have two columns: `id` as an incrementing `int` and `emb` as an array of `float32`.
- Ground truth file: The file must be named `neighbors.parquet` and should have two columns: `id` corresponding to query vectors and `neighbors_id` as an array of `int`.
- `Train File Count` - If the vector file is too large, you can consider splitting it into multiple files. The naming format for the split files should be `train-[index]-of-[file_count].parquet`. For example, `train-01-of-10.parquet` represents the second file (0-indexed) among 10 split files.
- `Use Shuffled Data` - If you check this option, the vector data files need to be modified. VectorDBBench will load the data labeled with `shuffle`. For example, use `shuffle_train.parquet` instead of `train.parquet` and `shuffle_train-04-of-10.parquet` instead of `train-04-of-10.parquet`. The `id` column in the shuffled data can be in any order.
"""
)
st.caption(
"""We recommend limiting the number of test query vectors, like 1,000.""",
help="""
When conducting concurrent query tests, Vdbbench creates a large number of processes.
To minimize additional communication overhead during testing,
we prepare a complete set of test queries for each process, allowing them to run independently.\n
However, this means that as the number of concurrent processes increases,
the number of copied query vectors also increases significantly,
which can place substantial pressure on memory resources.
""",
)
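The memory-pressure warning in the caption can be made concrete with a back-of-the-envelope estimate. The formula below is an illustrative assumption (per-process float32 query copies only), not VectorDBBench's actual memory accounting:

```python
# Rough estimate of the memory consumed by per-process query copies:
# each of `n_processes` workers holds its own copy of `n_queries`
# float32 vectors of dimension `dim`.
def query_copy_bytes(n_queries: int, dim: int, n_processes: int) -> int:
    bytes_per_vector = dim * 4  # 4 bytes per float32 component
    return n_queries * bytes_per_vector * n_processes

# 1,000 queries of dim 1536 across 64 processes: ~0.37 GB total,
# while 100,000 queries under the same setup would need ~37 GB.
gb = query_copy_bytes(1_000, 1536, 64) / 1024**3
```

This is why capping the query set at around 1,000 vectors keeps concurrent tests well within typical RAM budgets.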
