From 7072af8eb4da792410702f63187505a5f00fe058 Mon Sep 17 00:00:00 2001 From: Paul Wozniczka <25128922+pwoznic@users.noreply.github.com> Date: Mon, 25 Nov 2024 14:25:19 -0600 Subject: [PATCH] Add more information about the chunking strategy parameter --- .../indexing-apis/file-upload/file_upload.md | 8 ++-- .../api-reference/indexing-apis/indexing.md | 17 ++------ www/docs/learn/select-ideal-indexing-api.md | 41 +++++++++---------- 3 files changed, 27 insertions(+), 39 deletions(-) diff --git a/www/docs/api-reference/indexing-apis/file-upload/file_upload.md b/www/docs/api-reference/indexing-apis/file-upload/file_upload.md index ec2b467d..bd655c64 100644 --- a/www/docs/api-reference/indexing-apis/file-upload/file_upload.md +++ b/www/docs/api-reference/indexing-apis/file-upload/file_upload.md @@ -54,11 +54,11 @@ following parts: - `metadata` - (Optional) Specifies a JSON object representing any additional metadata to be associated with the extracted document. -- `chunking_strategy` - (Optional) Specifies how the document should be split into +- `chunking_strategy` - (Optional) Specifies whether to split the document into chunks during ingestion. Set the `type` as `max_chars_chunking_strategy` and - then specify the `max_chars_per_chunk` to the number of characters per chunk. - If not set, the platform defaults to sentence-based chunking, where each chunk - contains one full sentence. + then specify the `max_chars_per_chunk` to the number of characters per chunk + like `200`. If not set, the platform defaults to sentence-based chunking, where + each chunk contains one full sentence. - `file` - Specifies the file that you want to upload. - `filename` - Specified as part of the `file` field with the file name that you want to associate with the uploaded file. 
diff --git a/www/docs/api-reference/indexing-apis/indexing.md b/www/docs/api-reference/indexing-apis/indexing.md index dada077c..342babbc 100644 --- a/www/docs/api-reference/indexing-apis/indexing.md +++ b/www/docs/api-reference/indexing-apis/indexing.md @@ -64,20 +64,11 @@ sum of both values. ### Structured document chunking Structured documents can also specify a `chunking_strategy` which indicates -how to split the document into chunks during ingestion. Set the `type` as +whether to split the document into chunks during ingestion. Set the `type` as `max_chars_chunking_strategy` and then specify the `max_chars_per_chunk` to -the number of characters per chunk. If not set, the platform defaults to -sentence-based chunking, where each chunk contains one full sentence. For more -details, see [Chunking strategy](/docs/learn/select-ideal-indexing-api#chunking-strategy). - -In this example, you apply a limit of 200 characters per chunk: - -```json -"chunking_strategy": { - "type": "max_chars_chunking_strategy", - "max_chars_per_chunk": 200 -} -``` +the number of characters per chunk such as `200`. If not set, the platform +defaults to sentence-based chunking, where each chunk contains one full +sentence. For more details, see [Document chunking](/docs/learn/select-ideal-indexing-api#document-chunking). 
## Core Document Object Definition diff --git a/www/docs/learn/select-ideal-indexing-api.md b/www/docs/learn/select-ideal-indexing-api.md index 5afcab27..d0b9c12a 100644 --- a/www/docs/learn/select-ideal-indexing-api.md +++ b/www/docs/learn/select-ideal-indexing-api.md @@ -1,7 +1,7 @@ --- id: select-ideal-indexing-api -title: Indexing -sidebar_label: Indexing +title: Data Ingestion +sidebar_label: Data Ingestion --- import Tabs from '@theme/Tabs'; @@ -10,15 +10,15 @@ import CodeBlock from '@theme/CodeBlock'; import {vars} from '@site/static/variables.json'; import {Config} from '@site/docs/definitions.md'; -Efficient data ingestion is critical for ensuring that your application -delivers fast, accurate, and relevant query results. Whether handling -structured, semi-structured, or unstructured data, selecting the -right indexing method can significantly impact the performance and usability -of your applications. Vectara offers multiple indexing methods to accommodate -different use cases that enable users to efficiently index their data and -leverage our advanced search capabilities. This flexible approach allows for -the precise integration of Vectara’s search functionalities into different -applications. +Efficient data ingestion, also known as indexing, is critical for ensuring +that your application delivers fast, accurate, and relevant query results. +Whether handling structured, semi-structured, or unstructured data, selecting +the right indexing method can significantly impact the performance and +usability of your applications. Vectara offers multiple indexing methods to +accommodate different use cases that enable users to efficiently index their +data and leverage our advanced search capabilities. This flexible approach +allows for the precise integration of Vectara’s search functionalities into +different applications. 
## Vectara Ingest: sample data ingestion framework @@ -74,7 +74,6 @@ We recommend this option for applications where documents already have a clear and consistent structure like news articles, product descriptions, rows in database tables or CSV files, or records from an ERP system. - ### Core document definition For the most advanced use cases, if you want full, granular control to chunk @@ -97,21 +96,19 @@ By leveraging the appropriate data indexing method is based on the nature of your documents, you can ingest and structure your data for optimal performance with Vectara's Retrieval Augmented Generatation as-a-Service platform. -## Chunking strategy +## Document chunking Chunking refers to the process of breaking a document into smaller parts (chunks) for efficient indexing and retrieval. Chunking is critical for optimizing search performance, particularly for large documents. -Both the [File Upload API](/docs/api-reference/indexing-apis/file-upload/file-upload) and [Indexing API](/docs/api-reference/indexing-apis/indexing) provide an optional `chunking_strategy` -parameter that enables you to define how documents should be chunked during -ingestion. When you set the `type` to `max_chars_chunking_strategy`, you can -then define the maximum number of characters per chunk which enables more -granular control over how the platform splits the document. +Both the [File Upload API](/docs/api-reference/indexing-apis/file-upload/file-upload) and [Indexing API](/docs/api-reference/indexing-apis/indexing) provide an optional +`chunking_strategy` parameter that enables you to define how to chunk +documents during ingestion. When you set the `type` to +`max_chars_chunking_strategy`, you can then define the maximum number of +characters per chunk, which enables more granular control over how the +platform splits the document. If you do not set this option, then the platform uses the default chunking -strategy that defaults to each chunk containing one full sentence. 
- - - +strategy, in which each chunk contains one full sentence.
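
The two chunking behaviors this patch documents can be sketched roughly as follows. This is an illustration only, not the platform's implementation. The `type` and `max_chars_per_chunk` fields come from the docs being edited, while the helper names `chunking_strategy`, `sentence_chunks`, and `max_chars_chunks`, the regex-based sentence splitter, and the word-packing logic are assumptions made for the example.

```python
import json
import re


def chunking_strategy(max_chars_per_chunk=None):
    """Build the optional `chunking_strategy` object (hypothetical helper).

    Returning None stands for omitting the field, in which case the docs say
    sentence-based chunking applies.
    """
    if max_chars_per_chunk is None:
        return None
    return {
        "type": "max_chars_chunking_strategy",
        "max_chars_per_chunk": max_chars_per_chunk,
    }


def sentence_chunks(text):
    """Approximate the default: one sentence per chunk (naive regex split)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def max_chars_chunks(text, max_chars):
    """Approximate max-chars chunking by packing whole words up to max_chars.

    Assumption: a single word longer than max_chars becomes its own
    oversized chunk rather than being cut.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Request fragment with a 200-character limit, as in the docs' example value.
    print(json.dumps({"chunking_strategy": chunking_strategy(200)}, indent=2))
```

With a short limit, `max_chars_chunks` groups words across sentence boundaries, whereas `sentence_chunks` keeps each sentence separate. That difference is the trade-off the `chunking_strategy` parameter exposes.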