docs(update): Add documentation for new chunking strategy parameter (#…

…361)

* Add chunking strategy option to topics
* Add more information about the chunking strategy parameter
* Updates

---------

Signed-off-by: Paul Wozniczka <25128922+pwoznic@users.noreply.github.com>
vectara · Dec 16, 2024 · 121b28b
1 parent 90246bf
commit 121b28b

Showing 3 changed files with 63 additions and 19 deletions.
diff --git a/www/docs/api-reference/indexing-apis/file-upload/file_upload.md b/www/docs/api-reference/indexing-apis/file-upload/file_upload.md
@@ -55,10 +55,14 @@ following parts:
 
 - `metadata` - (Optional) Specifies a JSON object representing any additional
   metadata to be associated with the extracted document.
-- `chunking_strategy` (Optional): A JSON object defining the chunking strategy 
-  for breaking the document into parts. If unspecified, the default chunking 
-  strategy creates one chunk per sentence.
-  Example: `'chunking_strategy={"type":"max_chars_chunking_strategy","max_chars_per_chunk":200};type=application/json'`
+- `chunking_strategy` (Optional): Specifies how to split the document into
+  chunks during ingestion. If not set, the platform defaults to sentence-based
+  chunking, where each chunk contains one full sentence. To use larger chunks,
+  set the `type` to `max_chars_chunking_strategy` and set `max_chars_per_chunk`
+  to the maximum number of characters per chunk, such as `512` or `1024`.
+  Smaller chunks may improve granularity but can increase latency, especially
+  in applications with high document volumes or large corpora.
+  Example: `'chunking_strategy={"type":"max_chars_chunking_strategy","max_chars_per_chunk":1024};type=application/json'`
 - `table_extraction_config` (Optional): A JSON object specifying whether to extract 
   tables from the PDF. By default, tables are not extracted.
   Example: `'table_extraction_config={"extract_tables":true};type=application/json'`
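
The `chunking_strategy` form part documented in the hunk above can be generated rather than typed by hand. The following is a minimal sketch, assuming the part string is passed to a multipart client such as `curl -F`; the helper name `chunking_strategy_part` is hypothetical and not part of any Vectara SDK.

```python
import json


def chunking_strategy_part(max_chars_per_chunk: int) -> str:
    """Build the multipart form value for the chunking_strategy part.

    Hypothetical helper: it only formats the string shown in the docs example.
    """
    if max_chars_per_chunk <= 0:
        raise ValueError("max_chars_per_chunk must be a positive integer")
    strategy = {
        "type": "max_chars_chunking_strategy",
        "max_chars_per_chunk": max_chars_per_chunk,
    }
    # Compact separators reproduce the whitespace-free JSON used in the docs.
    payload = json.dumps(strategy, separators=(",", ":"))
    return f"chunking_strategy={payload};type=application/json"


print(chunking_strategy_part(1024))
```

Generating the JSON with `json.dumps` avoids quoting mistakes when the value is embedded in a shell command.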

diff --git a/www/docs/api-reference/indexing-apis/indexing.md b/www/docs/api-reference/indexing-apis/indexing.md
@@ -31,7 +31,7 @@ stream the result or receive a complete response.
 
 :::
 
-### Index Document Request and Response
+## Index Document Request and Response
 
 To index a document, send a POST request to `/v2/corpora/:corpus_key/documents`,
 where `corpus_key` is the unique identifier for the corpus where you want to
@@ -54,13 +54,28 @@ indicates how much quota would have been consumed.
   `title`, `description`, `metadata`, `custom_dimensions`, and an array of
   `sections`. These sections each have an `id`, `title`, `text`, `metadata`,
   and nested `sections`.
-
+  
 :::note
-
 The storage quota object returns the number of characters consumed and the
 number of metadata characters consumed. The total quota consumed is simply the
 sum of both values.
+:::
+
+### Structured document chunking
 
+By default, Vectara uses sentence-based chunking, where each chunk consists of 
+one complete sentence. This strategy works well but can lead to higher 
+retrieval latency because of the increased number of chunks. Alternatively, 
+you can use character-based chunking to make the chunks larger.
+
+Set the `chunking_strategy` `type` to `max_chars_chunking_strategy` and set
+`max_chars_per_chunk` to a value from `512` to `1024` characters to create
+larger chunks of roughly 3-7 sentences. This approach balances retrieval speed and contextual integrity.
+
+
+:::tip
+If not set, the platform defaults to sentence-based chunking, where each chunk 
+contains one full sentence. For more details, see [Document chunking](/docs/learn/select-ideal-indexing-api#document-chunking).
 :::
 
 ## Core Document Object Definition
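
For the structured document chunking described in this hunk, a request body might be assembled as sketched below. This is an illustration under stated assumptions: the placement of `chunking_strategy` as a top-level field of the document body and the `"structured"` value for `type` are assumptions not confirmed by this diff, and `build_index_request` is a hypothetical helper. The sketch only builds the JSON and does not send it.

```python
import json
from typing import Optional


def build_index_request(doc_id: str, section_texts: list,
                        max_chars_per_chunk: Optional[int] = None) -> dict:
    """Assemble a body for POST /v2/corpora/:corpus_key/documents (sketch)."""
    body = {
        "id": doc_id,
        "type": "structured",  # assumption: value not shown in this diff
        "sections": [{"text": text} for text in section_texts],
    }
    if max_chars_per_chunk is not None:
        # Assumption: chunking_strategy sits at the top level of the body.
        body["chunking_strategy"] = {
            "type": "max_chars_chunking_strategy",
            "max_chars_per_chunk": max_chars_per_chunk,
        }
    # When max_chars_per_chunk is None the field is omitted, which the docs
    # describe as falling back to sentence-based chunking.
    return body


print(json.dumps(build_index_request("doc-1", ["First section."], 512), indent=2))
```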

diff --git a/www/docs/learn/select-ideal-indexing-api.md b/www/docs/learn/select-ideal-indexing-api.md
@@ -1,7 +1,7 @@
 ---
 id: select-ideal-indexing-api
-title: Indexing
-sidebar_label: Indexing
+title: Data Ingestion
+sidebar_label: Data Ingestion
 ---
 
 import Tabs from '@theme/Tabs';
@@ -10,15 +10,15 @@ import CodeBlock from '@theme/CodeBlock';
 import {vars} from '@site/static/variables.json';
 import {Config} from '@site/docs/definitions.md';
 
-Efficient data ingestion is critical for ensuring that your application 
-delivers fast, accurate, and relevant query results. Whether handling 
-structured, semi-structured, or unstructured data, selecting the 
-right indexing method can significantly impact the performance and usability 
-of your applications. Vectara offers multiple indexing methods to accommodate 
-different use cases that enable users to efficiently index their data and 
-leverage our advanced search capabilities. This flexible approach allows for 
-the precise integration of Vectara’s search functionalities into different 
-applications.
+Efficient data ingestion, also known as indexing, is critical for ensuring 
+that your application delivers fast, accurate, and relevant query results. 
+Whether handling structured, semi-structured, or unstructured data, selecting 
+the right indexing method can significantly impact the performance and 
+usability of your applications. Vectara offers multiple indexing methods to 
+accommodate different use cases that enable users to efficiently index their 
+data and leverage our advanced search capabilities. This flexible approach 
+allows for the precise integration of Vectara’s search functionalities into 
+different applications.
 
 ## Vectara Ingest: sample data ingestion framework
 
@@ -74,7 +74,6 @@ We recommend this option for applications where documents already have a
 clear and consistent structure like news articles, product descriptions, 
 rows in database tables or CSV files, or records from an ERP system.
 
-
 ### Core document definition
 
 For the most advanced use cases, if you want full, granular control to chunk 
@@ -96,3 +95,29 @@ organizational stakeholders.
 By leveraging the appropriate data indexing method based on the nature of 
 your documents, you can ingest and structure your data for optimal performance 
 with Vectara's Retrieval Augmented Generation as-a-Service platform.
+
+## Document chunking
+
+Chunking refers to the process of breaking a document into smaller parts 
+(chunks) for efficient indexing and retrieval. Chunking is critical for 
+optimizing search performance, particularly for large documents and corpora.
+
+Both the [File Upload API](/docs/api-reference/indexing-apis/file-upload/file-upload) and [Indexing API](/docs/api-reference/indexing-apis/indexing) provide an optional 
+`chunking_strategy` parameter that enables you to define how to chunk 
+documents during ingestion. When deciding on a chunking strategy, consider 
+the trade-offs between granularity and latency.
+
+### Default chunking
+
+By default, the platform uses sentence-based chunking, where each chunk 
+contains one complete sentence. This strategy can lead to higher retrieval 
+latency for large documents due to the increased number of chunks created.
+
+### Fixed-size chunking
+
+When you set the `type` to `max_chars_chunking_strategy`, you can then define 
+the maximum number of characters per chunk, which enables more granular control 
+over how the platform splits the document. We recommend trying 3–7 sentences 
+per chunk, which is about 512–1024 characters. This range may be ideal for balancing 
+retrieval latency and context preservation.
+
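
The trade-off between the two strategies in the Document chunking section can be illustrated locally. The sketch below is not Vectara's chunking algorithm: it uses a naive regex sentence splitter and a greedy packer, both assumptions made purely to show how a character limit reduces the number of chunks.

```python
import re


def sentence_chunks(text: str) -> list:
    # Naive splitter: break after ., !, or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def max_chars_chunks(text: str, max_chars_per_chunk: int) -> list:
    # Greedily pack whole sentences into chunks of at most max_chars_per_chunk.
    # A single sentence longer than the limit still becomes its own chunk.
    chunks, current = [], ""
    for sentence in sentence_chunks(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars_per_chunk:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = " ".join(f"Sentence number {i} describes one fact." for i in range(40))
print("sentence-based chunks:", len(sentence_chunks(text)))
print("max-chars chunks (512):", len(max_chars_chunks(text, 512)))
```

Fewer, larger chunks mean fewer items to index and retrieve, at the cost of coarser matches.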