Commit

docs(update): Add documentation for new chunking strategy parameter (#…
…361)

* Add chunking strategy option to topics

* Add more information about the chunking strategy parameter

* Updates

---------

Signed-off-by: Paul Wozniczka <25128922+pwoznic@users.noreply.github.com>
pwoznic authored Dec 16, 2024
1 parent 90246bf commit 121b28b
Showing 3 changed files with 63 additions and 19 deletions.
12 changes: 8 additions & 4 deletions www/docs/api-reference/indexing-apis/file-upload/file_upload.md
@@ -55,10 +55,14 @@ following parts:

- `metadata` - (Optional) Specifies a JSON object representing any additional
metadata to be associated with the extracted document.
- `chunking_strategy` (Optional): A JSON object defining the chunking strategy
for breaking the document into parts. If unspecified, the default chunking
strategy creates one chunk per sentence.
Example: `'chunking_strategy={"type":"max_chars_chunking_strategy","max_chars_per_chunk":200};type=application/json'`
- `chunking_strategy` (Optional) Specifies how to split the document into
chunks during ingestion. If not set, the platform defaults to sentence-based
chunking, where each chunk contains one full sentence. To chunk by character
count instead, set the `type` to `max_chars_chunking_strategy` and set
`max_chars_per_chunk` to the maximum number of characters per chunk, such as
`512` or `1024`. Smaller chunks may improve granularity but can increase
latency, especially in applications with high document volumes or large
corpora.
Example: `'chunking_strategy={"type":"max_chars_chunking_strategy","max_chars_per_chunk":1024};type=application/json'`
- `table_extraction_config` (Optional): A JSON object specifying whether to extract
tables from the PDF. By default, tables are not extracted.
Example: `'table_extraction_config={"extract_tables":true};type=application/json'`
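
As a minimal sketch of how the `chunking_strategy` form part can be assembled, the following Python helper builds the same string shown in the example for that part. The helper name `chunking_strategy_part` is hypothetical and not part of any Vectara SDK; only the `type` and `max_chars_per_chunk` fields come from this documentation.

```python
import json


def chunking_strategy_part(max_chars_per_chunk: int) -> str:
    """Build the multipart form field for character-based chunking.

    Hypothetical helper for illustration: it only formats the string and
    does not send a request.
    """
    if max_chars_per_chunk <= 0:
        raise ValueError("max_chars_per_chunk must be a positive integer")

    strategy = {
        "type": "max_chars_chunking_strategy",
        "max_chars_per_chunk": max_chars_per_chunk,
    }
    # Compact separators reproduce the whitespace-free JSON used in the docs.
    body = json.dumps(strategy, separators=(",", ":"))
    return f"chunking_strategy={body};type=application/json"


part = chunking_strategy_part(1024)
print(part)
```

The resulting string can be passed as a single form argument to an HTTP client that accepts `name=value;type=...` parts. Omitting the part entirely keeps the default sentence-based chunking.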
21 changes: 18 additions & 3 deletions www/docs/api-reference/indexing-apis/indexing.md
@@ -31,7 +31,7 @@ stream the result or receive a complete response.

:::

### Index Document Request and Response
## Index Document Request and Response

To index a document, send a POST request to `/v2/corpora/:corpus_key/documents`,
where `corpus_key` is the unique identifier for the corpus where you want to
@@ -54,13 +54,28 @@ indicates how much quota would have been consumed.
`title`, `description`, `metadata`, `custom_dimensions`, and an array of
`sections`. These sections each have an `id`, `title`, `text`, `metadata`,
and nested `sections`.

:::note

The storage quota object returns the number of characters consumed and the
number of metadata characters consumed. The total quota consumed is simply the
sum of both values.
:::

### Structured document chunking

By default, Vectara uses sentence-based chunking, where each chunk consists of
one complete sentence. This strategy works well but can lead to higher
retrieval latency because of the increased number of chunks. Alternatively,
you can use character-based chunking to make the chunks larger.

Set the `type` to `max_chars_chunking_strategy` and set `max_chars_per_chunk`
to a value between `512` and `1024` characters to create larger chunks of
roughly 3-7 sentences each. This approach balances retrieval speed and
contextual integrity.
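
The following Python sketch shows one way such a request could be assembled before sending it. The function name `build_index_request` is hypothetical, the placement of `chunking_strategy` at the top level of the request body is an assumption, and other fields the API may require (such as a document identifier) are omitted for brevity.

```python
import json
from typing import Optional


def build_index_request(
    corpus_key: str,
    title: str,
    section_texts: list[str],
    max_chars_per_chunk: Optional[int] = None,
) -> tuple[str, dict]:
    """Return the request path and JSON body for indexing a document.

    Hypothetical helper: it builds the payload only and sends nothing.
    """
    path = f"/v2/corpora/{corpus_key}/documents"
    body: dict = {
        "title": title,
        "sections": [{"text": text} for text in section_texts],
    }
    if max_chars_per_chunk is not None:
        # Assumption: chunking_strategy sits at the top level of the body.
        body["chunking_strategy"] = {
            "type": "max_chars_chunking_strategy",
            "max_chars_per_chunk": max_chars_per_chunk,
        }
    return path, body


path, body = build_index_request(
    "my-corpus", "Release notes", ["First section.", "Second section."], 1024
)
print(path)
print(json.dumps(body, indent=2))
```

Leaving `max_chars_per_chunk` unset produces a body without `chunking_strategy`, which corresponds to the default sentence-based behavior.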


:::tip
If not set, the platform defaults to sentence-based chunking, where each chunk
contains one full sentence. For more details, see [Document chunking](/docs/learn/select-ideal-indexing-api#document-chunking).
:::

## Core Document Object Definition
49 changes: 37 additions & 12 deletions www/docs/learn/select-ideal-indexing-api.md
@@ -1,7 +1,7 @@
---
id: select-ideal-indexing-api
title: Indexing
sidebar_label: Indexing
title: Data Ingestion
sidebar_label: Data Ingestion
---

import Tabs from '@theme/Tabs';
@@ -10,15 +10,15 @@ import CodeBlock from '@theme/CodeBlock';
import {vars} from '@site/static/variables.json';
import {Config} from '@site/docs/definitions.md';

Efficient data ingestion is critical for ensuring that your application
delivers fast, accurate, and relevant query results. Whether handling
structured, semi-structured, or unstructured data, selecting the
right indexing method can significantly impact the performance and usability
of your applications. Vectara offers multiple indexing methods to accommodate
different use cases that enable users to efficiently index their data and
leverage our advanced search capabilities. This flexible approach allows for
the precise integration of Vectara’s search functionalities into different
applications.
Efficient data ingestion, also known as indexing, is critical for ensuring
that your application delivers fast, accurate, and relevant query results.
Whether handling structured, semi-structured, or unstructured data, selecting
the right indexing method can significantly impact the performance and
usability of your applications. Vectara offers multiple indexing methods to
accommodate different use cases that enable users to efficiently index their
data and leverage our advanced search capabilities. This flexible approach
allows for the precise integration of Vectara’s search functionalities into
different applications.

## Vectara Ingest: sample data ingestion framework

@@ -74,7 +74,6 @@ We recommend this option for applications where documents already have a
clear and consistent structure like news articles, product descriptions,
rows in database tables or CSV files, or records from an ERP system.


### Core document definition

For the most advanced use cases, if you want full, granular control to chunk
@@ -96,3 +95,29 @@ organizational stakeholders.
By choosing the data indexing method that fits the nature of your documents,
you can ingest and structure your data for optimal performance with Vectara's
Retrieval Augmented Generation-as-a-Service platform.

## Document chunking

Chunking refers to the process of breaking a document into smaller parts
(chunks) for efficient indexing and retrieval. Chunking is critical for
optimizing search performance, particularly for large documents and corpora.

Both the [File Upload API](/docs/api-reference/indexing-apis/file-upload/file-upload) and [Indexing API](/docs/api-reference/indexing-apis/indexing) provide an optional
`chunking_strategy` parameter that enables you to define how to chunk
documents during ingestion. When deciding on a chunking strategy, consider
the trade-offs between granularity and latency.

### Default chunking

By default, the platform uses sentence-based chunking, where each chunk
contains one complete sentence. This strategy can lead to higher retrieval
latency for large documents due to the increased number of chunks created.

### Fixed-size chunking

When you set the `type` to `max_chars_chunking_strategy`, you can then define
the maximum number of characters per chunk, which enables more granular control
over how the platform splits the document. We recommend trying 3–7 sentences
per chunk, which is about 512–1024 characters. This range may be ideal for
balancing retrieval latency and context preservation.
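
To illustrate the trade-off between the two strategies, here is a small Python sketch that compares chunk counts. It is not Vectara's actual chunking algorithm: the regex-based sentence splitter and the greedy packing of whole sentences are simplifying assumptions, and both function names are hypothetical.

```python
import re


def sentence_chunks(text: str) -> list[str]:
    """Naive sentence-based chunking: one chunk per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def max_chars_chunks(text: str, max_chars_per_chunk: int) -> list[str]:
    """Greedily pack whole sentences into chunks up to the character limit.

    A single sentence longer than the limit is kept whole in this sketch.
    """
    chunks: list[str] = []
    current = ""
    for sentence in sentence_chunks(text):
        candidate = f"{current} {sentence}" if current else sentence
        if not current or len(candidate) <= max_chars_per_chunk:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


text = "One. Two. Three. Four."
print(len(sentence_chunks(text)))       # 4 chunks, one per sentence
print(len(max_chars_chunks(text, 10)))  # 3 chunks, fewer and larger
```

Fewer, larger chunks mean fewer items to retrieve, at the cost of coarser granularity within each chunk.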
