Commit

Add more information about the chunking strategy parameter
pwoznic committed Nov 25, 2024
1 parent fe9bdcd commit 7072af8
Showing 3 changed files with 27 additions and 39 deletions.
@@ -54,11 +54,11 @@ following parts:

- `metadata` - (Optional) Specifies a JSON object representing any additional
metadata to be associated with the extracted document.
-- `chunking_strategy` - (Optional) Specifies how the document should be split into
+- `chunking_strategy` - (Optional) Specifies whether to split the document into
chunks during ingestion. Set the `type` as `max_chars_chunking_strategy` and
-then specify the `max_chars_per_chunk` to the number of characters per chunk.
-If not set, the platform defaults to sentence-based chunking, where each chunk
-contains one full sentence.
+then specify the `max_chars_per_chunk` to the number of characters per chunk
+like `200`. If not set, the platform defaults to sentence-based chunking, where
+each chunk contains one full sentence.
- `file` - Specifies the file that you want to upload.
- `filename` - Specified as part of the `file` field with the file name that you
want to associate with the uploaded file.
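
A minimal sketch of how the multipart parts listed in this hunk might be assembled in Python. The helper name `build_upload_parts`, the `application/json` content type on the JSON parts, and the sample file name and metadata are assumptions made for illustration; the endpoint URL and authentication are left out because this diff does not show them. The tuples follow the `(filename, content, content_type)` shape that the `requests` library accepts in its `files=` argument.

```python
import json


def build_upload_parts(filename, content, metadata=None, max_chars_per_chunk=None):
    """Assemble multipart form parts for a file upload (hypothetical helper)."""
    # The `file` part carries the filename alongside the raw bytes.
    parts = {"file": (filename, content)}

    # `metadata` is optional: a JSON object associated with the document.
    if metadata is not None:
        parts["metadata"] = (None, json.dumps(metadata), "application/json")

    # `chunking_strategy` is optional: when omitted, the docs say the platform
    # falls back to sentence-based chunking.
    if max_chars_per_chunk is not None:
        strategy = {
            "type": "max_chars_chunking_strategy",
            "max_chars_per_chunk": max_chars_per_chunk,
        }
        parts["chunking_strategy"] = (None, json.dumps(strategy), "application/json")

    return parts


# Sample values below are made up for illustration.
parts = build_upload_parts(
    "report.txt",
    b"First sentence. Second sentence.",
    metadata={"department": "docs"},
    max_chars_per_chunk=200,
)
print(sorted(parts))
```

Leaving `max_chars_per_chunk` unset omits the `chunking_strategy` part entirely, which matches the documented default of sentence-based chunking.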
17 changes: 4 additions & 13 deletions www/docs/api-reference/indexing-apis/indexing.md
@@ -64,20 +64,11 @@ sum of both values.
### Structured document chunking

Structured documents can also specify a `chunking_strategy` which indicates
-how to split the document into chunks during ingestion. Set the `type` as
+whether to split the document into chunks during ingestion. Set the `type` as
`max_chars_chunking_strategy` and then specify the `max_chars_per_chunk` to
-the number of characters per chunk. If not set, the platform defaults to
-sentence-based chunking, where each chunk contains one full sentence. For more
-details, see [Chunking strategy](/docs/learn/select-ideal-indexing-api#chunking-strategy).
-
-In this example, you apply a limit of 200 characters per chunk:
-
-```json
-"chunking_strategy": {
-"type": "max_chars_chunking_strategy",
-"max_chars_per_chunk": 200
-}
-```
+the number of characters per chunk such as `200`. If not set, the platform
+defaults to sentence-based chunking, where each chunk contains one full
+sentence. For more details, see [Document chunking](/docs/learn/select-ideal-indexing-api#document-chunking).

## Core Document Object Definition

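
Since this hunk removes the inline JSON example from `indexing.md`, the following is a small sketch of how a structured document body could carry the same `chunking_strategy` object. The helper name `with_chunking` and the sample `id` field are hypothetical; only the `type` and `max_chars_per_chunk` keys come from the documentation text in this diff.

```python
import json
from typing import Optional


def with_chunking(document: dict, max_chars_per_chunk: Optional[int] = None) -> dict:
    """Return a copy of `document` with an optional chunking_strategy attached."""
    body = dict(document)  # shallow copy so the caller's dict is not mutated
    if max_chars_per_chunk is not None:
        if max_chars_per_chunk <= 0:
            raise ValueError("max_chars_per_chunk must be a positive integer")
        body["chunking_strategy"] = {
            "type": "max_chars_chunking_strategy",
            "max_chars_per_chunk": max_chars_per_chunk,
        }
    return body


# `id` is a placeholder field used only for this illustration.
doc = {"id": "doc-1"}
print(json.dumps(with_chunking(doc, 200), indent=2))
```

Calling `with_chunking(doc)` without a limit leaves the body unchanged, so the platform's default sentence-based chunking would apply.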
41 changes: 19 additions & 22 deletions www/docs/learn/select-ideal-indexing-api.md
@@ -1,7 +1,7 @@
---
id: select-ideal-indexing-api
-title: Indexing
-sidebar_label: Indexing
+title: Data Ingestion
+sidebar_label: Data Ingestion
---

import Tabs from '@theme/Tabs';
@@ -10,15 +10,15 @@ import CodeBlock from '@theme/CodeBlock';
import {vars} from '@site/static/variables.json';
import {Config} from '@site/docs/definitions.md';

-Efficient data ingestion is critical for ensuring that your application
-delivers fast, accurate, and relevant query results. Whether handling
-structured, semi-structured, or unstructured data, selecting the
-right indexing method can significantly impact the performance and usability
-of your applications. Vectara offers multiple indexing methods to accommodate
-different use cases that enable users to efficiently index their data and
-leverage our advanced search capabilities. This flexible approach allows for
-the precise integration of Vectara’s search functionalities into different
-applications.
+Efficient data ingestion, also known as indexing, is critical for ensuring
+that your application delivers fast, accurate, and relevant query results.
+Whether handling structured, semi-structured, or unstructured data, selecting
+the right indexing method can significantly impact the performance and
+usability of your applications. Vectara offers multiple indexing methods to
+accommodate different use cases that enable users to efficiently index their
+data and leverage our advanced search capabilities. This flexible approach
+allows for the precise integration of Vectara’s search functionalities into
+different applications.

## Vectara Ingest: sample data ingestion framework

@@ -74,7 +74,6 @@ We recommend this option for applications where documents already have a
clear and consistent structure like news articles, product descriptions,
rows in database tables or CSV files, or records from an ERP system.

-
### Core document definition

For the most advanced use cases, if you want full, granular control to chunk
@@ -97,21 +96,19 @@ By leveraging the appropriate data indexing method based on the nature of
your documents, you can ingest and structure your data for optimal performance
with Vectara's Retrieval Augmented Generation as-a-Service platform.

-## Chunking strategy
+## Document chunking

Chunking refers to the process of breaking a document into smaller parts
(chunks) for efficient indexing and retrieval. Chunking is critical for
optimizing search performance, particularly for large documents.

-Both the [File Upload API](/docs/api-reference/indexing-apis/file-upload/file-upload) and [Indexing API](/docs/api-reference/indexing-apis/indexing) provide an optional `chunking_strategy`
-parameter that enables you to define how documents should be chunked during
-ingestion. When you set the `type` to `max_chars_chunking_strategy`, you can
-then define the maximum number of characters per chunk which enables more
-granular control over how the platform splits the document.
+Both the [File Upload API](/docs/api-reference/indexing-apis/file-upload/file-upload) and [Indexing API](/docs/api-reference/indexing-apis/indexing) provide an optional
+`chunking_strategy` parameter that enables you to define how to chunk
+documents during ingestion. When you set the `type` to
+`max_chars_chunking_strategy`, you can then define the maximum number of
+characters per chunk, which enables more granular control over how the
+platform splits the document.

If you do not set this option, then the platform uses the default chunking
-strategy that defaults to each chunk containing one full sentence.
-
-
-
+strategy that splits each chunk into one full sentence.
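
To make the difference between the two strategies described in this section concrete, here is a rough local illustration in Python. It is not the platform's algorithm: the regex-based sentence splitter and the fixed-width slicer are simplifying assumptions, and the function names `sentence_chunks` and `max_char_chunks` are hypothetical.

```python
import re


def sentence_chunks(text: str) -> list:
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def max_char_chunks(text: str, max_chars_per_chunk: int) -> list:
    """Slice text into consecutive pieces of at most max_chars_per_chunk characters."""
    if max_chars_per_chunk <= 0:
        raise ValueError("max_chars_per_chunk must be a positive integer")
    return [
        text[i : i + max_chars_per_chunk]
        for i in range(0, len(text), max_chars_per_chunk)
    ]


sample = "Chunking splits documents. Smaller chunks aid retrieval! Does size matter?"
print(sentence_chunks(sample))      # one chunk per sentence
print(max_char_chunks(sample, 30))  # fixed-size chunks, ignoring sentence boundaries
```

The fixed-width slicer here can cut in the middle of a word or sentence; whether and how the platform respects such boundaries under `max_chars_chunking_strategy` is not stated in this diff.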
