fix(bug): Update how metadata structuring is handled in API v2 #358

Merged: 2 commits, Nov 25, 2024
130 changes: 107 additions & 23 deletions www/docs/api-reference/search-apis/interpreting-responses/metadata.md
@@ -12,11 +12,12 @@ In <Config v="names.product"/>, when you [index a document](/docs/api-reference/
document has a `type` parameter that determines the format of the document
as `core` or `structured`. The `core` type has `document_parts` and the `structured`
type has `sections`. Both can be nested and both can contain separate `metadata`,
including some metadata that <Config v="names.product"/> will auto-generate.
A good example of this is that you could have a document which has some global
attributes like the `URL` or `owner` but individual sections will have a `section`
attribute and a `lang`.
including some metadata that <Config v="names.product"/> will auto-generate.

## Metadata structure

For example, a document might have global attributes such as the `URL` or `owner`
but individual sections have a `section` attribute and a `lang`.

Here's an example response with different metadata at these different levels:

@@ -29,12 +30,12 @@ Here's an example response with different metadata at these different levels:
"part_metadata": {
"speaker": "Deep Thought",
"lang": "eng",
"section": "2",
"offset": "316"
"section": 2,
"offset": 316
},
"document_metadata": {
"author": "Douglas Adams",
"publicationyear": "1979"
"publicationyear": 1979
},
"document_id": "hitchhikers-guide",
"request_corpora_index": 0
@@ -44,8 +45,8 @@ Here's an example response with different metadata at these different levels:
"score": 0.13511724770069122,
"part_metadata": {
"lang": "eng",
"section": "17",
"offset": "171"
"section": 17,
"offset": 171
},
"document_metadata": {
"author": "Dr. Seuss"
@@ -64,29 +65,112 @@ metadata. The reason for this split is that there may be multiple sections
from the same document in the response, and this allows for deduplication of
document-level metadata, which can reduce the total time for the response.

## Metadata type consistency

Metadata type conversion applies only to the `part_metadata` and
`document_metadata` fields in query responses. Metadata remains
unconverted during the document upload process, even when using API v2.
In query responses:

* **Numbers** are returned as numbers (for example, `section: 2`, `publicationyear: 1979`).
* **Booleans** are returned as `true` or `false` (case-sensitive).
* **JSON objects** maintain their native structure.

This behavior differs from API v1, where metadata such as `section` or
`publicationyear` might have been returned as strings (`"2"`, `"1979"`).
Ensure client applications handle these types correctly for smooth integration.
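
As an illustration of handling these types on the client side, here is a minimal sketch of a reader that accepts both a native number (as API v2 returns) and a numeric string (as API v1 might have returned). The helper name `as_int` is hypothetical and not part of any client library.

```python
from typing import Any


def as_int(value: Any) -> int:
    """Return an int from an API v2 number or an API v1 numeric string.

    Hypothetical helper for illustration only.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("expected a number, got a boolean")
    if isinstance(value, int):
        return value
    # JSON numbers such as 1979.0 may be decoded as floats.
    if isinstance(value, float) and value.is_integer():
        return int(value)
    if isinstance(value, str):
        return int(value)  # raises ValueError for non-numeric strings
    raise TypeError(f"cannot read {type(value).__name__} as an integer")
```

With a helper like this, `as_int(result["part_metadata"]["section"])` works whether the field arrives as `2` or `"2"`.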

## Metadata type regex patterns

The following regex examples provide information about how each type is
identified and processed. By understanding these patterns, users can account
for type conversion in their client applications.

### Numbers regex

This pattern matches valid numeric formats, including integers, decimals, and
scientific notation, ensuring they are returned as numbers instead of strings.
Examples include `section: 2` or `offset: 316`.

**Pattern:** `-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?`


| Input | Matches | Explanation |
|------------|---------|--------------------------------------------|
| `123` | ✅ | Valid integer. |
| `0` | ✅ | Valid zero. |
| `-456` | ✅ | Valid negative integer. |
| `3.14` | ✅ | Valid decimal number. |
| `-0.001` | ✅ | Valid negative decimal. |
| `2e10` | ✅ | Valid scientific notation. |
| `-1.23E-4` | ✅ | Valid negative number in scientific notation. |
| `.5` | ❌ | Invalid (missing leading integer). |
| `1e` | ❌ | Invalid (missing exponent value). |
| `1.2.3` | ❌ | Invalid (multiple decimal points). |
| `-` | ❌ | Invalid (missing digits). |
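
The table can be reproduced locally with a short sketch. The pattern as written has no anchors, so this example assumes it is applied to the whole value and uses `re.fullmatch` for that; the function name `looks_like_number` is hypothetical.

```python
import re

# Pattern copied from this section. It has no ^ or $ anchors, so the sketch
# uses fullmatch to require that the entire value is numeric (an assumption
# about how the pattern is applied).
NUMBER_PATTERN = re.compile(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?")


def looks_like_number(value: str) -> bool:
    """Return True when the whole string matches the numbers pattern."""
    return NUMBER_PATTERN.fullmatch(value) is not None
```

Without whole-string matching, an input such as `1e` would partially match on its leading `1`, which is why the sketch avoids `re.search`.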


### Boolean regex

This pattern matches exact boolean values (`true` or `false`), with exact case
sensitivity and no extra characters.

**Pattern:** `^(true|false)$`

| Input | Matches | Explanation |
|------------|---------|-------------------------------------------------|
| `true` | ✅ | Exact match for `true`. |
| `false` | ✅ | Exact match for `false`. |
| ` true` | ❌ | Invalid (leading space). |
| `false ` | ❌ | Invalid (trailing space). |
| `True` | ❌ | Invalid (case-sensitive; must be lowercase). |
| `TRUE` | ❌ | Invalid (case-sensitive; must be lowercase). |
| `falsey` | ❌ | Invalid (extra characters after `false`). |
| `truest` | ❌ | Invalid (extra characters after `true`). |
| `tru` | ❌ | Invalid (partial match; incomplete `true`). |
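
A minimal sketch for checking values against the boolean pattern follows. The function name `looks_like_boolean` is hypothetical. The sketch uses `re.fullmatch` because in Python `$` also matches just before a trailing newline, which would otherwise let `"true\n"` through.

```python
import re

# Pattern copied from this section.
BOOLEAN_PATTERN = re.compile(r"^(true|false)$")


def looks_like_boolean(value: str) -> bool:
    """Return True only for the exact strings 'true' and 'false'."""
    # fullmatch rejects a trailing newline that '$' alone would tolerate.
    return BOOLEAN_PATTERN.fullmatch(value) is not None
```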

### JSON regex

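This pattern identifies JSON-like values by their first character: a value that
starts with `{` or `[` is treated as a JSON object or array so that its
structure is maintained. The pattern does not validate the JSON itself, and
because `|` sits inside the character class, a value that starts with a literal
`|` also matches.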

**Pattern:** `^[{|\[].*$`

| Input | Matches | Explanation |
|---------------|---------|----------------------------------------------|
| `{example}` | ✅ | Starts with `{` and has additional content. |
| `[data]` | ✅ | Starts with `[` and has additional content. |
| `{` | ✅ | Matches a single `{` at the start. |
| `[` | ✅ | Matches a single `[` at the start. |
| `example` | ❌ | Does not start with `{` or `[`. |
| `something{` | ❌ | Starts with other characters, not `{`. |
| `(empty)` | ❌ | Empty string does not match. |
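
Because the pattern only inspects the first character, a client that wants to apply the same check may still need to handle values that look like JSON but do not parse. The sketch below shows one way to do that. The function names are hypothetical, and falling back to the original string on a parse failure is an assumption made for illustration, not documented service behavior.

```python
import json
import re
from typing import Any

# Pattern copied from this section. Inside the character class, '|' is a
# literal pipe, so values starting with '|' also match.
JSON_PATTERN = re.compile(r"^[{|\[].*$")


def looks_like_json(value: str) -> bool:
    """Return True when the value starts with '{', '[' or '|'."""
    return JSON_PATTERN.match(value) is not None


def parse_if_json(value: str) -> Any:
    """Parse JSON-like values; keep the raw string if parsing fails.

    The fallback is an assumption for this sketch.
    """
    if not looks_like_json(value):
        return value
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return value
```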

## Combining document and section metadata

In order to display metadata for a particular section, you may want to combine
it with the document-level metadata. To do so, look at the `document_id`
value. This tells you which document the metadata belongs to.
To display metadata for a particular section, you may want to combine it with
the document-level metadata.

Use the `document_id` value to determine which document the metadata belongs to.

For example, the first result in the `search_results` array ("Answer to the Ultimate
Question of Life, the Universe, and Everything, is 42.") has a `document_id`
value of `hitchhikers-guide` and has a `part_metadata` of `speaker:Deep Thought`, `lang:eng`,
`section:2`, and `offset:316`. These are the section-level metadata for this
For example, the first result in the `search_results` array ("Answer to the Ultimate
Question of Life, the Universe, and Everything, is 42.") has a `document_id`
value of `hitchhikers-guide` and has a `part_metadata` of `speaker:Deep Thought`, `lang:eng`,
`section:2`, and `offset:316`. These are the section-level metadata for this
result.

Because the `document_id` is `hitchhikers-guide`, we look at the first result in the
`search_results` array to find the document-level metadata and document ID. In this
case, the `id` is `hitchhikers-guide` and the document-level metadata is
Because the `document_id` is `hitchhikers-guide`, we look at the first result in the
`search_results` array to find the document-level metadata and document ID. In this
case, the `id` is `hitchhikers-guide` and the document-level metadata is
`author:Douglas Adams` and `publicationyear:1979`.

Depending on your use case, you might want to combine these metadata elements
Depending on your use case, you might want to combine these metadata elements
together for display purposes.
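
One possible way to combine the two levels is sketched below, using the field names from the example response above (`part_metadata`, `document_metadata`, `document_id`). The function name `combine_metadata` is hypothetical, and letting section-level values override document-level values when keys collide is a design choice made for this sketch.

```python
from typing import Any, Dict


def combine_metadata(result: Dict[str, Any]) -> Dict[str, Any]:
    """Merge document-level and section-level metadata for one search result."""
    # Start from the document-level metadata, then overlay the part-level
    # metadata so that the more specific value wins on a key collision.
    combined = dict(result.get("document_metadata", {}))
    combined.update(result.get("part_metadata", {}))
    combined["document_id"] = result.get("document_id")
    return combined


# The first search result from the example response above.
first_result = {
    "part_metadata": {
        "speaker": "Deep Thought",
        "lang": "eng",
        "section": 2,
        "offset": 316,
    },
    "document_metadata": {"author": "Douglas Adams", "publicationyear": 1979},
    "document_id": "hitchhikers-guide",
}

display_metadata = combine_metadata(first_result)
```

Here `display_metadata` holds `author`, `publicationyear`, `speaker`, `lang`, `section`, `offset`, and `document_id` in a single dictionary.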

## Filtering

You can also use the `document`- and `section`-level metadata to filter in a
search operation. For more information on how to apply filter expressions at
either the document or section/part level, please see the
You can also use the `document`- and `section`-level metadata to filter search
results. For more information on how to apply filter expressions at
either the document or section/part level, please see the
[filter expression](/docs/learn/metadata-search-filtering/filter-overview) documentation.
2 changes: 1 addition & 1 deletion www/docs/learn/recommendation-systems/overview.md
@@ -46,7 +46,7 @@ are similar to the one they're looking at or a recently purchased product. These
use cases can be dealt with by using <Config v="names.product"/> in a
document-to-document search/recommendation platform. In order to do this, the
most important change is that you'll need to use the `RESPONSE` similarity measure
(available to [our Scale plan users](https://vectara.com/pricing/)).
(available to [our Pro and Enterprise plan users](https://vectara.com/pricing/)).
It's easier to explain how this is different by first explaining how the `DEFAULT`
similarity works.

13 changes: 13 additions & 0 deletions www/docs/migration-guide-api-v2.md
@@ -123,6 +123,19 @@ In addition to the new Corpus Key:
requests.
* Remove the `textless` and `encrypted` fields from your requests.

## Metadata type conversions

Metadata remains unconverted during the document upload process, even when
using API v2. Type conversion applies only to the `part_metadata` and
`document_metadata` fields in query responses, where numbers return as numbers,
booleans return as booleans, and JSON objects retain their native structure.
This behavior differs from API v1, where metadata such as `section` or
`publicationyear` might have been returned as strings. For more details, see
[Reading Metadata](/docs/api-reference/search-apis/interpreting-responses/metadata).

**Action item:**

Ensure client applications handle these types correctly for smooth integration.


## Terminology, parameter, and property name changes

* API v1 uses `num_results` for specifying the maximum number of results