VE-3102: DMH vector search: AION updates #590

Merged: mgiasiVeri merged 6 commits into master from dev/VE-3102 on Jul 9, 2024
@ndthang15 ndthang15 commented Jun 24, 2024

Overview

AION needs to be updated with additional properties that could be used by GLC and DMH vector search.

Description

  1. Added referenceId property to objectResult and seriesItem definitions.
  2. Added a new top-level embedding array to the AION schema; each item has vector, tags, and referenceId properties.
  3. Added tags property to the top level of objectResult.
{
    "series": [{
        "start": 0,
        "stop": 1000,
        "referenceId": "ABC" // this is a new property added to the seriesItem definition.
    },
    {
        "start": 2000,
        "stop": 3000,
        "referenceId": "ABC"
    }],
    "embedding": [{  // "embedding" is at the same level as "object" and "series" in the AION schema.
        "vector": [0.0, 0.1],
        "tags": [],
        "referenceId": "ABC"
    },
    {
        "vector": [2.0, 2.1],
        "tags": [],
        "referenceId": "ABC"
    }]
}
  4. Updated the AION schema to match the GLC prototype.
{
    "sourceEngineId": "AAAAAAAA-AAAA-AAAA-AAAA-AAAAAAAAAAAA", // existing property in the PREAMBLE definition.
    "internalTaskId": "eb20f1b3-20b4-471d-8dd9-76e077969af4", // this is a new property added to the PREAMBLE definition.
    "generatedDateUTC": "2024-05-15T15:27:33.848266649Z",
    "validationContracts": [
        "text"
    ],
	"object" : 
	[
		{
			"fingerprintVector" : [ // this is a new property added to the objectResult definition.
				// ...
			],
			"label" : "person",
			// this is used to link series items to a vector
			"referenceId" : "f46f029a-d42c-411c-ad01-8cffe3f87a63", // this is a new property added to the objectResult definition. 
			"tags" :       //  this is a new property added to the objectResult definition.
			[
				{
					"key" : "Accessory",
					"value" : "BagAny"
				}				
			],
			"type" : "fingerprint", //  existing property in the objectResult definition. 
			"vendor" : 
			{
				"label" : "fe_person_reid_512_v17_0_m34"
			}
		}
	],
	"series" : 
	[
		{
			"object" : 
			{
				"boundingPoly" : [
					// ...
				],
				"label" : "",
				"referenceId" : "f46f029a-d42c-411c-ad01-8cffe3f87a63", // this is a new property added to the objectResult definition.
				"type" : "object"
			},
			"startTimeMs" : 0,
			"stopTimeMs" : 0
		},
		{
			"object" : 
			{
				"boundingPoly" : [
					// ...
				],
				"label" : "",				
				"referenceId" : "f46f029a-d42c-411c-ad01-8cffe3f87a63", // this is a new property added to the objectResult definition.
				"type" : "object"
			},
			"startTimeMs" : 40,
			"stopTimeMs" : 40
		}				
	]
}
  5. Added valid and invalid test cases for the AION schema.
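
As a rough illustration of how a consumer might use the new referenceId property, the sketch below groups series items by the referenceId of their nested object and attaches the resulting time ranges to the matching top-level object entry. It is written in Python purely for illustration. The helper name `link_series_to_objects`, the trimmed payload, and the vector values are assumptions, not part of the schema package.

```python
from collections import defaultdict


def link_series_to_objects(aion: dict) -> dict:
    """Map each top-level object's referenceId to the time ranges of the
    series items that point at it (hypothetical helper, not part of AION)."""
    ranges_by_ref = defaultdict(list)
    for item in aion.get("series", []):
        ref = item.get("object", {}).get("referenceId")
        if ref is not None:
            ranges_by_ref[ref].append((item["startTimeMs"], item["stopTimeMs"]))

    linked = {}
    for obj in aion.get("object", []):
        ref = obj.get("referenceId")
        if ref is not None:
            linked[ref] = {
                "label": obj.get("label"),
                "tags": obj.get("tags", []),
                "ranges": ranges_by_ref.get(ref, []),
            }
    return linked


# Trimmed version of the GLC-style example (vector values are made up).
REF = "f46f029a-d42c-411c-ad01-8cffe3f87a63"
sample = {
    "object": [
        {
            "type": "fingerprint",
            "label": "person",
            "referenceId": REF,
            "fingerprintVector": [0.12, -0.5, 0.33],
            "tags": [{"key": "Accessory", "value": "BagAny"}],
        }
    ],
    "series": [
        {"startTimeMs": 0, "stopTimeMs": 0,
         "object": {"type": "object", "label": "", "referenceId": REF}},
        {"startTimeMs": 40, "stopTimeMs": 40,
         "object": {"type": "object", "label": "", "referenceId": REF}},
    ],
}

linked = link_series_to_objects(sample)
print(linked[REF]["ranges"])
```

Keeping the vector on the top-level object and only the referenceId on each series item means the vector is stored once, however many series items refer to it.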

Related Issue

https://veritone.atlassian.net/browse/VE-3102

How Has This Been Tested

  1. Run npm i && npm run test in the /packages/veritone-json-schemas folder.
  2. Ensure that all tests pass.

@ndthang15 ndthang15 requested a review from alex-oleksiiuk June 24, 2024 10:00

@orca-security-us orca-security-us bot left a comment

Orca Security Scan Summary

Status   Check                    Issues by priority
Passed   Infrastructure as Code   high 0, medium 0, low 0, info 0
Passed   Secrets                  high 0, medium 0, low 0, info 0
Passed   Vulnerabilities          high 0, medium 0, low 0, info 0

@mgiasiVeri

@ndthang15 we also need to support GLC's new AION format which is similar but introduces a few more fields like fingerprintVector. The ticket links to an example in the search draft doc, but the example is collapsed so you have to expand it.

Open up https://veritone.atlassian.net/wiki/spaces/ENG/pages/3330900143/Vector+Search+Draft#AION-updates and search for "GLC prototype" to expand and see it. Note that the "object" object does not need start and stop times. Here is another view of the AION changes GLC needs: https://veritone.atlassian.net/wiki/spaces/VP/pages/2859532299/Next+generation+tracker+AION+format#Accepted-Solution

@ndthang15

@mgiasiVeri I have updated the schema today to support GLC's new AION format based on your information, and this PR is ready for review again. Thank you for reviewing!

mgiasiVeri commented Jun 25, 2024

@ndthang15 I think we need "tags" added at the objectResult level too. If you look at the GLC example, they use tags in the top-level object, which is outside of the seriesItem. Let me know if you have any questions. And to be clear, we do not need the start and stop time values added at the objectResult level; that is a mistake in the examples provided.

@ndthang15

@mgiasiVeri You're right that tags was missing from objectResult, so I've added it in the latest commit.
The start and stop time values at the objectResult level were never added to the schema. They appeared only because of a mistake I made when copying the sample from the document into the PR description.
Thank you for your notes!
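
To illustrate what tags at the objectResult level make possible, here is a small sketch that filters top-level object entries by a key/value tag. The function name, the sample referenceId values, and the idea of filtering on tags this way are assumptions for illustration only; the schema itself only defines the shape of the data.

```python
def objects_with_tag(aion: dict, key: str, value: str) -> list:
    """Return the referenceIds of top-level objects carrying the given tag
    (hypothetical helper, not part of the schema package)."""
    matches = []
    for obj in aion.get("object", []):
        for tag in obj.get("tags", []):
            if tag.get("key") == key and tag.get("value") == value:
                matches.append(obj.get("referenceId"))
                break  # one matching tag is enough for this object
    return matches


# Made-up payload: only the first object carries the tag from the GLC example.
payload = {
    "object": [
        {"type": "fingerprint", "referenceId": "ref-1",
         "tags": [{"key": "Accessory", "value": "BagAny"}]},
        {"type": "fingerprint", "referenceId": "ref-2", "tags": []},
    ]
}

print(objects_with_tag(payload, "Accessory", "BagAny"))
```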

crondonveritone commented Jul 1, 2024

Based on the GLC prototype (https://veritone.atlassian.net/wiki/spaces/ENG/pages/3330900143/Vector+Search+Draft#GLC-prototype%3A), I see that fingerprintVector has the properties label, referenceId, and type, but in the example these properties are outside fingerprintVector (at the #/definitions/objectResult level). If the second case is correct, then that part LGTM:

@@ -767,6 +785,30 @@
"vendor": {
"description": "Custom data that doesn't conform to any other field. You can add any arbitrary data inside this object, but it will not be indexed, searchable, or have any impact on the system. However it will be returned when reading the data back out.",
"type": "object"
},
"fingerprintVector": {

@crondonveritone crondonveritone Jul 1, 2024

Shouldn't we define the type and range of the items in the fingerprintVector array? For example:

    "items": {
      "type": "number",
      "minimum": 0,
      "maximum": 1
    }

@ndthang15 ndthang15 Jul 2, 2024

I defined the item type for the fingerprintVector array in e534527. I don't think we need to define a range for it; the kNN example in the Elasticsearch documentation, for instance, uses values well outside 0 to 1:

curl -X POST "localhost:9200/image-index/_bulk?refresh=true&pretty" -H 'Content-Type: application/json' -d'
{ "index": { "_id": "1" } }
{ "image-vector": [1, 5, -20], "title-vector": [12, 50, -10, 0, 1], "title": "moose family", "file-type": "jpg" }
{ "index": { "_id": "2" } }
{ "image-vector": [42, 8, -15], "title-vector": [25, 1, 4, -12, 2], "title": "alpine lake", "file-type": "png" }
{ "index": { "_id": "3" } }
{ "image-vector": [15, 11, 23], "title-vector": [1, 5, 25, 50, 20], "title": "full moon", "file-type": "jpg" }
...
'

Could you please review again? Thank you!
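
A minimal sketch of the constraint discussed in this thread: every item must be a number, with no minimum or maximum enforced. It mirrors the intent of an items constraint of type number in plain Python and is not the schema package's validator; the function name is hypothetical.

```python
from numbers import Real


def is_number_array(value) -> bool:
    """True if value is a list whose items are all real numbers.

    Booleans are rejected explicitly because Python treats bool as an int
    subclass, while JSON Schema's "number" type does not accept booleans.
    No range is enforced, so values outside [0, 1] are accepted.
    """
    if not isinstance(value, list):
        return False
    return all(isinstance(x, Real) and not isinstance(x, bool) for x in value)


print(is_number_array([1, 5, -20]))          # vector from the kNN example
print(is_number_array([0.0, 0.1]))           # vector from the PR description
print(is_number_array([0.1, "0.2"]))         # a string item is rejected
print(is_number_array({"label": "person"}))  # an object is rejected
```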

@mgiasiVeri

@crondonveritone I think that was a typo. label, referenceId, and type should all be outside of fingerprintVector, and fingerprintVector should only be an array of numbers. @frankayars @alex-oleksiiuk could you confirm?

@crondonveritone crondonveritone left a comment

LGTM

@frankayars frankayars left a comment

Suggested two minor description changes.

@@ -767,6 +785,33 @@
"vendor": {
"description": "Custom data that doesn't conform to any other field. You can add any arbitrary data inside this object, but it will not be indexed, searchable, or have any impact on the system. However it will be returned when reading the data back out.",
"type": "object"
},
"fingerprintVector": {
"description": "An array of vectors related to the vectorized data",

Suggestion:
An array of floats representing objects in vector space.

}
},
"embedding": {
"description": "The embedding engine result was generated",

Suggestion:
An array of floats representing objects in vector space.

@ndthang15 ndthang15 requested a review from frankayars July 4, 2024 02:38

ndthang15 commented Jul 4, 2024

@frankayars Thanks for your suggestions. I've updated the descriptions of these definitions and renamed the fingerprintVector definition to vector. Could you please review again?

@frankayars frankayars left a comment

👍

@mgiasiVeri mgiasiVeri merged commit d44114b into master Jul 9, 2024
5 checks passed
@mgiasiVeri mgiasiVeri deleted the dev/VE-3102 branch July 9, 2024 20:55