Merge branch 'master' into monorail-to-buganizer
Showing 1 changed file with 88 additions and 0 deletions.
gcp/appengine/blog/content/posts/announcing-data-quality-initiatives/index.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
--- | ||
title: "OSV's approach to data quality" | ||
date: 2023-09-30T09:00:00Z | ||
draft: false | ||
author: Andrew Pollock and Charl de Nysschen | ||
--- | ||
OSV's mission is to enable developers to reduce security risk arising from known | ||
vulnerabilities in open source components they use. | ||
|
||
Part of the strategy to accomplish that mission is to provide a comprehensive, | ||
accurate and timely database of known vulnerabilities covering both language | ||
ecosystems and OS package distributions. | ||
|
||
Today, OSV.dev's coverage is fast approaching 30 ecosystems, while also | ||
importing records from almost as many disparate "[home databases](https://ossf.github.io/osv-schema/#id-modified-fields)". | ||
As this number of federated data sources continues to grow, so does the prospect | ||
of OSV records being expressed in ways that are detrimental to them being | ||
effectively utilized in aggregate. | ||
|
||
To ensure the accuracy and usability of OSV.dev's data at scale, we have | ||
initiated a program of work to prevent future regressions in data quality as the | ||
ecosystem of data contributions continues to grow. | ||
<!--more--> | ||
|
||
In our | ||
[experience](https://www.first.org/conference/vulncon2024/program#pThe-Trials-and-Tribulations-of-Bulk-Converting-CVEs-to-OSV) | ||
[interacting with the CVE Program and broader | ||
ecosystem](https://osv.dev/blog/posts/introducing-broad-c-c++-support/), we've | ||
found that the term "data quality" means different things to different people. | ||
|
||
For OSV.dev, the primary objective is to enable awareness and remediation of | ||
known vulnerabilities in open source components. To this end, "data quality" | ||
means being able to reason about and act upon vulnerability records at scale. | ||
This is why the OSV format was designed to enable machine-readability as its | ||
primary use case. In order to programmatically reason about OSV records at | ||
scale, a degree of consistent use of fields beyond what can be validated using | ||
JSON Schema validation alone is necessary. | ||
|
||
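As an illustration of a property that JSON Schema validation alone cannot express, consider a cross-field consistency check. The sketch below is a hypothetical example, not a rule taken from the OSV linter: it assumes a record should never carry a `modified` timestamp earlier than its `published` timestamp. The function names `parse_rfc3339` and `check_timestamps` are invented for this example.

```python
from datetime import datetime


def parse_rfc3339(ts: str) -> datetime:
    # OSV timestamps are RFC 3339 UTC strings ending in "Z".
    # Replacing "Z" keeps this working on Python versions before 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def check_timestamps(record: dict) -> list[str]:
    """Hypothetical check: 'modified' should not precede 'published'."""
    findings = []
    published = record.get("published")
    modified = record.get("modified")
    # "published" is optional in OSV, so only compare when both are present.
    if published and modified:
        if parse_rfc3339(modified) < parse_rfc3339(published):
            record_id = record.get("id", "<no id>")
            findings.append(f"{record_id}: modified precedes published")
    return findings
```

A schema can confirm that both fields are well-formed timestamps, but comparing one field against another requires logic of this kind.
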
Problems that the OSV Data Quality Program seeks to address include: | ||
|
||
- No way for record providers to know there are problems with records they have already | ||
published | ||
- OSV.dev accepts non-schema-compliant records | ||
- OSV.dev accepts records with other validity issues (such as invalid package names or | ||
non-existent package versions) | ||
- No turnkey way for an OSV record provider to catch data | ||
quality problems earlier in the record publication lifecycle | ||
- No best practice tooling for OSV records to be created by a new OSV record provider | ||
- [Downstream data consumers often mistake OSV.dev for the originator of the data | ||
and provide feedback about it to us, rather than to the record's originator](https://google.github.io/osv.dev/faq/#ive-found-something-wrong-with-the-data) | ||
- Git repository owners may not be following best-practice release processes (such as | ||
not using tags, or using unusual tag naming conventions), confounding | ||
OSV.dev's ability to resolve fix commits for fix versions. This problem isn't known | ||
until the first time a vulnerability referencing the repository is published | ||
|
||
We have published our current opinion on the [Properties of a High Quality OSV | ||
Record](https://google.github.io/osv.dev/data_quality.html), which goes above | ||
and beyond JSON Schema compliance, and are working on an open source [OSV record | ||
linting tool](https://github.com/ossf/osv-schema/tree/main/tools/osv-linter) to | ||
programmatically validate records against these properties. | ||
|
||
Thereafter, we will begin gating record imports, accepting only records that meet | ||
the quality requirements. | ||
|
||
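To make the idea of gating concrete, here is a minimal sketch of how an importer could partition incoming records by check findings. It is an assumption-laden illustration, not OSV.dev's actual import pipeline: `gate_imports` and `has_required_fields` are hypothetical names, and the single check shown (presence of the schema-required `id` and `modified` fields) stands in for whatever quality requirements are eventually enforced.

```python
from typing import Callable, Iterable

# A check takes a record and returns a list of human-readable findings.
Check = Callable[[dict], list]


def has_required_fields(record: dict) -> list:
    # Placeholder check: the OSV schema requires "id" and "modified".
    return [
        f"missing required field: {name}"
        for name in ("id", "modified")
        if name not in record
    ]


def gate_imports(records: Iterable[dict], checks: Iterable[Check]):
    """Split records into accepted ones and (record, findings) rejections."""
    checks = list(checks)  # allow the checks to be reused for every record
    accepted, rejected = [], []
    for record in records:
        findings = [f for check in checks for f in check(record)]
        if findings:
            rejected.append((record, findings))
        else:
            accepted.append(record)
    return accepted, rejected
```

Keeping the findings alongside each rejected record is what would let a record provider review why a record was not imported.
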
So that operators of the home databases OSV.dev imports from can reason about | ||
the acceptability of the records they publish, they will be able | ||
to: | ||
|
||
- run the OSV linter against their records as part of their publication | ||
workflow | ||
- review OSV.dev's import findings about their records | ||
|
||
You can follow our [progress on this journey on | ||
GitHub](https://github.com/orgs/google/projects/62). Input and contributions | ||
are, as always, appreciated. | ||
|
||
If you're responsible for an existing home database that OSV.dev imports records | ||
from, we will contact you directly before there are any changes to the record | ||
import process that may impact you. You can also consider proactively running | ||
our OSV record linter on your existing records to see how they rate. | ||
|
||
If you'd like to experiment with or help expand the capabilities of the OSV | ||
record linter, it's [currently residing in the OpenSSF OSV Schema GitHub | ||
repository](https://github.com/ossf/osv-schema/tree/main/tools/osv-linter). | ||
|
||
If you're an end consumer of OSV.dev's data, we hope that this blog post encourages you | ||
to continue to have confidence in the capabilities enabled by that data into the | ||
future. |