Transparency in Coverage JSON format presents significant challenges #773

Open
demit733 opened this issue Feb 4, 2025 · 0 comments
demit733 commented Feb 4, 2025

The current Transparency in Coverage JSON format is difficult to work with when files run to multiple gigabytes. JSON is flexible and human-readable, but it is inefficient for very large volumes of structured data. Its nested, repetitive structure means that standard parsers typically need to hold large parts of a document in memory, and the resulting memory usage can quickly become cost-prohibitive. Processing is also slow, because extensive parsing, transformation, and optimization steps are needed before any meaningful analysis can be performed. These limitations make a single large JSON document an impractical choice for publishing and processing Transparency in Coverage data at scale.

An alternative would be to mandate NDJSON (Newline-Delimited JSON), a streaming-friendly variant of JSON. Each line is a self-contained JSON object, so a file can be processed line by line instead of being loaded in its entirety. This is more memory-efficient than regular JSON when handling massive datasets, and the files can be easily processed using Unix tools (grep, awk), Python (pandas, jsonlines), or databases.
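
To illustrate the line-by-line processing described above, here is a minimal sketch using only the Python standard library. The helper names (`iter_ndjson`, `count_by_key`) and the flattened sample records are assumptions made up for this example; they are not part of the Transparency in Coverage schema, which is more deeply nested.

```python
import io
import json
from typing import Iterator, TextIO


def iter_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of an NDJSON stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


def count_by_key(stream: TextIO, key: str) -> dict:
    """Count records per value of `key`, holding one record in memory at a time."""
    counts = {}
    for record in iter_ndjson(stream):
        value = record.get(key)
        counts[value] = counts.get(value, 0) + 1
    return counts


# Hypothetical, flattened sample records (illustrative only).
sample = io.StringIO(
    '{"billing_code": "99213", "negotiated_rate": 75.0}\n'
    '{"billing_code": "99214", "negotiated_rate": 110.5}\n'
    '\n'
    '{"billing_code": "99213", "negotiated_rate": 80.0}\n'
)
counts = count_by_key(sample, "billing_code")
```

With a real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf-8")`. Memory use then depends on the size of the largest single line rather than on the size of the whole file.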
