GitHub LFS use and removing large, intermediate TSVs #45

Open
tlongers opened this issue Dec 4, 2024 · 0 comments

tlongers commented Dec 4, 2024

This project generates a lot of large TSV files:

  • during the data extraction process, full copies of the data are generated as it passes through different states (e.g. see any of the output folders).
  • the final dataset is large.

These files trigger the use of GitHub's LFS system, which may at some point become a financial cost for the project. Whilst it's useful as an assurance and debugging measure to generate lots of TSVs at different points in processing, they're unnecessary to keep long term. In all cases, these files can be regenerated from their inputs anyhow.

Some work on the sub-projects to prune these intermediate outputs from the repo history would be a useful investment in the event of a change in LFS storage costs.
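
As a possible starting point for that pruning work, here is a rough sketch of a script that lists the TSV files in a checkout at or above a size threshold, largest first, so the worst offenders can be reviewed. The function name `find_large_tsvs` and the 50 MiB default are placeholders I've made up, not anything that exists in the project. It also assumes the LFS files have actually been pulled; if they haven't, the working tree holds only small pointer files and nothing will be reported.

```python
from pathlib import Path


def find_large_tsvs(root, min_bytes=50 * 1024 * 1024):
    """Return (size_in_bytes, path) pairs for .tsv files under root
    that are at least min_bytes in size, largest first.

    Both the name and the default threshold are hypothetical.
    """
    root = Path(root)
    hits = []
    for path in root.rglob("*.tsv"):
        # Skip anything inside the .git directory itself.
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size >= min_bytes:
                hits.append((size, path))
    return sorted(hits, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    # Report candidates in the current checkout.
    for size, path in find_large_tsvs("."):
        print(f"{size / (1024 * 1024):8.1f} MiB  {path}")
```

This only looks at the current working tree. Files that were committed earlier and later deleted still count towards LFS storage, so actually reclaiming space would need a separate history-rewriting step.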
