DEV: Mirror freely licensed arXiv documents locally #2904

Open
stefan6419846 opened this issue Oct 15, 2024 · 3 comments
Labels
nf-ci Non-functional change: Continuous Integration nf-testing Non-functional change: Testing

Comments

@stefan6419846
Collaborator

We are currently experiencing regular issues with arXiv documents not being available for the Windows CI due to rate limits. At the same time, most of these documents are available under permissive licenses, which would allow us to keep our own repository of them. We could clone that repository for CI, reducing the load on arXiv and on GitHub downloads. I am open to creating this on my personal account for the time being.

List of licenses: https://info.arxiv.org/help/license/index.html For our repository, only https://arxiv.org/licenses/nonexclusive-distrib/1.0/license.html would be problematic due to not granting us any rights at all.
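
As a rough sketch of how such a license check could look in a mirroring script: the helper name `may_mirror` is hypothetical and not part of pypdf. It assumes that only the arXiv non-exclusive license mentioned above blocks mirroring, that Creative Commons license URLs are acceptable, and that anything unrecognized is rejected to stay on the safe side.

```python
from urllib.parse import urlparse

# Assumption: this is the only license flagged above as problematic for a mirror.
BLOCKED_PATH_PREFIX = "/licenses/nonexclusive-distrib/"


def may_mirror(license_url: str) -> bool:
    """Return True if a document under this license URL may go into the mirror."""
    parsed = urlparse(license_url)
    host = parsed.netloc.lower().removeprefix("www.")  # requires Python 3.9+
    if host == "arxiv.org" and parsed.path.startswith(BLOCKED_PATH_PREFIX):
        return False
    # Assumption: Creative Commons licenses are fine; unknown hosts are rejected.
    return host == "creativecommons.org"
```

Rejecting unknown licenses by default means a new license type would have to be reviewed by hand before anything gets copied.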

@stefan6419846 stefan6419846 added nf-testing Non-functional change: Testing nf-ci Non-functional change: Continuous Integration labels Oct 15, 2024
@stefan6419846
Collaborator Author

Just did some quick verification:

  • 6x CC-BY-4.0
  • 1x CC-BY-SA-4.0
  • 1x CC-BY-NC-ND-4.0
  • 6x arXiv-only license (1707.09725, 2005.05909, 2201.00151, 2201.00178, 2201.00200, 2201.00201)

For the arXiv-only files, we might need to have a look at their usages. Maybe it is possible to replace them with more liberally licensed ones without too many side effects.
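
To find the usages of the arXiv-only files, a small helper along these lines might be enough. This is only a sketch: the name `is_arxiv_only` and the regular expression are my own assumptions, and the ID set is simply the list from above.

```python
import re

# arXiv IDs listed above as being under the arXiv-only license.
ARXIV_ONLY_IDS = frozenset(
    {"1707.09725", "2005.05909", "2201.00151", "2201.00178", "2201.00200", "2201.00201"}
)

# New-style arXiv identifier (YYMM.NNNNN) with an optional version suffix such as "v2".
_ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(?:v\d+)?")


def is_arxiv_only(url: str) -> bool:
    """Return True if the URL points at one of the arXiv-only documents listed above."""
    match = _ARXIV_ID.search(url)
    return match is not None and match.group(1) in ARXIV_ONLY_IDS
```

Running this over the URLs used in the test suite would show which tests depend on a document that cannot be mirrored.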

@pubpub-zz
Collaborator

pubpub-zz commented Oct 15, 2024

My opinion:
arxiv.org is available on web.archive.org:
https://web.archive.org/web/20241009013003/https://arxiv.org/

arXiv is a free distribution service and an open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Materials on this site are not peer-reviewed by arXiv.

If we put a copy of these files on GitHub, isn't this just making the documents available as another copy, as web.archive.org would have done?

We are just using these documents as support for tests: they are not embedded within pypdf, so we are not infringing any licenses.

@stefan6419846
Collaborator Author

The Wayback Machine tends to be the victim of DDoS attacks regularly as well, and their traffic is more important for other use cases in my opinion. This does not help with the current number of regularly failing CI pipelines caused by arXiv rate limits (which I hit even from my local device, without having downloaded anything from them in the hours before). My goal is to stabilize the tests again, and mirroring freely licensed arXiv documents seems like a good idea, as arXiv downloads are responsible for most of the failures at the moment anyway.

> If we put a copy of these files on GitHub, isn't this just making the documents available as another copy, as web.archive.org would have done?

Nearly everything has a copyright, which has to be considered. As long as we are just downloading the data on the fly and running pypdf on it without persisting protected parts in a publicly accessible way, I do not see any issues (although IANAL). Storing our own public copies instead requires us to respect the original copyright and thus is more restrictive - we are not the Internet Archive.

> We are just using these documents as support for tests: they are not embedded within pypdf, so we are not infringing any licenses.

This issue is specifically talking about creating our own hosted copies of them. In these cases I would like to avoid any licensing issues which could have negative impacts on the maintainers.
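
The split described above (mirrored copies for freely licensed documents, on-the-fly downloads for everything else) could be sketched roughly as follows. The function `load_sample`, the `mirror_dir` layout and the injected `download` callable are all hypothetical and not existing pypdf test helpers.

```python
from pathlib import Path
from typing import Callable


def load_sample(name: str, mirror_dir: Path, download: Callable[[], bytes]) -> bytes:
    """Return the sample's bytes, preferring the local mirror over the network."""
    mirrored = mirror_dir / name
    if mirrored.is_file():
        return mirrored.read_bytes()
    # Not mirrored (for example an arXiv-only file): download on the fly
    # and return the bytes without writing them into the mirror.
    return download()
```

Passing the downloader in as a callable keeps the lookup testable without network access, and the fallback never persists a restricted file in the mirror.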

@py-pdf py-pdf deleted a comment Oct 25, 2024
2 participants