DEV: Mirror freely licensed arXiv documents locally #2904
Comments
Just did some quick verification:
For the arXiv-only files, we might need to have a look at their usages. Maybe it is possible to replace them with more liberally licensed ones without too many side effects.
My opinion:
If we put a copy of these files on GitHub, isn't that just making the documents available as another copy, as web.archive.org would have done? We only use these documents as test inputs; they are not embedded within pypdf, so I do not think we would be infringing any licenses.
The Wayback Machine tends to be the victim of DDoS attacks regularly as well, and their traffic is more important for other use cases in my opinion. Relying on it does not help with the current number of CI pipelines that regularly fail because arXiv runs into rate limits (even from my local device, without having downloaded anything from them in the hours before). My goal is to stabilize the tests again, and mirroring freely licensed documents from arXiv seems like a good idea, as they are responsible for most of the failures at the moment anyway.
Nearly everything has a copyright, which has to be considered. As long as we just download the data on the fly and run pypdf on it without persisting protected parts in a publicly accessible way, I do not see any issues (although IANAL). Storing our own public copies instead requires us to respect the original copyright and is thus more restrictive; we are not the Internet Archive.
This issue is specifically about creating our own hosted copies of them. In these cases I would like to avoid any licensing issues that could have a negative impact on the maintainers.
We are currently experiencing regular issues with arXiv documents not being available for the Windows CI due to rate limits. At the same time, most of these documents are available under permissive licenses, which would allow us to keep our own repository of them that we could clone for CI while reducing the load on arXiv and on GitHub downloads. I am open to creating this repository on my personal account for the time being.
List of licenses: https://info.arxiv.org/help/license/index.html
For our repository, only https://arxiv.org/licenses/nonexclusive-distrib/1.0/license.html would be problematic, as it does not grant us any rights at all.
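A minimal sketch of how the license check could be automated when deciding which files to mirror. This is not existing pypdf tooling: the function names and the sample records are hypothetical, and it assumes we already know each document's license URL. Following the reasoning above, it only rejects the arXiv non-exclusive distribution license and documents with an unknown license.

```python
from urllib.parse import urlparse

# The one arXiv license identified above as granting us no redistribution rights.
NON_MIRRORABLE = "arxiv.org/licenses/nonexclusive-distrib/1.0"


def normalize(license_url: str) -> str:
    """Drop the scheme and trailing slash so http/https variants compare equal."""
    parsed = urlparse(license_url.strip())
    return (parsed.netloc.lower() + parsed.path).rstrip("/")


def may_mirror(license_url: str) -> bool:
    """Return True if a document under this license could go into our own mirror."""
    if not license_url:
        # Unknown license: stay conservative and do not mirror.
        return False
    return not normalize(license_url).startswith(NON_MIRRORABLE)


# Hypothetical sample records (file name -> license URL), for illustration only.
samples = {
    "sample-a.pdf": "http://creativecommons.org/licenses/by/4.0/",
    "sample-b.pdf": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "sample-c.pdf": "",
}

mirrorable = sorted(name for name, url in samples.items() if may_mirror(url))
print(mirrorable)
```

The files rejected by such a check are the "arXiv-only" ones that would still need to be downloaded on the fly or replaced with more liberally licensed documents.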