
Adding a feature to also download zip files with the script? #3

Open
Xx-Ylzakros-xX opened this issue Aug 5, 2022 · 5 comments

@Xx-Ylzakros-xX

Hi,

thanks for your great work making it easy to download PDF files from the archive.

One issue remains in your scripts: if an issue is provided as a zip folder, it can't be downloaded in the same automatic process.

Is it possible for me to change this myself in an easy way, or could you publish another script that does this?

It would be great.

@AlexanderMelde
Owner

Do you have an example of the magazine issues where this is happening, and can you tell me how exactly it fails, please?
If they are available from the same website URL, maybe you just need to rename .pdf to .zip after download.
If you want to change it in the code, this line is where you can start: https://github.com/AlexanderMelde/dl_for_heise/blob/master/download.sh#L107
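Picking the file extension from the Content-Type header the server reports could be sketched like this (a hypothetical helper, not part of download.sh; the name `ext_for_content_type` is made up for illustration):

```shell
# Hypothetical helper (not in download.sh): map the Content-Type header
# reported by the server to a file extension for the downloaded file.
ext_for_content_type() {
    case "$1" in
        application/pdf*) echo "pdf" ;;
        application/zip*) echo "zip" ;;
        *)                echo "bin" ;;
    esac
}

ext_for_content_type "application/zip; charset=ISO-8859-1"   # prints "zip"
```

The glob patterns deliberately match any charset suffix, so `application/zip; charset=ISO-8859-1` and a plain `application/zip` both map to the same extension.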

@Xx-Ylzakros-xX
Author

Xx-Ylzakros-xX commented Aug 5, 2022

Hi,

after I execute the download_articles.sh script as:

sudo ./download_articles.sh ct 1990

it gives the following output:

Logging in...

####################100,0%

[ct][1990/01][TRY 1/3] Server did not serve a valid pdf (instead application/zip; charset=ISO-8859-1).
Waiting for retry... 7/80s - 8.7% [=== ...

I know it finds a zip file and can't process it, because there is no further loop to download and merge zip files.

So it retries after 80 seconds and hangs on the same issue.

I'm not a good programmer, so I would like to ask if you could add this to your script, or publish it as a new one, e.g. download_articles_ZIP.sh.

I tried to change the .pdf parts of the download_articles.sh script where it seemed to make sense, but I think it's not that easy to fix.

@AlexanderMelde
Owner

I see, thank you for providing the error message.

Do you need to use download_articles.sh, or will the download.sh script work for you as well?
The download_articles.sh script contains more PDF-specific code and cannot be adapted to zip files as easily as download.sh.

If you use download.sh: I think changing the following lines will work as a quick hack/workaround.
Can you try editing lines 104-107 of download.sh to the following?

                    elif [ "${content_type}" = 'application/zip; charset=ISO-8859-1' ]; then
                        # If the header states this is a zip file, download it
                        echo "${logp} Downloading..."
                        actual_pdf_size=$(curl -# -b ${curl_session_file} -f -k -L --retry 99 "https://www.heise.de/select/${magazine}/archiv/${year}/${i}/download" -o "${file_base_path}.zip" --create-dirs -w '%{size_download}')

I only changed the ${content_type} value and the output file extension in the above snippet. I think it should work then.
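A quick sanity check after the download could also help here (a sketch, not part of download.sh; `is_zip` is a made-up name): a real zip archive starts with the magic bytes "PK", so peeking at the file header distinguishes a genuine zip from an HTML error page saved with a .zip extension.

```shell
# Hypothetical check (not in download.sh): a zip archive always begins
# with the two magic bytes "PK", so reading the first two bytes is enough
# to tell a real zip download apart from an error page saved as .zip.
is_zip() {
    [ "$(head -c 2 "$1" 2>/dev/null)" = "PK" ]
}
```

Usage, with the `file_base_path` and `logp` variables taken from download.sh: `is_zip "${file_base_path}.zip" || echo "${logp} Downloaded file is not a valid zip"`.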

@Xx-Ylzakros-xX
Author

Xx-Ylzakros-xX commented Aug 6, 2022

Hi,

thanks for your fast answer. I have tried changing lines 104-107 as you suggested; I'm fine with any script that works.

It still ends in an error. I think this occurs because the download.sh script downloads complete issue files, but the articles from 1990 - 2009 are zipped HTML files, twenty or more per month, much like the single PDF files handled by the download_articles.sh script.

The error message:

[ct][1990/01][SKIP] Magazine issue does not exist on the server, skipping.
[ct][1990/02][SKIP] Magazine issue does not exist on the server, skipping.
....

Hopefully you can find a fix for that. It would be great to have the possibility to download the complete, huge archive per magazine.

@gjaekel

gjaekel commented Nov 25, 2022

@AlexanderMelde: The URL path was wrong, @Xx-Ylzakros-xX: please use

                        elif [ "${content_type}" = 'application/zip; charset=ISO-8859-1' ]; then
                            # If the header states this is a zip file, download it
                            echo "${logp} Downloading ZIP ..."
                            actual_pdf_size=$(curl -# -b ${curl_session_file} -f -k -L --retry 99 "https://www.heise.de/select/${magazine}/archiv/${year}/${i}/seite-${a}/pdf" -o "${file_base_path}.zip" --create-dirs -w '%{size_download}')

But be aware that the downloaded content may just be a "mockup": I also started pulling at year 1990, and for issues 01 and 02 the HTML pages contain just the title, the abstract, and the statement

"Dieser Beitrag unterliegt dem Copyright der BYTE, McGraw-Hill, Inc. und kann daher hier nicht veröffentlicht werden." ("This article is subject to the copyright of BYTE, McGraw-Hill, Inc., and therefore cannot be published here.")

And of course, the code that tries to concatenate the downloaded PDF parts into a single volume will currently fail.
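The failing concat step could be made zip-aware with a small dispatch (a sketch under assumptions: `handle_part` is a made-up name, and the actual merge and extract commands would depend on the tools download.sh already uses):

```shell
# Hypothetical dispatch (not in download.sh): decide per downloaded part
# whether it should go to the existing PDF merge step or be extracted.
handle_part() {
    case "$1" in
        *.pdf) echo "merge" ;;    # hand over to the existing PDF concat code
        *.zip) echo "extract" ;;  # e.g. unzip into a per-issue folder instead
        *)     echo "skip" ;;     # unknown type: leave it alone
    esac
}
```

With something like this, zip parts would be routed around the PDF concatenation instead of crashing it.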
