
Adding a feature to also download zip files with the script? #3

Open
Xx-Ylzakros-xX opened this issue Aug 5, 2022 · 5 comments

@Xx-Ylzakros-xX

Hi,

thanks for your great work making it easy to download PDF files from the archive.

One issue remains in your scripts: if an issue is provided as a zip folder, it can't be downloaded in the same automatic process.

Is it possible for me to change this myself in an easy way, or could you publish another script that does this?

It would be great.

@AlexanderMelde
Owner

Do you have an example of the magazine issues where this is happening, and can you tell me how exactly it fails, please?
If they are available from the same website URL, maybe you just need to rename .pdf to .zip after download.
If you want to change it in the code, this line is where you can start: https://github.com/AlexanderMelde/dl_for_heise/blob/master/download.sh#L107
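Picking the file extension from the Content-Type header the server reports could be sketched like this (a hypothetical helper, not part of download.sh; the name `ext_for_content_type` is made up for illustration):

```shell
# Hypothetical helper (not in download.sh): map the Content-Type header
# reported by the server to a file extension for the downloaded file.
ext_for_content_type() {
    case "$1" in
        application/pdf*) echo "pdf" ;;
        application/zip*) echo "zip" ;;
        *)                echo "bin" ;;
    esac
}

ext_for_content_type "application/zip; charset=ISO-8859-1"   # prints "zip"
```

The glob patterns deliberately match any charset suffix, so `application/zip; charset=ISO-8859-1` and a plain `application/zip` both map to the same extension.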

@Xx-Ylzakros-xX
Author

Xx-Ylzakros-xX commented Aug 5, 2022

Hi,

after I execute the download_articles.sh script as:

sudo ./download_articles.sh ct 1990

it gives the following output:

Logging in...

####################100,0%

[ct][1990/01][TRY 1/3] Server did not serve a valid pdf (instead application/zip; charset=ISO-8859-1).
Waiting for retry... 7/80s - 8.7% [=== ...

I know it finds a zip file and can't process it, because there is no further loop to download and merge zip files.

So it retries after 80 seconds and hangs on the same issue.

I'm not a good programmer, so I would like to ask if you could add this to your script, or publish it as a new one, e.g. download_articles_ZIP.sh.

I tried to change the .pdf parts of the download_articles.sh script where it seemed to make sense, but I think it's not that easy to fix.

@AlexanderMelde
Owner

I see, thank you for providing the error message.

Do you need to use download_articles.sh, or will the download.sh script work for you as well?
The download_articles.sh script contains more PDF-specific code and cannot be adapted to zip files as easily as download.sh.

If you use download.sh: I think changing the following lines will work as a quick hack/workaround.
Can you try editing lines 104-107 of download.sh to the following?

                    elif [ "${content_type}" = 'application/zip; charset=ISO-8859-1' ]; then
                        # If the header states this is a zip file, download it
                        echo "${logp} Downloading..."
                        actual_pdf_size=$(curl -# -b ${curl_session_file} -f -k -L --retry 99 "https://www.heise.de/select/${magazine}/archiv/${year}/${i}/download" -o "${file_base_path}.zip" --create-dirs -w '%{size_download}')

I only changed the ${content_type} value and the output file extension in the above snippet. I think it should work then.
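A quick sanity check after the download could also help here (a sketch, not part of download.sh; `is_zip` is a made-up name): a real zip archive starts with the magic bytes "PK", so peeking at the file header distinguishes a genuine zip from an HTML error page saved with a .zip extension.

```shell
# Hypothetical check (not in download.sh): a zip archive always begins
# with the two magic bytes "PK", so reading the first two bytes is enough
# to tell a real zip download apart from an error page saved as .zip.
is_zip() {
    [ "$(head -c 2 "$1" 2>/dev/null)" = "PK" ]
}
```

Usage, with the `file_base_path` and `logp` variables taken from download.sh: `is_zip "${file_base_path}.zip" || echo "${logp} Downloaded file is not a valid zip"`.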

@Xx-Ylzakros-xX
Author

Xx-Ylzakros-xX commented Aug 6, 2022

Hi,

thanks for your fast answer. I have tried changing lines 104-107 as you suggested; I'm fine with any script that works.

It still ends in an error. I think this occurs because the download.sh script downloads complete issue files, but the articles from 1990 - 2009 are zipped HTML files, twenty or more per month, much like the single PDF files handled by the download_articles.sh script.

The error message:

[ct][1990/01][SKIP] Magazine issue does not exist on the server, skipping.
[ct][1990/02][SKIP] Magazine issue does not exist on the server, skipping.
....

Hopefully you can find a fix for that. It would be great to have the possibility to download the complete, huge archive per magazine.

@gjaekel

gjaekel commented Nov 25, 2022

@AlexanderMelde: The URL path was wrong, @Xx-Ylzakros-xX: please use

                        elif [ "${content_type}" = 'application/zip; charset=ISO-8859-1' ]; then
                            # If the header states this is a zip file, download it
                            echo "${logp} Downloading ZIP ..."
                            actual_pdf_size=$(curl -# -b ${curl_session_file} -f -k -L --retry 99 "https://www.heise.de/select/${magazine}/archiv/${year}/${i}/seite-${a}/pdf" -o "${file_base_path}.zip" --create-dirs -w '%{size_download}')

But be aware that the downloaded content may just be a "mockup": I also started pulling at year 1990, and for issues 01 and 02 the HTML pages contain just the title, the abstract, and the statement

"Dieser Beitrag unterliegt dem Copyright der BYTE, McGraw-Hill, Inc. und kann daher hier nicht veröffentlicht werden." ("This article is subject to the copyright of BYTE, McGraw-Hill, Inc., and therefore cannot be published here.")

And of course, the code that tries to concatenate the downloaded PDF parts into a single volume will currently fail.
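The failing concat step could be made zip-aware with a small dispatch (a sketch under assumptions: `handle_part` is a made-up name, and the actual merge and extract commands would depend on the tools download.sh already uses):

```shell
# Hypothetical dispatch (not in download.sh): decide per downloaded part
# whether it should go to the existing PDF merge step or be extracted.
handle_part() {
    case "$1" in
        *.pdf) echo "merge" ;;    # hand over to the existing PDF concat code
        *.zip) echo "extract" ;;  # e.g. unzip into a per-issue folder instead
        *)     echo "skip" ;;     # unknown type: leave it alone
    esac
}
```

With something like this, zip parts would be routed around the PDF concatenation instead of crashing it.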
