Idea: Compare destination folder filenames to remove "file exists" redundancy #22
Replies: 1 comment 3 replies
I think we'd either want to adjust the process_post definition to include this, or do it right before we begin loading futures with grouped_media_urls.items() within the download_media function definition. I will say that this slightly goes against the new check on file size, or at least would warrant storing file sizes alongside file names in a nested list or dictionary instead (i.e., first check whether the pending filename exists in the target directory and, if so, compare the size of the existing file with the incoming file; if the sizes are equal, skip the file). I can take a stab at a branch to adjust this, but I'm not that acquainted with the source code yet and figure you may find a way that's a bit simpler to implement.
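A rough sketch of the name-plus-size idea, in case it helps the discussion. The helper names `index_sizes` and `should_skip` are made up for illustration and are not taken from the scraper's source, and it assumes the incoming file size is already known before the download starts (for example from a response header), which may not match how the current size check works.

```python
import os


def index_sizes(dest_dir):
    """Map each file name directly inside dest_dir to its size in bytes.

    Hypothetical helper: returns an empty dict if the folder does not exist yet.
    """
    if not os.path.isdir(dest_dir):
        return {}
    with os.scandir(dest_dir) as entries:
        return {entry.name: entry.stat().st_size for entry in entries if entry.is_file()}


def should_skip(filename, incoming_size, sizes):
    """Skip only when the name already exists AND the sizes match.

    If incoming_size is unknown (None), the file is never skipped.
    """
    return filename in sizes and sizes[filename] == incoming_size
```

Building the dictionary once per destination folder means each pending file costs a single dictionary lookup instead of a separate filesystem check.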
As of 10/30 V0.7.1.1, the scraper only checks whether a file already exists at the point of download before continuing to the next file. For pages where you may be re-running the script, this can add a lot of extra runtime just to pick up the 3 or 4 new posts you're trying to catch up on.
IDEA
It might be worthwhile to `listdir` or `walk` the destination folder, compile the existing file names into a list, and compare that list against the pending filenames for download during the initial step of the scrape (i.e., `file for file in pending if file not in existing`). This should noticeably cut runtime on re-runs and avoid redundant futures, since only net-new files would be scraped and downloaded.
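Something along these lines is what I have in mind. This is only a sketch: `existing_filenames` and `filter_pending` are hypothetical names, not functions from the scraper, and it uses a set rather than a list so that each membership test stays cheap on large folders.

```python
import os


def existing_filenames(dest_dir):
    """Collect the names of all files under dest_dir, including subfolders.

    Hypothetical helper: os.walk yields nothing for a missing folder,
    so the result is simply an empty set in that case.
    """
    names = set()
    for _root, _dirs, files in os.walk(dest_dir):
        names.update(files)
    return names


def filter_pending(pending, existing):
    """Keep only the pending file names that are not already on disk."""
    return [file for file in pending if file not in existing]
```

Note that this compares bare file names only, so two different files with the same name in different subfolders would be treated as the same file.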