Idea: Compare destination folder filenames to remove "file exists" redundancy #22

01000011-shade · 2024-10-30T04:09:17Z

As of 10/30 V0.7.1.1, the scraper only checks whether a file already exists at the point of download, then continues to the next file. For pages where you are re-running the script, this can add a lot of runtime just to pick up the 3 or 4 new posts you're trying to catch up on.

IDEA
It might be worthwhile to use os.listdir or os.walk on the destination folder, compile the existing filenames into a list, and compare that list against the pending filenames during the initial step of the scrape (i.e., [file for file in pending if file not in existing]). This should noticeably improve performance on re-runs by avoiding redundant futures, so that only net-new files are scraped and downloaded.
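
A minimal sketch of that filtering step, assuming the pending items are plain filenames. The helper names `existing_filenames` and `filter_pending` are hypothetical and are not part of the scraper's code:

```python
import os


def existing_filenames(dest_folder):
    """Collect the names of all files under dest_folder, recursively."""
    names = set()  # a set keeps each membership test O(1)
    for _root, _dirs, files in os.walk(dest_folder):
        names.update(files)
    return names


def filter_pending(pending, dest_folder):
    """Return only the pending filenames not already present on disk."""
    existing = existing_filenames(dest_folder)
    return [name for name in pending if name not in existing]
```

Using a set rather than a list for the existing names keeps the comparison cheap even when the destination folder holds thousands of files.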

01000011-shade · 2024-10-30T04:32:40Z

Author

I think we'd either want to adjust the process_post definition to include this, or add the check right before we begin loading futures from grouped_media_urls.items() within the download_media function definition.

I will say that this slightly goes against the new check on file size, or at least would warrant an adjustment to store file sizes alongside filenames in a nested list or dictionary instead (i.e., first check whether the pending filename exists and, if so, compare the size of the existing file against the incoming file; if the filename exists in the target directory and the file sizes are equal, skip the file).
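
As a rough illustration of the dictionary variant, the sketch below maps each filename to its size and skips a file only when both name and size match. `index_sizes` and `should_skip` are hypothetical names, and the incoming size is assumed to be known before the download starts:

```python
import os


def index_sizes(dest_folder):
    """Map each filename under dest_folder to its size in bytes."""
    sizes = {}
    for root, _dirs, files in os.walk(dest_folder):
        for name in files:
            # Keyed by bare filename: a later same-named file replaces an earlier one.
            sizes[name] = os.path.getsize(os.path.join(root, name))
    return sizes


def should_skip(filename, incoming_size, sizes):
    """Skip only when the name exists locally and the sizes are equal."""
    return filename in sizes and sizes[filename] == incoming_size
```

A file that exists locally with a different size is not skipped, so a partial download from an earlier run would be fetched again.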

I can take a stab at a branch to adjust this, but I'm not that acquainted with the source code yet, and I figure you may find a way that is a bit simpler to implement.

3 replies

Emy69 Oct 30, 2024
Maintainer

I made changes based on what you wrote. To be honest, I don't have much time, so I haven't applied them yet; this is just an example. Thanks in advance for pointing out this redundancy <3.

Listing Existing Files: Before starting the download, the function now lists all existing files in the target directory. This is achieved by using os.walk() to create a dictionary of existing filenames along with their sizes. This will help us check against pending filenames right from the start.
Comparison with Pending Filenames: During the download process, we now compare the filenames of media URLs to the existing files. If a file already exists, we then check its size to determine if we need to download it again.
Size Check Implementation: The code now includes logic to check the size of the remote file against the size of the existing file. If the sizes match, we skip the download, logging that the file already exists and is complete.

Here’s a snippet of the modified code:

import os
from collections import defaultdict
from concurrent.futures import as_completed

def download_media(self, site, user_id, service, query=None, download_all=False, initial_offset=0):
        try:
            posts = self.fetch_user_posts(site, user_id, service, query=query, initial_offset=initial_offset, log_fetching=download_all)
            if not posts:
                self.log(self.tr("No posts found for this user."))
                return

            if not download_all:
                posts = posts[:50]  # Limit to the first 50 posts

            futures = []
            grouped_media_urls = defaultdict(list)

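            # Index existing files: filename -> size in bytes. Keys are bare filenames,
            # so same-named files in different subfolders overwrite each other.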
            existing_files = {}
            for root, _, files in os.walk(self.download_folder):
                for file in files:
                    filepath = os.path.join(root, file)
                    existing_files[file] = os.path.getsize(filepath)

            for post in posts:
                media_urls = self.process_post(post)
                for media_url in media_urls:
                    grouped_media_urls[post['id']].append(media_url)

            self.total_files = sum(len(urls) for urls in grouped_media_urls.values())
            self.completed_files = 0

            for post_id, media_urls in grouped_media_urls.items():
                for media_url in media_urls:
                    filename = os.path.basename(media_url).split('?')[0]
                    # If the file already exists locally, compare sizes before re-downloading
                    if filename in existing_files:
                        # Obtain the remote file size
                        response = self.safe_request(media_url)
                        if response is None:
                            self.log(self.tr("Failed to download after multiple retries: {media_url}", media_url=media_url))
                            self.failed_files.append(media_url)
                            continue

                        remote_file_size = int(response.headers.get('content-length', 0))
                        if existing_files[filename] == remote_file_size:
                            self.log(self.tr("File already exists and is complete, skipping: {filename}", filename=filename))
                            self.skipped_files.append(filename)
                            continue

                    if self.download_mode == 'queue':
                        self.process_media_element(media_url, user_id, post_id)
                    else:
                        future = self.executor.submit(self.process_media_element, media_url, user_id, post_id)
                        futures.append(future)

            if self.download_mode == 'multi':
                for future in as_completed(futures):
                    if self.cancel_requested.is_set():
                        break

            if self.failed_files:
                self.log(self.tr("Retrying failed downloads..."))
                for media_url in self.failed_files:
                    if self.cancel_requested.is_set():
                        break
                    self.process_media_element(media_url, user_id)
                self.failed_files.clear()

        except Exception as e:
            self.log(self.tr("Error during download: {e}", e=e))
        finally:
            self.shutdown_executor()
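
One thing the snippet above leaves open is how the remote size is obtained. It is not clear from the thread whether safe_request fetches the full response body; if it does, the size check costs as much as the download it is meant to avoid. The snippet also defaults a missing content-length to 0, so an empty local file would compare equal and be skipped. A hedged sketch of an alternative, using only the standard library and a HEAD request (the function names are hypothetical and are not the scraper's API):

```python
import urllib.request


def parse_content_length(headers):
    """Return Content-Length as an int, or None if missing or malformed."""
    value = headers.get("Content-Length")
    try:
        size = int(value)
    except (TypeError, ValueError):
        return None
    return size if size >= 0 else None


def remote_size(url, timeout=10):
    """Ask the server for the file size with a HEAD request (no body sent back)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_content_length(response.headers)


def is_complete(local_size, remote):
    """Treat a file as complete only when the remote size is known and equal."""
    return remote is not None and local_size == remote
```

Returning None for an unknown size, rather than 0, means a server that omits the header leads to a re-download instead of a silent skip. Not every server answers HEAD requests or reports a length, so this would need a fallback in practice.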

01000011-shade Oct 30, 2024
Author

Awesome! Great work on the tool mate, this will definitely help to make it more productive.

Emy69 Oct 31, 2024
Maintainer

If you would like to submit a pull request with any changes, you are completely welcome to do so. Commit 31433d5
