Skip already downloaded files #137

Open
jhkennedy opened this issue Aug 2, 2021 · 1 comment
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@jhkennedy
Contributor

When you've partially downloaded a batch/project (for example, you downloaded the finished jobs before all of them had finished, or especially in the subscription context), it'd be nice to skip downloading files that already exist.

Implementation options

  1. Simplest would be to add a skip_existing: bool = False option to the download_files() function signatures that skips the download if the file already exists
  2. We could get fancier and compare the sizes and download if they don't match since the size is reported in the API
  3. We could also check the file checksum, but since it's not calculated in the API we'd have to calculate both on the fly

Overall, I lean towards the simplest here.
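
As a rough illustration of options 1 and 2, here is a minimal sketch of the skip decision on its own. The helper name should_skip_download and its expected_size parameter are hypothetical (they are not part of any existing API); expected_size stands in for the size reported by the API.

```python
from pathlib import Path
from typing import Optional


def should_skip_download(filepath: Path, expected_size: Optional[int] = None) -> bool:
    """Return True if filepath already exists and looks complete.

    Hypothetical helper: expected_size stands in for the size reported by
    the API. When it is None, only existence is checked (option 1).
    """
    if not filepath.is_file():
        return False
    if expected_size is None:
        return True
    # Option 2: download again when the size on disk differs from the reported size.
    return filepath.stat().st_size == expected_size
```

A skip_existing=True code path could call this before opening the request and return early when it is True. Checking existence alone keeps the change small, at the cost of also skipping files left incomplete by an interrupted download.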

@jhkennedy added the enhancement and good first issue labels Aug 2, 2021
@tshreve
tshreve commented Sep 10, 2024

Hi, I ran into this same situation recently, and think it would be nice to have an option to skip existing files, as proposed above. My temporary solution was to add the following to util.py:

        # Skip the download when the target file is already on disk
        # (assumes filepath is a pathlib.Path, as filepath.name below implies).
        if filepath.is_file():
            print(f"{filepath} already exists. Not downloading.")
        else:
            # Otherwise stream the response to disk behind a progress bar.
            with session.get(url, stream=stream) as s:
                s.raise_for_status()
                tqdm = get_tqdm_progress_bar()
                with tqdm.wrapattr(open(filepath, "wb"), 'write', miniters=1, desc=filepath.name,
                                   total=int(s.headers.get('content-length', 0))) as f:
                    for chunk in s.iter_content(chunk_size=chunk_size):
                        if chunk:
                            f.write(chunk)
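
One caveat with an existence-only check like the one above is that an interrupted download leaves a partial file behind, which would then be skipped on the next run. A minimal sketch of one way to guard against that, assuming it is acceptable to write to a temporary sibling file first; the function name write_chunks_atomically and the ".part" suffix are hypothetical:

```python
import os
from pathlib import Path
from typing import Iterable


def write_chunks_atomically(filepath: Path, chunks: Iterable[bytes]) -> Path:
    """Write chunks to a temporary sibling file, then rename it into place."""
    # Hypothetical naming scheme: "<name>.part" next to the final file.
    partial = filepath.with_name(filepath.name + ".part")
    with open(partial, "wb") as f:
        for chunk in chunks:
            if chunk:
                f.write(chunk)
    # Only reached when every chunk was written; replaces any existing target.
    os.replace(partial, filepath)
    return filepath
```

In the snippet above, chunks would be s.iter_content(chunk_size=chunk_size). If the stream fails partway, only the ".part" file is left on disk, so an is_file() check on the final path still reports the file as missing.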
