Remove audbackend.checksum() and use MD5 sum #255

Merged
merged 5 commits into from
Nov 26, 2024

hagenw
Member

@hagenw hagenw commented Nov 22, 2024

Closes #254

This adds a test for #254 and fixes it by reverting the introduction of audbackend.checksum() from #245. We should not upload the checksum stored in the metadata of the parquet file; instead, as before, we should always use the MD5 sum.

This will have no negative effect on version tracking with audb (I was wrong when stating this in audeering/audb#459). Uploading the MD5 sum instead of the parquet file hash affects only one case: if a user cancels an audb.publish() job, deletes the build folder, creates the table files again, and restarts the upload, the parquet table files are overwritten instead of skipped, because their MD5 sum has changed locally. This should be fine. If we kept the current implementation instead, we would have no way to verify whether a file was uploaded or downloaded correctly, as we would not store the actual MD5 sum.
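The verification argument above can be sketched as follows. This is a minimal illustration that uses the standard library `hashlib` in place of audeer.md5(); the helper names `md5` and `transferred_correctly` and the file names are hypothetical and not part of audbackend.

```python
import hashlib
import os
import shutil
import tempfile


def md5(path, chunk_size=8192):
    """Return the MD5 hex digest of the bytes stored at ``path``."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        # Read in chunks so large files do not need to fit into memory
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transferred_correctly(src, dst):
    """Check that ``dst`` holds exactly the same bytes as ``src``."""
    return md5(src) == md5(dst)


tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "table.parquet")  # hypothetical file name
dst = os.path.join(tmp, "copy.parquet")  # hypothetical file name
with open(src, "wb") as fp:
    fp.write(b"example bytes")  # placeholder, not a real parquet file

# A faithful copy has the same MD5 sum as the source
shutil.copy(src, dst)
ok_before = transferred_correctly(src, dst)

# Simulate a corrupted transfer by appending a single byte
with open(dst, "ab") as fp:
    fp.write(b"\x00")
ok_after = transferred_correctly(src, dst)
```

Because the MD5 sum is computed from the file bytes, any change to the stored file changes the sum. A hash kept inside the parquet metadata does not offer this check in the same way.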

Summary by Sourcery

Add a test for parquet file upload to Artifactory and fix the issue by reverting to using the MD5 checksum instead of the metadata checksum.

Bug Fixes:

  • Fix the issue with uploading parquet files by reverting to using the MD5 checksum instead of the checksum stored in the metadata.

Tests:

  • Add a test to ensure parquet files are uploaded with the correct MD5 checksum instead of the metadata checksum.

Contributor

sourcery-ai bot commented Nov 22, 2024

Reviewer's Guide by Sourcery

This PR fixes an issue with parquet file uploads by reverting back to using MD5 checksums instead of parquet metadata checksums. The implementation removes the checksum() function and replaces all its usages with direct calls to audeer.md5(). A new test has been added to verify the correct handling of parquet files with checksums in Artifactory.

Sequence diagram for parquet file upload with MD5 checksum

sequenceDiagram
    actor User
    participant Interface
    participant Artifactory
    participant Audeer

    User->>Interface: Upload parquet file
    Interface->>Audeer: Calculate MD5 checksum
    Audeer-->>Interface: Return MD5 checksum
    Interface->>Artifactory: Upload file with MD5 checksum
    Artifactory-->>Interface: Confirm upload
    Interface-->>User: Upload successful
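
The flow in the diagram, and the mismatch described in #254, can be sketched with a toy server. The sketch assumes the server compares the declared checksum with the MD5 sum of the received bytes; `ToyServer`, `upload`, and the metadata hash value are hypothetical, and this is not the Artifactory API.

```python
import hashlib


class ToyServer:
    """Toy stand-in for a server that validates a declared MD5 sum.

    This is NOT the Artifactory API, only an illustration.
    """

    def __init__(self):
        self.files = {}

    def put(self, name, content, declared_md5):
        # Compare the declared checksum with the MD5 sum of the received bytes
        actual = hashlib.md5(content).hexdigest()
        if declared_md5 != actual:
            raise ValueError(f"checksum mismatch for {name}")
        self.files[name] = (content, actual)


def upload(server, name, content, checksum):
    """Upload ``content`` and report whether the server accepted it."""
    try:
        server.put(name, content, checksum)
    except ValueError:
        return False
    return True


content = b"parquet bytes"  # placeholder, not a real parquet file
md5_sum = hashlib.md5(content).hexdigest()
metadata_hash = "hash-stored-in-parquet-metadata"  # hypothetical value

server = ToyServer()
# Declaring the MD5 sum of the bytes matches what the server computes
accepted_with_md5 = upload(server, "table.parquet", content, md5_sum)
# Declaring a hash taken from the file metadata does not match
accepted_with_metadata_hash = upload(
    server, "other.parquet", content, metadata_hash
)
```

In this sketch only the upload that declares the MD5 sum of the file bytes is accepted, which is the behavior the PR restores.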

File-Level Changes

Change: Remove custom checksum function and revert to using MD5 checksums
Details:
  • Remove the checksum() function that handled special cases for parquet files
  • Replace all utils.checksum() calls with audeer.md5()
Files:
  • audbackend/core/utils.py
  • audbackend/core/backend/base.py
  • audbackend/core/backend/filesystem.py
  • audbackend/core/interface/versioned.py
  • tests/bad_file_system.py

Change: Add test infrastructure for parquet file handling
Details:
  • Create a new fixture that generates a test parquet file with metadata checksum
  • Add test case to verify correct handling of parquet files in Artifactory
Files:
  • tests/conftest.py
  • tests/test_backend_artifactory.py

Change: Clean up documentation and configuration
Details:
  • Remove audformat from intersphinx mapping
  • Remove test_utils.py file
Files:
  • docs/conf.py
  • tests/test_utils.py

Assessment against linked issues

Issue: #254
Objective: Fix Artifactory backend refusing to upload parquet files due to checksum mismatch

@hagenw hagenw marked this pull request as draft November 22, 2024 08:14
Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @hagenw - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good

codecov bot commented Nov 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.0%. Comparing base (1b1177d) to head (798b82b).
Report is 1 commit behind head on main.

Additional details and impacted files
Files with missing lines Coverage Δ
audbackend/__init__.py 100.0% <ø> (ø)
audbackend/core/backend/base.py 100.0% <100.0%> (ø)
audbackend/core/backend/filesystem.py 100.0% <100.0%> (ø)
audbackend/core/interface/versioned.py 100.0% <ø> (ø)
audbackend/core/utils.py 100.0% <ø> (ø)

@hagenw hagenw marked this pull request as ready for review November 22, 2024 09:01
@hagenw hagenw requested a review from ChristianGeng November 22, 2024 09:06
@hagenw hagenw changed the title from "TST: add failing test for parquet on Artifactory" to "Remove audbackend.checksum() and use MD5 sum" Nov 25, 2024
@ChristianGeng ChristianGeng self-requested a review November 26, 2024 15:00
Member

@ChristianGeng ChristianGeng left a comment

I only had a minor issue concerning possible developer confusion
about the terminology of md5sum and checksum.

This is resolved as well as it can be, so I approve.

@hagenw hagenw merged commit 639d86c into main Nov 26, 2024
10 checks passed
@hagenw hagenw deleted the remove-checksum branch November 26, 2024 15:36
Development

Successfully merging this pull request may close these issues.

Artifactory backend refuses to upload parquet files
2 participants