Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: simplify base64 decoding in the lenient case #56905

Closed
wants to merge 1 commit into from

Conversation

lemire
Copy link
Member

@lemire lemire commented Feb 3, 2025

The simdutf library can be instructed to ignore invalid characters, which suggests that we may be able to simplify some of our C++ code.

@lemire lemire requested review from targos and anonrig February 3, 2025 16:47
@nodejs-github-bot nodejs-github-bot added buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. labels Feb 3, 2025
@lemire lemire added request-ci Add this label to start a Jenkins CI on a PR. and removed buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. labels Feb 3, 2025
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Feb 3, 2025
@nodejs-github-bot
Copy link
Collaborator

@anonrig
Copy link
Member

anonrig commented Feb 3, 2025

@lemire it seems the tests are failing

Copy link

codecov bot commented Feb 3, 2025

Codecov Report

Attention: Patch coverage is 25.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 89.07%. Comparing base (6857dbc) to head (516aaee).
Report is 21 commits behind head on main.

Files with missing lines Patch % Lines
src/string_bytes.cc 25.00% 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #56905      +/-   ##
==========================================
+ Coverage   88.70%   89.07%   +0.37%     
==========================================
  Files         663      665       +2     
  Lines      192012   192586     +574     
  Branches    36613    37014     +401     
==========================================
+ Hits       170317   171543    +1226     
+ Misses      14520    13838     -682     
- Partials     7175     7205      +30     
Files with missing lines Coverage Δ
src/string_bytes.cc 65.96% <25.00%> (-2.48%) ⬇️

... and 116 files with indirect coverage changes

@lemire
Copy link
Member Author

lemire commented Feb 3, 2025

I am going to drop this PR because, as I feared, the expectation is that Buffer.from('put anything here', 'base64')==Buffer.from('put anything here', 'base64url'). That is, even though we have two distinct encoding superficially, there are treated the same.

So it is not simply that Node.js does not validate, but it is also that it does not distinguish (at all) between 'base64url' and 'base64'. This seems to contradict the documentation:

'base64': Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC 4648, Section 5. Whitespace characters such as spaces, tabs, and new lines contained within the base64-encoded string are ignored.
'base64url': base64url encoding as specified in RFC 4648, Section 5. When creating a Buffer from a string, this encoding will also correctly accept regular base64-encoded strings. When encoding a Buffer to a string, this encoding will omit padding.

Somewhat ironically, here is what RFC 4648 section 5 says...

This encoding may be referred to as "base64url". This encoding should not be regarded as the same as the "base64" encoding and should not be referred to as only "base64". Unless clarified otherwise, "base64" refers to the base 64 in the previous section.

What Node.js does is to confuse the two encodings. Currently, the simdutf does not do that: it treats base64url and base64 as two different encodings (as per RFC 4648).

So we need the current fallback code and my PR is not going to work.

Note that Node.js can decode invalid sequences in interesting ways...

Buffer.from('__-+_--_--/_', 'base64url') == Buffer.from('__--_--_--__', 'base64url') == Buffer.from('__--_--_--__', 'base64') == Buffer.from('__-+_--_--/_', 'base64')

That's definitively not in RFC 4648.

@lemire lemire closed this Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants