Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New TOC CD-TEXT string decoding #633

Merged
merged 1 commit into from
Nov 25, 2024

Conversation

ntrrgc
Copy link
Contributor

@ntrrgc ntrrgc commented Aug 27, 2024

This patch replaces the previous broken approach to TOC string decoding that used .encode().decode('unicode_escape') with proper parsing of the escape sequences cdrdao is known to generate.

The new parser is also lenient with invalid escape sequences, that can occur due to improper escaping in cdrdao. See:
cdrdao/cdrdao#32

Latin-1:

This new parsing method should work for Latin-1 strings for both old and new versions of cdrdao, as long as those strings don't trigger the improper escaping issues in upstream cdrdao.

This has been verified with the album Diorama from the Danish black metal band MØL.

MS-JIS:

This new parsing method should also work for MS-JIS strings as long as the .toc file was generated by cdrdao 1.2.5+ and the strings don't trigger improper escaping issues in upstream cdrdao.

Unfortunately, I don't have any CD with CD-Text in MS-JIS, so I could not verify this.

cdrdao versions before 1.2.5 will still cause whipper to produce mojibake (garbled characters) when reading MS-JIS CD-Text, as those versions do not encode strings in UTF-8.

Other encodings:

As far as I know, CD-Text only supports officially ASCII, Latin-1 and MS-JIS, but I wouldn't be surprised if there are unofficial encodings out there, given the strange strings I've seen in some bug reports.

If you have a CD with garbled CD-Text, please submit a bug report indicating the performer, album name, language and attach the .toc file so that the produced strings can be compared to the expected text.

Fixes #169

Fixes #183

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💖 Thanks for opening your first pull request here! 💖

This patch replaces the previous broken approach to TOC string decoding
that used `.encode().decode('unicode_escape')` with proper parsing of
the escape sequences cdrdao is known to generate.

The new parser is also lenient with invalid escape sequences, that can
occur due to improper escaping in cdrdao. See:
cdrdao/cdrdao#32

Latin-1:

This new parsing method should work for Latin-1 strings for both old and
new versions of cdrdao, as long as those strings don't trigger the
improper escaping issues in upstream cdrdao.

This has been verified with the album Diorama from the Danish black
metal band MØL.

MS-JIS:

This new parsing method should also work for MS-JIS strings as long as
the .toc file was generated by cdrdao 1.2.5+ and the strings don't
trigger improper escaping issues in upstream cdrdao.

Unfortunately, I don't have any CD with CD-Text in MS-JIS, so I could
not verify this.

cdrdao versions before 1.2.5 will still cause whipper to produce
mojibake (garbled characters) when reading MS-JIS CD-Text, as those
versions do not encode strings in UTF-8.

Other encodings:

As far as I know, CD-Text only supports officially ASCII, Latin-1 and
MS-JIS, but I wouldn't be surprised if there are unofficial encodings
out there, given the strange strings I've seen in some bug reports.

If you have a CD with garbled CD-Text, please submit a bug report
indicating the performer, album name, language and attach the .toc file
so that the produced strings can be compared to the expected text.

Fixes whipper-team#169

Signed-off-by: Alicia Boya García <ntrrgc@gmail.com>
@ntrrgc ntrrgc force-pushed the toc-string-parsing branch from ac16fbc to 4719c74 Compare August 27, 2024 10:19
@ntrrgc
Copy link
Contributor Author

ntrrgc commented Aug 27, 2024

Previously this PR mentioned this fixed #183, but I now realized the original poster's issue was related to multi-session, not CD-Text. The second message in that issue is related to CD-Text, and this patch should still fix the crash in that particular case.

@ntrrgc
Copy link
Contributor Author

ntrrgc commented Nov 25, 2024

Can anyone review this?

@MerlijnWajer MerlijnWajer merged commit 799fd92 into whipper-team:develop Nov 25, 2024
3 checks passed
@MerlijnWajer
Copy link
Collaborator

Thank you for the clear description, additional tests and the code to resolve this problem. Apologies for the long wait and thank you for the ping on getting this reviewed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TOCs of a couple of discs that cause whipper to fail CD-Text cannot be parsed, raising ValueErrors
2 participants