Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement character table control and codepage 437 option #195

Merged
merged 18 commits into from
Dec 11, 2023

Conversation

sharifhsn
Copy link
Contributor

Implements #194

Additional comments:

  • I've directly copy-pasted the codepage table from the repository in the blog post you linked. I'm not sure how to properly give credit for it.
  • I replaced the null character with the ⋄ character since null doesn't appear in printed output, which is pretty common.

@sharifhsn
Copy link
Contributor Author

Here's a visual image of the difference.
image

@sharkdp
Copy link
Owner

sharkdp commented Jun 13, 2023

Wow! That was fast 😄

Thank you so much.

I've directly copy-pasted the codepage table from the repository in the blog post you linked. I'm not sure how to properly give credit for it.

I think I saw a MIT license in the repository earlier today? If so, I would just copy the license text verbatim and paste it into a comment preceding the table. And add a reference to the original source file + author, if not present in the license text. Maybe let's also describe the manual modifications we did in that comment.

@sharkdp
Copy link
Owner

sharkdp commented Jun 13, 2023

Look how cool a .gif file looks with this feature enabled 😄

image

src/bin/hexyl.rs Outdated Show resolved Hide resolved
src/bin/hexyl.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
@delan
Copy link

delan commented Jun 14, 2023

Nice! I’m glad this idea is catching on.

Note that ␀ is already a printable character, since it’s U+2400 SYMBOL FOR NULL rather than U+0000 NULL. In fact, the whole range U+2400 through U+241F are printable representations of the C0 controls U+0000 through U+001F, though I chose to use the cp437 dingbats for the others. ⋄ probably makes more sense for hexyl though :)

FFh is also visually indistinguishable from 20h in this table, so for the Rewind path interpreter, we used U+FB00 LATIN SMALL LIGATURE FF (ff) for FFh.

@sharkdp
Copy link
Owner

sharkdp commented Jun 14, 2023

Nice! I’m glad this idea is catching on.

Thank you very much for commenting here and for the original idea! A while ago, I built binocle, which uses a similar idea in a graphical application (distinctly colored pixels for each byte), but I never thought about translating this idea to the character panel of hexyl.

I'm now thinking that it might also be cool to have a mode in hexyl where we have distinct colors for every byte (not just for different categories). We could then print █ (U+2588) for every byte in the character table, but colorize it differently using ANSI escape sequences. As for this PR, we could maybe already add a --character-table block (or similar) variant where we simply print █ for each byte. With the category-colorization, it would look like this for the GIF example above:

image

Note that ␀ is already a printable character, since it’s U+2400 SYMBOL FOR NULL rather than U+0000 NULL. In fact, the whole range U+2400 through U+241F are printable representations of the C0 controls U+0000 through U+001F, though I chose to use the cp437 dingbats for the others.

We use (some of) those U+2400 … U+241F symbols in bat --show-all. I like the idea. But at least with my terminal emulator / font, they are usually really hard to read. Even at large font sizes, I can barely make out the "NUL" characters. This is why we chose for hexyl. @sharifhsn but we should update the comment for consistency: "modified to use the '⋄' character instead of '␀'"

FFh is also visually indistinguishable from 20h in this table, so for the Rewind path interpreter, we used U+FB00 LATIN SMALL LIGATURE FF (ff) for FFh.

@sharifhsn I think we should integrate the same adaptation here, if you agree?

@sharifhsn
Copy link
Contributor Author

@sharifhsn but we should update the comment for consistency: "modified to use the '⋄' character instead of '␀'"

Makes sense.

@sharifhsn I think we should integrate the same adaptation here, if you agree?

Ditto.

I'm now thinking that it might also be cool to have a mode in hexyl where we have distinct colors for every byte (not just for different categories). We could then print █ (U+2588) for every byte in the character table, but colorize it differently using ANSI escape sequences. As for this PR, we could maybe already add a --character-table block (or similar) variant where we simply print █ for each byte.

Great idea! I've implemented this using the xterm colors as there are 256 distinct colors, which maps perfectly to the number of distinct bytes. I'm open to a change in color scheme. Here's how it looks now:

image

@sharkdp
Copy link
Owner

sharkdp commented Jun 14, 2023

This is so cool.

Especially when combined with --border none, --no-squeezing, potentially --group-size 8 and --panels 8 on a wide screen:

image

But I would appreciated if we could focus this PR on --character-table and finish that first. Add a few basic tests. Make sure that it plays nicely with other features like e.g. squeezing. For example, I noticed that the squeezing character gets replaced by a block when --character-table=block. And maybe we should quickly check if this doesn't introduce any performance regression for the default case.

We can then explore the colorization part in a separate PR.

@sharifhsn
Copy link
Contributor Author

I added a test for code page 437, and I've fixed the issue with the asterisk, which was a leftover bug from my original implementation anyway 😅

I've also refactored the color components into a new module, since they're taking up a lot of visual space and have separated concerns anyway.

With regards to performance, preliminary benchmarks show that performance is within noise for the default case and for code page 437, and significantly slower with block coloring (due to the fact that the color changes every single byte).

Command Mean [s] Min [s] Max [s] Relative
target/release/hexyl --character-table codepage-437 data 2.246 ± 0.087 2.169 2.442 1.00
target/release/hexyl --character-table block data 4.412 ± 0.120 4.229 4.567 1.96 ± 0.09
target/release/hexyl data 2.490 ± 0.357 2.195 3.234 1.11 ± 0.16
hexyl data 2.274 ± 0.142 2.071 2.579 1.01 ± 0.07

With regards to colorization and the block table, it's all the same machinery. I would prefer to push this all in one PR unless there's some outstanding issue with the block table that can't be resolved quickly.

@sharkdp
Copy link
Owner

sharkdp commented Jul 23, 2023

I've also refactored the color components into a new module, since they're taking up a lot of visual space and have separated concerns anyway.

Thank you for the updates. I would still prefer to move the colorization part to a new PR.

@sharifhsn
Copy link
Contributor Author

@sharkdp The colorization of characters has been factored out. This PR should be ready to merge.

src/lib.rs Outdated
@@ -62,28 +61,31 @@ impl Byte {
}
}

fn color(self) -> &'static [u8] {
fn color(self, character_table: CharacterTable) -> &'static [u8] {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the character_table parameter here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is left over from when I included distinct colored blocks as a character table option. I'll remove it now, good catch!

@@ -401,8 +415,7 @@ impl<'a, Writer: Write> Printer<'a, Writer> {
if self.show_position_panel {
match self.squeezer {
Squeezer::Print => {
self.writer
.write_all(self.byte_char_panel[b'*' as usize].as_bytes())?;
self.writer.write_all(&[b'*'])?;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a leftover bug from my initial design. Originally, the colored characters were pre-colored and cached ahead of time. I changed the way that coloring works in #176 to write prefix color codes separately from the characters themselves. The caching is now unnecessary here.

Copy link
Owner

@sharkdp sharkdp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much! Just two minor questions.

@sharkdp sharkdp merged commit 2bab0c7 into sharkdp:master Dec 11, 2023
14 checks passed
@sharifhsn sharifhsn deleted the cp437 branch December 11, 2023 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants