latin1 causes error #140

B0Gec · 2022-08-27T10:42:58Z


Describe the bug
Signing text that contains non-latin1 characters causes an error. I recommend using 'utf-8' instead of 'latin1' in fonts/basic.py.

MatthiasValvekens · 2022-08-27T14:14:07Z

Maintainer

Hi @bostjangec, thanks for your comment. I'm assuming this is about how SimpleFontEngine encodes text?

PyHanko supports writing Unicode text, but unfortunately it's going to be a bit more complicated than just writing UTF-8 to the content stream.

PDF's text display features are older than Unicode, and displaying non-Latin text properly requires some effort. While there are a number of very simple "standard" fonts that (virtually) all PDF readers will offer, (oversimplifying a little bit) those all work with the Latin character set. That works fine for very simple things, but (as you have discovered) it doesn't really generalise well. This is also why pyHanko uses latin1 in SimpleFontEngine. That was a deliberate choice, since arbitrary UTF-8 probably wouldn't work in a lot of viewers anyhow.
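To make the failure mode concrete, here is a minimal sketch in plain Python (no pyHanko involved) of what happens when text outside the Latin-1 range is encoded with latin1. The helper name `try_latin1` is made up for illustration and is not part of pyHanko.

```python
from typing import Optional


def try_latin1(text: str) -> Optional[bytes]:
    """Return the Latin-1 encoding of `text`, or None if it cannot be encoded.

    Hypothetical helper for illustration only; not pyHanko code.
    """
    try:
        return text.encode("latin1")
    except UnicodeEncodeError:
        # Raised for any character above U+00FF, e.g. "š" (U+0161).
        return None


print(try_latin1("café"))     # "é" is U+00E9, which Latin-1 can represent
print(try_latin1("Boštjan"))  # "š" is U+0161, which Latin-1 cannot represent
```

Swapping in 'utf-8' would make the encode call succeed, but the simple standard fonts would still not display those bytes as the intended characters.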

Now, in your case, what you want to do is choose a font of your liking (that supports the characters you need), and embed a subset of it. PyHanko implements that using a font engine called GlyphAccumulator. There's a fairly straightforward example in the docs.

Under the hood, pyHanko will invoke HarfBuzz to handle shaping, and use that to translate your Unicode strings to PDF display operators ("regular" character encodings don't really enter into the equation). The font is then subsetted using fontTools and embedded into the file.
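For reference, a rough sketch of what that setup can look like, modelled on the example in the pyHanko docs. The function name `build_stamp_style`, the font path and the stamp text are placeholders, and the imports are done inside the function on the assumption that the optional OpenType dependencies (HarfBuzz bindings and fontTools) may not be installed everywhere.

```python
def build_stamp_style(font_path: str):
    """Build a text stamp style that embeds a subset of an OpenType/TrueType font.

    Sketch only: `font_path` must point to a font that covers the characters
    you need (for example a Noto Sans file). The function name is hypothetical.
    """
    # Imported lazily because OpenType support relies on optional dependencies.
    from pyhanko import stamp
    from pyhanko.pdf_utils import text
    from pyhanko.pdf_utils.font import opentype

    return stamp.TextStampStyle(
        # 'signer' and 'ts' are interpolated by pyHanko when present.
        stamp_text="Signed by: %(signer)s\nTime: %(ts)s",
        text_box_style=text.TextBoxStyle(
            # GlyphAccumulatorFactory replaces the default SimpleFontEngine.
            font=opentype.GlyphAccumulatorFactory(font_path)
        ),
    )
```

The resulting style would then be passed as `stamp_style=...` when constructing `signers.PdfSigner`, as in the docs example.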

TL;DR: Text handling in PDF is complicated, and the output of SimpleFontEngine effectively can't handle non-Latin text. Use GlyphAccumulator instead.

EDIT: moved to discussion since this isn't a bug.

1 reply

MatthiasValvekens Sep 2, 2022
Maintainer

Minor addendum: I changed that line of code to use the encoder method from generic.TextStringObject instead of directly calling str.encode(...). Not so much because of the character encoding issue, but rather because generic.TextStringObject will handle escape sequences properly. Everything I wrote in my answer still applies, though :).
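To illustrate why escaping matters, here is a simplified sketch of how a PDF literal string has to be escaped before it goes into a content stream. This is not pyHanko's implementation; `escape_pdf_literal` is a hypothetical helper that only covers backslashes and parentheses.

```python
def escape_pdf_literal(raw: bytes) -> bytes:
    """Wrap `raw` as a PDF literal string, escaping backslashes and parentheses.

    Simplified illustration; hypothetical helper, not pyHanko code.
    """
    escaped = (
        raw.replace(b"\\", b"\\\\")  # backslashes first, so later escapes survive
        .replace(b"(", b"\\(")
        .replace(b")", b"\\)")
    )
    return b"(" + escaped + b")"


# Without escaping, the unbalanced ")" would end the string early.
print(escape_pdf_literal("Signed :)".encode("latin1")))
```

Calling `str.encode(...)` alone skips this step, so text containing parentheses or backslashes could corrupt the content stream.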
