Changes for UTF8 character representation within string literals #74

Ian-Grant · 2024-01-26T17:35:40Z

Added a UTF8 unit to mosmllib.
Modified PP.sml to correctly format multi-octet UTF8 strings (to do this it has to parse the UTF8 representations).
Changed Lexer.lex to check UTF8 encodings in string literals and to allow the full ISO/IEC 10646 UCS range to be used in numerical escapes:

\U+XXXXXX

as well as the form

\uXXXX

which is the one specified in the Standard ML Definition.
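To illustrate what a numerical escape has to turn into, here is a minimal sketch of encoding one code point as UTF8 octets. It is not the code from this patch: `encodeUtf8` and `BadCodePoint` are hypothetical names, and rejecting surrogates and values above 10FFFF is an assumption about the desired behavior. Four hex digits can only reach FFFF, which is why a six-digit form is needed for the rest of the range.

```sml
(* Hypothetical helper, not taken from the patch:
   encode one UCS code point as a UTF8 octet string. *)
exception BadCodePoint

fun encodeUtf8 (cp : word) : string =
  let
    (* Turn a small word value into a one-octet character. *)
    fun byte w = Char.chr (Word.toInt w)
    (* Leading octet: the length tag plus the top bits of cp. *)
    fun lead (tag, shift) = byte (Word.orb (tag, Word.>> (cp, shift)))
    (* Continuation octet: 10xxxxxx carrying six bits of cp. *)
    fun cont shift =
      byte (Word.orb (0wx80, Word.andb (Word.>> (cp, shift), 0wx3F)))
  in
    if cp < 0wx80 then String.str (byte cp)
    else if cp < 0wx800 then
      String.implode [lead (0wxC0, 0w6), cont 0w0]
    (* Assumption: surrogates are rejected rather than encoded. *)
    else if cp >= 0wxD800 andalso cp <= 0wxDFFF then raise BadCodePoint
    else if cp < 0wx10000 then
      String.implode [lead (0wxE0, 0w12), cont 0w6, cont 0w0]
    else if cp <= 0wx10FFFF then
      String.implode [lead (0wxF0, 0w18), cont 0w12, cont 0w6, cont 0w0]
    else raise BadCodePoint
  end
```

Under these assumptions, `encodeUtf8 0wx3BB` gives the two octets `"\206\187"`, and `encodeUtf8 0wx1F600`, which is out of reach of the four-digit form, gives `"\240\159\152\128"`.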

UTF8 checking within character strings is non-standard compiler behavior, so the switch Meta.utf8 is provided to turn it on. On reflection, the extended syntax for numeric character escapes should probably be conditional on that switch too.
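For reference, the kind of check involved can be sketched as follows. This is only an illustration under assumptions, not the lexer code in this patch: `isValidUtf8` is a hypothetical name, and the octet ranges follow the usual well-formedness rules for UTF8 (no overlong forms, no surrogates, nothing above 10FFFF). How Meta.utf8 gates the real check is not shown here.

```sml
(* Hypothetical checker, not the code from the patch:
   true iff s is a well-formed UTF8 octet sequence. *)
fun isValidUtf8 (s : string) : bool =
  let
    val n = String.size s
    fun byte i = Char.ord (String.sub (s, i))
    (* True if octet i exists and lies in [lo, hi]. *)
    fun inRange (i, lo, hi) =
      i < n andalso (let val b = byte i in lo <= b andalso b <= hi end)
    (* Ordinary continuation octet, 80..BF. *)
    fun cont i = inRange (i, 0x80, 0xBF)
    fun loop i =
      if i >= n then true
      else
        let val b = byte i in
          if b < 0x80 then loop (i + 1)
          else if 0xC2 <= b andalso b <= 0xDF then
            cont (i + 1) andalso loop (i + 2)
          else if b = 0xE0 then (* excludes overlong three-octet forms *)
            inRange (i + 1, 0xA0, 0xBF) andalso cont (i + 2)
            andalso loop (i + 3)
          else if b = 0xED then (* excludes surrogates D800..DFFF *)
            inRange (i + 1, 0x80, 0x9F) andalso cont (i + 2)
            andalso loop (i + 3)
          else if 0xE1 <= b andalso b <= 0xEF then
            cont (i + 1) andalso cont (i + 2) andalso loop (i + 3)
          else if b = 0xF0 then (* excludes overlong four-octet forms *)
            inRange (i + 1, 0x90, 0xBF) andalso cont (i + 2)
            andalso cont (i + 3) andalso loop (i + 4)
          else if 0xF1 <= b andalso b <= 0xF3 then
            cont (i + 1) andalso cont (i + 2) andalso cont (i + 3)
            andalso loop (i + 4)
          else if b = 0xF4 then (* excludes values above 10FFFF *)
            inRange (i + 1, 0x80, 0x8F) andalso cont (i + 2)
            andalso cont (i + 3) andalso loop (i + 4)
          else false (* 80..C1 and F5..FF never start a sequence *)
        end
  in
    loop 0
  end
```

With this sketch, `isValidUtf8 "\206\187"` holds, while the overlong form `"\192\128"`, the encoded surrogate `"\237\160\128"` and the truncated `"\206"` are all rejected.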

Some of the logic came from the HOL theorem prover, although they now do it differently; see src/portableML/UTF8.sml in https://github.com/HOL-Theorem-Prover/HOL

