---
theme: gaia
_class: lead
paginate: true
backgroundColor:
backgroundImage: url('https://marp.app/assets/hero-background.svg')
style: |
  section.photo h1,section.photo h2,section.photo h3,section.photo h4,section.photo h5,section.photo h6 { background-color: #888; color: #FFF; }
  h6 { font-size: 30%; }
  img[alt~="centre"] { display: block; margin: 0 auto; }
marp: true
---

Strings and OsStr: A wild ride through the history of Unicode

Jonathan Pallant

A Journey...

A String is just a String, right?
A Brief History of the String
Not all Strings are alike

A String is just a String, right?

String
Byte String
OS String
C Strings

String

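let s: String = "Hi 😀!".to_owned();
dbg!(&s);
dbg!(s.len());           // 8: length in bytes, not characters
dbg!(s.bytes().count()); // 8: the emoji takes four bytes in UTF-8
dbg!(s.chars().count()); // 5: Unicode Scalar Values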

A Vec<u8> inside, guaranteed to be valid UTF-8
.chars() iterates as 32-bit char (Unicode Scalar Values)
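
To make the "bytes inside, char on the way out" point concrete, here is a minimal sketch. The helper name `string_from_bytes` is made up for this example; everything else is the standard library.

```rust
use std::mem::size_of;

/// Hypothetical helper: builds a `String` from raw bytes,
/// returning `None` if they are not valid UTF-8.
fn string_from_bytes(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

fn main() {
    // A `char` is always four bytes wide, whatever it encodes to in UTF-8.
    assert_eq!(size_of::<char>(), 4);

    // "Hi 😀!" as UTF-8: three ASCII bytes, four bytes of emoji, one ASCII byte.
    let bytes = vec![b'H', b'i', b' ', 0xF0, 0x9F, 0x98, 0x80, b'!'];
    let s = string_from_bytes(bytes).expect("valid UTF-8");
    assert_eq!(s, "Hi 😀!");
    assert_eq!(s.len(), 8); // bytes
    assert_eq!(s.chars().count(), 5); // Unicode Scalar Values

    // A lone continuation byte is not valid UTF-8, so no `String` can hold it.
    assert!(string_from_bytes(vec![0x80]).is_none());

    // Getting the underlying vector back does not copy or re-encode.
    let raw: Vec<u8> = s.into_bytes();
    assert_eq!(raw.len(), 8);
}
```

The validity check happens once, at construction, which is why the rest of the `String` API can assume well-formed UTF-8.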

Byte String

let s: [u8; 13] = b"Hello, world!".to_owned();
dbg!(&s);
dbg!(s.len());

Iterates as octets (u8)
An array or Vec of octets (u8) inside, with no encoding implied
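
A short sketch of what "no encoding implied" means in practice. `count_non_ascii` is an illustrative helper, not something from the talk.

```rust
/// Hypothetical helper: counts the bytes outside the 7-bit ASCII range.
fn count_non_ascii(data: &[u8]) -> usize {
    data.iter().filter(|b| !b.is_ascii()).count()
}

fn main() {
    // A byte-string literal is a reference to a fixed-size array...
    let literal: &[u8; 13] = b"Hello, world!";
    // ...which can be copied into a growable vector when needed.
    let owned: Vec<u8> = literal.to_vec();
    assert_eq!(owned.len(), 13);
    assert_eq!(count_non_ascii(&owned), 0);

    // Byte strings may hold any octet, including sequences that are not UTF-8.
    // 0xE9 is "é" in Latin-1, but on its own it is a truncated UTF-8 sequence.
    let latin1: &[u8] = b"caf\xE9";
    assert_eq!(count_non_ascii(latin1), 1);
    assert!(std::str::from_utf8(latin1).is_err());
}
```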

A Brief History of the String

The Punched Card

Character Encoding

Computers work in numbers
Humans like to write words
Words are made of characters
- Technically grapheme clusters
- Is ï one character or two?
We need a conversion table!
- AKA: A Character Set
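
The "one character or two?" question can be checked directly. This sketch compares the precomposed ï (U+00EF) with an i followed by a combining diaeresis (U+0308); the helper `sizes` is invented for the example.

```rust
/// Hypothetical helper: returns (UTF-8 bytes, Unicode Scalar Values) for a string.
fn sizes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let precomposed = "\u{00EF}"; // ï as a single code point
    let combining = "i\u{0308}"; // i followed by COMBINING DIAERESIS

    assert_eq!(sizes(precomposed), (2, 1));
    assert_eq!(sizes(combining), (3, 2));

    // Both render as one grapheme cluster, yet the strings are not equal.
    assert_ne!(precomposed, combining);
}
```

The standard library stops at Unicode Scalar Values; counting grapheme clusters needs a third-party crate such as `unicode-segmentation`.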

American Standard Code for Information Interchange

Morse Code
Telegraph / Baudot codes
BCD
EBCDIC
ASA X3.4-1963
aka ASCII

An ASCII Table

What if we used the eighth bit?

We get 128 more characters!

More standards are required...

MS-DOS Code Page 437, 850, ...
Windows Code Page 1252, 1250, ...
Macintosh Code Page 1275, 1282, ...
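
Why this became a problem: the same byte means different things under different code pages. The sketch below decodes single bytes as ISO 8859-1 (Latin-1, which Windows Code Page 1252 matches for bytes 0xA0 to 0xFF) and as a deliberately tiny, hand-written subset of Code Page 437. Both function names are made up for illustration.

```rust
/// Decodes one byte as ISO 8859-1 (Latin-1), where the byte value
/// is the Unicode code point.
fn decode_latin1(byte: u8) -> char {
    byte as char
}

/// Decodes one byte as MS-DOS Code Page 437.
/// Illustrative only: covers printable ASCII plus two high-bit entries.
fn decode_cp437_partial(byte: u8) -> Option<char> {
    match byte {
        0x20..=0x7E => Some(byte as char),
        0x82 => Some('é'),
        0xE9 => Some('Θ'),
        _ => None, // not covered by this sketch
    }
}

fn main() {
    // Same byte, two different characters.
    assert_eq!(decode_latin1(0xE9), 'é');
    assert_eq!(decode_cp437_partial(0xE9), Some('Θ'));

    // Same character, two different bytes.
    assert_eq!(decode_cp437_partial(0x82), Some('é'));

    // The lower 7 bits agree, which is the ASCII-compatible part.
    assert_eq!(decode_latin1(b'A'), 'A');
    assert_eq!(decode_cp437_partial(b'A'), Some('A'));
}
```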

OK, one Standard to Rule Them All then

Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.

OK, let's go!

Microsoft used it in Windows
Sun used it in Java
Netscape used it in JavaScript
The Standard C Library added wcslen and friends

Unicode 2.0 in 1996...

Unicode Transformation Format 16 (UTF-16) arrives

Isn't this the worst of everything?

Unit length != number of characters
Not ASCII compatible
Enter Plan 9 and UTF-8...
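
Both complaints are easy to reproduce with the standard library. A minimal sketch, with `utf16_units` as an invented helper name:

```rust
/// Hypothetical helper: encodes `s` as UTF-16 code units.
fn utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    // Unit length != number of characters: the emoji needs a surrogate pair.
    let units = utf16_units("Hi 😀!");
    assert_eq!(units, vec![0x48, 0x69, 0x20, 0xD83D, 0xDE00, 0x21]);
    assert_eq!(units.len(), 6);
    assert_eq!("Hi 😀!".chars().count(), 5);

    // Not ASCII compatible: even 'A' takes two bytes, one of them zero.
    let bytes: Vec<u8> = utf16_units("A")
        .iter()
        .flat_map(|u| u.to_le_bytes())
        .collect();
    assert_eq!(bytes, [0x41, 0x00]);

    // Decoding is fallible: half of a surrogate pair is not valid UTF-16.
    assert_eq!(String::from_utf16(&units).unwrap(), "Hi 😀!");
    assert!(String::from_utf16(&[0xD83D]).is_err());
}
```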

UTF-8

Variable-length encoding
Can encode any Unicode Scalar Value as one, two, three or four bytes.
Unit length != number of characters
0b0xxxxxxx
0b110xxxxx 0b10xxxxxx
0b1110xxxx 0b10xxxxxx 0b10xxxxxx
0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx
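
The bit patterns can be seen in real encodings. In this sketch `utf8_bytes` and `sequence_len` are illustrative names; `sequence_len` only inspects the leading bits and makes no attempt at full validation.

```rust
/// Hypothetical helper: returns the UTF-8 encoding of one Unicode Scalar Value.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical helper: reads the sequence length from the first byte's bit pattern.
/// Returns `None` for a continuation byte (0b10xxxxxx) or an unknown pattern.
/// Sketch only: it does not reject overlong or out-of-range lead bytes.
fn sequence_len(first: u8) -> Option<usize> {
    match first.leading_ones() {
        0 => Some(1), // 0b0xxxxxxx
        2 => Some(2), // 0b110xxxxx
        3 => Some(3), // 0b1110xxxx
        4 => Some(4), // 0b11110xxx
        _ => None,
    }
}

fn main() {
    assert_eq!(utf8_bytes('A'), [0x41]);
    assert_eq!(utf8_bytes('é'), [0xC3, 0xA9]);
    assert_eq!(utf8_bytes('€'), [0xE2, 0x82, 0xAC]);
    assert_eq!(utf8_bytes('😀'), [0xF0, 0x9F, 0x98, 0x80]);

    // The first byte alone says how long the sequence is.
    for c in ['A', 'é', '€', '😀'] {
        let bytes = utf8_bytes(c);
        assert_eq!(sequence_len(bytes[0]), Some(bytes.len()));
        assert_eq!(c.len_utf8(), bytes.len());
    }

    // A continuation byte can never start a sequence.
    assert_eq!(sequence_len(0x80), None);
}
```

Because ASCII bytes always have the top bit clear and every byte of a longer sequence has it set, plain ASCII text is already valid UTF-8.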

Are we done now?

POSIX says file names are an array of 8-bit values
Windows says file names are an array of 16-bit wchar_t
:(
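
A sketch of what that means for Rust code: a file name the OS accepts may have no `String` representation. The two constructor functions are invented names wrapping the platform extension traits in `std::os::unix::ffi` and `std::os::windows::ffi`; only the branch for the current platform is compiled.

```rust
use std::ffi::OsString;

/// Hypothetical helper (Unix): builds an `OsString` from raw 8-bit file-name bytes.
#[cfg(unix)]
fn os_string_from_bytes(bytes: &[u8]) -> OsString {
    use std::os::unix::ffi::OsStringExt;
    OsString::from_vec(bytes.to_vec())
}

/// Hypothetical helper (Windows): builds an `OsString` from 16-bit units.
#[cfg(windows)]
fn os_string_from_wide(units: &[u16]) -> OsString {
    use std::os::windows::ffi::OsStringExt;
    OsString::from_wide(units)
}

fn main() {
    #[cfg(unix)]
    {
        // A Latin-1 encoded name: legal for POSIX, not valid UTF-8.
        let name = os_string_from_bytes(b"caf\xE9.txt");
        assert!(name.to_str().is_none());
        // The lossy conversion swaps the bad byte for U+FFFD.
        assert_eq!(name.to_string_lossy(), "caf\u{FFFD}.txt");
    }
    #[cfg(windows)]
    {
        // A lone high surrogate: legal in a Windows name, not valid UTF-16.
        let name = os_string_from_wide(&[0x0061, 0xD83D]);
        assert!(name.to_str().is_none());
    }
}
```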

Not all Strings are alike

String/&str/"hi"
- use this by default
Vec<u8>/&[u8]/b"hi"
- use for exchanging data with 8-bit / ASCII systems
OsString/OsStr
- use for exchanging data with your Operating System
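
A rough guide to moving between the three families, using only standard-library conversions. The wrapper `os_to_string` is a made-up name. The pattern to notice is that going towards `String` is the fallible direction.

```rust
use std::ffi::{OsStr, OsString};

/// Hypothetical helper: converts an OS string to a `String`,
/// handing the original back unchanged if it is not valid Unicode.
fn os_to_string(os: OsString) -> Result<String, OsString> {
    os.into_string()
}

fn main() {
    let text: String = "hi".to_owned();

    // String -> bytes always works; bytes -> String must be checked.
    let bytes: Vec<u8> = text.clone().into_bytes();
    assert_eq!(bytes, b"hi");
    assert_eq!(String::from_utf8(bytes).unwrap(), text);

    // String -> OsString always works; OsString -> String must be checked.
    let os: OsString = OsString::from(text.clone());
    let borrowed: &OsStr = OsStr::new("hi");
    assert_eq!(os, borrowed);
    assert_eq!(os_to_string(os), Ok(text));
}
```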

C Strings?

CString/CStr
- use for exchanging data with 8-bit C APIs
- null-terminated
- might not be UTF-8
https://docs.rs/widestring/
- use for exchanging data with 'wide' C APIs


Questions?