Emoji such as 👨👩👧👧 or 🇨🇦, or emoji with skin tones such as 👩🏾💻, use a combination of multiple Unicode code points to form a single glyph. To achieve this in the font, Apple Emoji uses ligatures, with data stored in the morx
table (which is specific to the Apple Advanced Typography spec)
Ligatures transform multiple glyphs into another single glyph. This glyph may not map back to any specific Unicode code point, but it may index into the sbix
table to give us the image we need.
For example, with the family: man, woman, girl, girl
emoji 👨👩👧👧:
- The Unicode code point sequence used to construct the emoji is
1F468 200D 1F469 200D 1F467 200D 1F467
- The corresponding glyph IDs are
1062, 43, 1164, 43, 1056, 43, 1056
- These glyph IDs are transformed through the ligature process into a single glyph ID,
1280
- The glyph
1280
maps to the image you see above in thesbix
table
State tables and state machines [reference]
The font uses state tables to represent a finite state machines, which is what processes sequences of glyphs into another glyph as a ligature.
Consider the Canadian flag emoji 🇨🇦. It's constructed from a sequence of 2 glyphs, the regional indicator C glyph and regional indicator A glyph (we'll refer to them as C
and A
throughout this doc, though they're actually 🇨
and 🇦
).
The state machine begins at a START
state, then transitions to a new SEEN-C
state when it sees C
, and from there, resolves the ligature if it sees A
as the next glyph.
Because there are a large number of possible sequences of emoji forming ligatures, it's inefficient to have unique transitions per glyph; instead, similar sets of glyphs are grouped into "classes" before being processed again. In the state machine representation in the font, all transitions are done through these classes (even if they sometimes only contain a single glyph).
Representation in the font (the morx
table) [reference]
The morx
table is split up into chains (in the current Apple Emoji font, there is only 1 chain to process), which is then split into the feature subtables (unused here) and actual subtables; of those, we are interested in the ligature subtables (type 2).
Each ligature subtable stores the representation of a state machine, which comprises the following tables (whose offsets are given by the subtable header):
- Class table: maps glyph IDs to classes
- State array: maps a given state and class to the index of the corresponding item in the entry table
- Entry table: contains information about the next state to transition to, and any actions that need to be taken
- Ligature action table: information on what action to take if specified by the entry table
- Component table: contains values that are accumulated and added up as the ligature is processed; the added result is then used as an index into the ligature table
- Ligature table: maps component results to the glyph ID of the final ligature
Class lookup tables [reference]
Class lookup tables map glyph IDs to the classes used to determine which transition to take. There are a few formats of lookup tables, 3 of which are used in the Apple Emoji font:
- Segment Array (Format 4): an array of segments covering
[startGlyph, endGlyph]
and mapping them to their respective classes in an array - Single Table (Format 6): an array of classes indexed directly by glyph ID
- Trimmed Array (Format 8): similar to the single table, but the first item in the array maps from
firstGlyph
instead of 0
The state array is a 2D array with the state ID as the row index, and the class as the column index. For example, if we are currently at state 5 and the next glyph has class 11, the value we would read is stateArray[5][11]
. The 2D array is stored in row-major order. The value read is the index into the entry table.
The entry table is an array of state entries, containing the next state to transition to as well as flags to determine whether additional actions need to be taken:
setComponent
(0x8000
) — Push the current glyph onto a stack (the "component stack") for processing later ondontAdvance
(0x4000
) — Go to the next state, but not the next glyphperformAction
(0x2000
) — Process the stored component stack into a ligature using the instructions in the ligature action table starting at indexligActionIndex
The ligature action contains instructions for processing a component stack (list of glyph IDs).
⚠️ Note: The index into the action table is the starting index of the instructions. If the component stack has 5 glyphs, the instructions are the 5 actions starting from the given index.
Each action contains couple of flags in addition to an offset
field which is used to pre-process the glyph ID. The algorithm to run through the actions is:
- Initialize the component accumulator to 0
- Iterate through the component stack. For each glyph ID, and its corresponding action:
- Compute the offset (
action & 0x3FFFFFFF
, then sign extend to 32 bits) - Add the offset to the glyph ID
- Use the result to the index into the component table to get the component value
- Add the component value to the component accumulator
- If the action has either the
last
orstore
flags, break the loop
- Compute the offset (
- Use the value of the component accumulator to index into the ligature table to get the final glyph ID
The component table is a simple array mapping the offseted glyph ID to component value.
The ligature table is an array that maps component accumulator sums to ligature glyph IDs.
Let's use 🤷🏽 (person shrugging: medium skin tone
) as an example, which is made up of code points U+1F937 U+1F3FD
, and assume the cmap
table tells us these are glyph IDs 2174
and 879
, respectively. We'll lay out the tables, then run through the algorithm.
Name | Value | Description |
---|---|---|
format |
4 |
Format 4 (segment array) lookup table |
binSrchHeader.unitSize |
6 |
Not used (used for binary searching the class table) |
binSrchHeader.nUnits |
2 |
The number of segments that follow |
binSrchHeader.searchRange |
12 |
Not used |
binSrchHeader.entrySelector |
1 |
Not used |
binSrchHeader.rangeShift |
0 |
Not used |
segments[0].lastGlyph |
2174 |
Last glyph of segment 0 is the Person Shrugging glyph |
segments[0].firstGlyph |
1881 |
First glyph of segment 0 is the Person Frowning glyph |
segments[0].value |
24 |
Start of the values array for segment 0 is at offset 14 |
segments[1].lastGlyph |
881 |
Last glyph of segment 0 is the dark skin tone modifier |
segments[1].firstGlyph |
877 |
First glyph of segment 0 is the light skin tone modifier |
segments[1].value |
610 |
Start of the values array for segment 1 is at offset 600 |
segments[0].values |
[3, 3, ..., 3] |
All glyphs from 1881 to 2174 have class 3 |
segments[1].values |
[5, 5, 5, 5] |
All skin tone modifiers have class 5 |
Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | |
---|---|---|---|---|---|---|
State 0 | 0 | 6 | 5 | 0 | 0 | 0 |
State 1 | 0 | 4 | 0 | 1 | 4 | 0 |
State 2 | 13 | 12 | 0 | 0 | 3 | 2 |
Name | Value | Description |
---|---|---|
entry[0].nextStateIndex |
0 |
Entry 0: goto state 0 |
entry[0].entryFlags |
0 |
Entry 0: no flags |
entry[0].ligActionIndex |
0 |
Entry 0: no action |
entry[1].nextStateIndex |
2 |
Entry 1: goto state 2 |
entry[1].entryFlags |
0x8000 |
Entry 1: flag to set a component |
entry[1].ligActionIndex |
0 |
Entry 1: no action |
entry[2].nextStateIndex |
0 |
Entry 2: goto state 0 |
entry[2].entryFlags |
0xA000 |
Entry 2: flag to set a component and also run action |
entry[2].ligActionIndex |
0 |
Entry 2: start action at index 0 of the action table |
Name | Value |
---|---|
action[0] |
0x3FFFFC93 |
action[1] |
0xBFFFF786 |
Name | Value |
---|---|
component[0] |
0 |
component[1] |
1 |
component[2] |
2 |
component[3] |
5 |
component[4] |
6 |
Name | Value |
---|---|
ligature[0] |
153 |
ligature[1] |
154 |
ligature[2] |
155 |
ligature[3] |
156 |
ligature[4] |
1883 |
ligature[5] |
1884 |
ligature[6] |
1885 |
ligature[7] |
1886 |
ligature[8] |
1887 |
We start with glyph IDs [2174, 879]
. First, we'll need to find out which classes they belong to.
Looking at the class lookup table, we see that glyph 2174
falls within the range of segment 0, and has class 3. Similarly, glyph 879
is contained within segment 1, and has class 5.
The starting state index is 1, and our first glyph is class 3, so we find that cell in the 2D state array above, which has a value of 1. This tells us to look at entry 1 in the entry table, which as a nextStateIndex
of 2, and a flag of 0x8000
tells us to set a component. Our component stack is now [2174]
and we're on state index 2.
Indexing into the state array again with index 2 and class 5, we find a value of 2. The state entry with index 2 has a nextStateIndex
of 0 and flags 0xA000
. Since 0xA000
includes flag 0x8000
, we need to push this component to the stack as well, which is now [879, 2174]
. 0xA000
also includes flag 0x2000
, which means we need to perform the action specified by ligActionIndex
, which in this case is index 0
.
The action at index 0
is 0x3FFFFC93
. Sign extending this to 32 bits, we get -877
, and adding this to the first glyph in our stack results in 2
. Looking at the component table, index 2 is 2
, which we set as our component accumulator. The action doesn't have the last
or store
flag, so we move on to the next action.
action[1]
is 0xBFFFF786
. Sign extending the masked portion 0xBFFFF786 & 0x3FFFFFFF
gives -2170
, which we add to our next glyph in the stack to get 4
. component[4]
is 6
, and we add this to the accumulator, which gives the new value of 8
. This action has the last
flag, so we can look up the accumulator value in the ligature table. ligature[8]
is 1887
, and this is our resolved ligature glyph value.