Add Unicode Registry definition #846

eemeli · 2024-07-30T10:23:03Z

Closes #452
See also #845

Following the registry separations proposed in #634, this PR adds a new folder spec/registry/ and moves the current default function definitions there as default.md, as well as adding a new u: Unicode Registry as unicode.md.

Initially, this includes only the u:id, u:locale, and u:dir definitions from the design doc.

A new function context definition is added to Function Resolution, to allow for it to be affected by the u: options. This is intended to encapsulate the parts of the formatting context that are made available when calling functions.

Regarding the options, the text is almost directly as in the design doc. For u:locale, I specified that literal values matching the langtag rule from RFC 5646 are always supported, but that other tags and locale definitions MAY also be supported. This is roughly in line with what we have for e.g. :datetime operands, where we define something relatively strict that will work ~everywhere, while allowing other values beyond that definition to also be processed without error.

Note that u:locale and u:dir are ignored when used on markup.
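
As a rough illustration of the behaviour described above, here is a minimal Python sketch of how `u:locale` and `u:dir` might replace values in a function context while being ignored on markup. The `FunctionContext` class, the `apply_u_options` helper, and the accepted `u:dir` values are assumptions made for this example; none of them are defined by the PR.

```python
from dataclasses import dataclass, replace
from typing import Mapping, Tuple


@dataclass(frozen=True)
class FunctionContext:
    """Hypothetical stand-in for the spec's function context."""
    locales: Tuple[str, ...]  # prioritized locale list
    dir: str                  # base directionality


def apply_u_options(
    context: FunctionContext,
    options: Mapping[str, str],
    is_markup: bool = False,
) -> FunctionContext:
    """Return the context that a function call on this expression would see."""
    if is_markup:
        # u:locale and u:dir are ignored when used on markup.
        return context
    updated = context
    locale = options.get("u:locale")
    if locale is not None:
        tags = tuple(tag.strip() for tag in locale.split(",") if tag.strip())
        updated = replace(updated, locales=tags)
    direction = options.get("u:dir")
    # Assumption: these three values are accepted; anything else is skipped.
    if direction in ("ltr", "rtl", "auto"):
        updated = replace(updated, dir=direction)
    return updated
```

The override applies to one expression only: the original context object is left unchanged, so later expressions in the same message still see the message-level locale and direction.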

A new test suite unicode.json is added; this also ends up testing our Default Bidi Strategy.

aphillips

Good start. Mostly word-smithing suggestions...

aphillips · 2024-07-30T15:15:01Z

spec/README.md

@@ -16,7 +16,8 @@
   1. [Data Model Errors](errors.md#data-model-errors)
   1. [Resolution Errors](errors.md#resolution-errors)
   1. [Message Function Errors](errors.md#message-function-errors)
-1. [Default Function Registry](registry.md)
+1. [Default Function Registry](registry/default.md)
+1. [`u:` Unicode Registry](registry/unicode.md)


My design document is calling this the "Unicode Reserved Namespace" and "Unicode Reserved Namespace Registry", since, technically, all of the registries belong to Unicode.

I'd be completely fine renaming this later, e.g. when the design doc is actually approved. There's also the discussion in #677 about the "registry" term we ought to return to at some point.

aphillips · 2024-07-30T15:15:40Z

spec/formatting.md

+
+   If the resolved mapping of _options_ includes any `u:` options
+   supported by the implementation,
+   process them as specified in the [Unicode Registry](/spec/registry/unicode.md).


Unicode Registry => Unicode Reserved Namespace registry

aphillips · 2024-07-30T15:18:31Z

spec/formatting.md

+     potentially including a fallback chain of locales.
+   - The base directionality of the _message_ and its _text_ tokens.
+
+   If the resolved mapping of _options_ includes any `u:` options


This seems like special pleading on our part. The u: namespace is "just a namespace"?

Perhaps:

Suggested change

-If the resolved mapping of _options_ includes any `u:` options
+Implementations are encouraged to support _options_ defined in
+the Unicode Reserved Namespace (`u:`).

It's not "just a namespace", because it needs special powers to affect the function context, though.

I think that the "function context" distinction is an implementation detail.

In most formatter implementations, the locale and the formatting options are passed to the constructor as parameters at the same time.

I would rather look at this as "universal function parameters" that might be recognized and honored by several / all functions.

I think the problem here is that the current set of u: options behaves that way, but we might introduce one in the future that does not affect the function context. If our intention is to require that all such options be context-affecting, we should set that as a requirement in the design doc and elsewhere. We might need to contemplate an additional namespace in the future as well, although I can't think of any universal options just at the moment that we wouldn't just put in the default namespace.

Note: I am not disagreeing with doing this. Just making sure we're consistent and clear about it.


aphillips · 2024-07-30T15:19:28Z

spec/registry/unicode.md

@@ -0,0 +1,63 @@
+# MessageFormat 2.0 Unicode Registry


Suggested change

-# MessageFormat 2.0 Unicode Registry
+# MessageFormat 2.0 Unicode Reserved Namespace Registry

aphillips · 2024-07-30T15:33:51Z

spec/registry/unicode.md

+or an implementation-defined list of such tags.
+
+Replaces the _locale_ defined in the _function context_ for this _expression_.
+The value is ignored when set on _markup_.


Why? Why not leave it up to the implementer?

(Note: this comment is about line 28)

Because that could introduce surprising differences in implementation behaviour and/or require us to define how markup open/close matching works. Consider this example, formatted as en:

French: {#span u:locale=fr}{$n :number}{/span} English: {$n :number}

If we don't explicitly ignore u:locale and u:dir on markup, I would count it as likely that some implementations would format the first number in English, and others in French.

If that is desirable behaviour, then we would need to define open/close matching to ensure that the second number is still formatted in English.

Wouldn't the u:locale only affect the markup placeholder? Why would it affect the enclosed $n placeholder?

It's something that at least @mihnita has expressed some interest in, and if we leave it up to implementations, that becomes a possible interpretation of what happens when spanning markup has u: options set on it.

In your example messages, the difference between the two is some markup. The formatting function should still format using the invoking (contextual) locale. For example, the invoking locale might be fr-CA-u-nu-cans and you wouldn't want the markup's lang attribute to remove the additional formatting (the example is contrived: I don't believe that the Cans script has its own digits). If you wanted to trim the locale, you should put it on the :number invocation. Otherwise markup can cause the formatting to change, which is spooky--and unlike anything else in MF2.


aphillips · 2024-07-30T15:44:11Z

spec/registry/unicode.md

+Implementations MAY support additional language tags,
+such as private-use or grandfathered tags,
+or tags using `_` instead of `-` as a separator.
+When the value of `u:locale` is set by a _variable_,
+implementations MAY support non-string values otherwise representing locales.


Some problems here. Private-use and grandfathered tags are already covered by both the langtag production and BCP47 in general.

I think you're trying to do several things here and we should split them apart:

Suggested change

-Implementations MAY support additional language tags,
-such as private-use or grandfathered tags,
-or tags using `_` instead of `-` as a separator.
-When the value of `u:locale` is set by a _variable_,
-implementations MAY support non-string values otherwise representing locales.
+Implementations MAY process the list of locale identifiers,
+including interpretation of language tags or canonicalization of values
+(such as mapping grandfathered tags to modern representations).
+Implementations MAY convert or map the list to a prioritized list of
+implementation-specific values.
+For example, a JavaScript implementation might convert it into an `Intl.Locale`
+while a Java implementation might convert the list into an array of `java.util.Locale` objects.
+Implementations MAY reject unrecognized or unsupported values.
+Implementations MAY support proprietary or implementation-specific
+locale identifiers.

I updated the previous paragraph to note that as we're specifically referring to the langtag rule and not the Language-Tag rule, we're not requiring support for private-use or grandfathered tags.

The intended point of this passage is to make sure that values not matching u-locale-option may still be accepted, and to explicitly call out the extensions that are required for either BCP47 or ULI support.

I don't think this is the right move. We don't need to get into the locale identifier game. I think referencing langtag instead of Language-Tag is a backwards step.

Do you not agree with the various changes I suggested?

I agree that we should not get into the locale identifier game. Much as we did with :datetime literals, I think we should identify a minimal definition that must be supported, and allow implementations to go beyond that. And I think the langtag rule offers that minimum, i.e. something that can be supported by both a BCP47 and a ULI implementation, where each of those has features unsupported by the other.
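
To make the "minimal definition" concrete, the following is a sketch of a well-formedness check against the `langtag` production of RFC 5646, written as a Python regular expression. The names `LANGTAG` and `is_langtag` are made up for this example. It checks syntax only, not validity against the subtag registry, and it is not taken from the PR or from any implementation.

```python
import re

# Transcription of the RFC 5646 `langtag` ABNF rule (syntax only).
LANGTAG = re.compile(
    r"""
    (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4}|[a-z]{5,8})  # language, optional extlang
    (?:-[a-z]{4})?                                        # script
    (?:-(?:[a-z]{2}|[0-9]{3}))?                           # region
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*              # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*                  # extensions
    (?:-x(?:-[a-z0-9]{1,8})+)?                            # private-use suffix
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_langtag(value: str) -> bool:
    """True if `value` is well-formed per the `langtag` rule."""
    return LANGTAG.fullmatch(value) is not None
```

Under this rule, tags such as `en-US`, `zh-Hant-TW` and `fr-CA-u-nu-cans` are accepted, while `en_US`, the grandfathered tag `i-klingon`, and a private-use-only tag such as `x-whatever` are not. Those are the cases the thread leaves to implementations that choose to go beyond the minimum.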

aphillips · 2024-07-30T15:45:13Z

spec/registry/unicode.md

+
+Replaces the base directionality defined in
+the _function context_ for this _expression_.
+The value is ignored when set on _markup_.


See thread above on u:locale.

Shouldn't I be able to write:

I like {#html:strong u:dir=ltr}ASCII{/html:strong} as long as it's UTF-8.

And have it produce:

I like <strong dir=ltr>ASCII</strong> as long as it's UTF-8.

You should write instead:

I like {#html:strong dir=ltr}ASCII{/html:strong} as long as it's UTF-8.

as the direction you're seeking to control is the direction of the contents of the <strong> rather than the tag itself. This is rather similar to my example above in #846 (comment):

{#span u:locale=fr}{$n :number}{/span}

If we allow u:dir and u:locale to be used as stand-ins for the HTML dir and lang attributes, we introduce unnecessary confusion about whether or not these options apply to the span between the open and close elements.
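
The distinction can be sketched in Python as follows, assuming a hypothetical `render_open_tag` helper that turns markup options into HTML attributes. The helper, the `IGNORED_ON_MARKUP` set, and the quoted attribute output are all assumptions for illustration; the spec does not define how markup is rendered.

```python
from html import escape
from typing import Mapping

# Per this PR, these two options are ignored when set on markup.
IGNORED_ON_MARKUP = frozenset({"u:locale", "u:dir"})


def render_open_tag(name: str, options: Mapping[str, str]) -> str:
    """Render an opening tag, passing markup options through as attributes."""
    attrs = [
        f' {key}="{escape(value, quote=True)}"'
        for key, value in options.items()
        if key not in IGNORED_ON_MARKUP
    ]
    return f"<{name}{''.join(attrs)}>"
```

With this helper, `{"dir": "ltr"}` yields `<strong dir="ltr">`, whereas `{"u:dir": "ltr"}` yields a plain `<strong>`. The plain `dir` option is ordinary markup data, while `u:dir` on markup has no effect.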

macchiati

We should be referencing the Unicode BCP 47 locale identifier, not RFC 5646. (In particular, we don't want to force people to support extlang.)

aphillips · 2024-09-16T16:32:47Z

@macchiati Can you be specific? I think the 5646 reference is to the langtag grammar. We can (and should) refer to Unicode Locale for locale stuff.

macchiati · 2024-09-16T18:08:44Z

Here is a reference. https://cldr-smoke.unicode.org/spec/main/ldml/tr35.html#BCP_47_Language_Tag_Conversion

I've felt for a while that we need to hoist the Unicode BCP 47 locale identifier definition, I think we could do that in this release.

aphillips · 2024-09-16T18:23:23Z

That's not what I mean. Where in the text does it need to change?

macchiati · 2024-09-17T00:53:33Z

spec/registry/unicode.md

+A comma-delimited list of BCP 47 language tags,
+or an implementation-defined list of such tags.


(just edited for #846 (comment))
It needs stronger language than "which are expected to be".

Suggested change

-A comma-delimited list of BCP 47 language tags,
-or an implementation-defined list of such tags.
+A comma-delimited prioritized list of [Unicode CLDR locale identifiers](https://www.unicode.org/reports/tr35/#BCP_47_Conformance).
+Unicode CLDR locale identifiers MUST be well-formed and SHOULD be valid.
+Note that **well-formed** Unicode CLDR locale identifiers are also **well-formed** BCP47 language tags,
+and **valid** Unicode CLDR locale identifiers are also **valid** BCP47 language tags.

Addison, you might also add what you want to say about the implementation-defined way. I'm guessing you mean that with a non-literal value like u:locale=$X, $X could be represented internally in different ways depending on the implementation of message format.

aphillips · 2024-09-17T01:18:49Z

@macchiati noted:

I think allowing "implementation-defined locale identifiers" is dangerous,

The text doesn't allow that. It allows the list to be formatted in an implementation-defined way, e.g. a List<String> or Intl.Locale[]. This might be phrased better to make it clear, though.
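
A minimal Python sketch of that reading: the option value may arrive either as a comma-delimited literal or as an implementation-defined list, and both are normalized to the same prioritized list of tags. The function name `normalize_locale_list` is hypothetical, and the sketch does not validate the tags.

```python
from typing import Iterable, List, Union


def normalize_locale_list(value: Union[str, Iterable[str]]) -> List[str]:
    """Turn a u:locale value into a prioritized list of tag strings."""
    if isinstance(value, str):
        # Literal form: a comma-delimited list of tags.
        parts = value.split(",")
    else:
        # Variable form: an implementation-defined list of tags.
        parts = list(value)
    # Order is preserved because the list is prioritized.
    return [tag.strip() for tag in parts if tag.strip()]
```

Both `"fr-CA, fr, en"` and `["fr-CA", "fr", "en"]` give the same result here; only the container differs between the two forms, not the identifiers.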

I think the are-also formulation is overkill. We can just say "valid CLDR locale identifiers", although that places some burden on implementations that can guarantee langtag but don't have a local implementation of the CLDR rules. Well-formed tags might be acceptable in certain cases.

eemeli added 2 commits July 30, 2024 12:35

Move spec/registry.md -> spec/registry/default.md

0117fee

Add Unicode Registry definition

f969ab6

eemeli added registry specification formatting test-suite labels Jul 30, 2024

aphillips reviewed Jul 30, 2024


Refer to BCP47, add note about only requiring normal tags

3a3d4ab

eemeli mentioned this pull request Aug 19, 2024

Accept attributes design & remove spec note #845

Merged

macchiati requested changes Sep 16, 2024


macchiati requested changes Sep 17, 2024




