Add Unicode Registry definition #846

Open
wants to merge 3 commits into
base: main

eemeli
Collaborator

@eemeli eemeli commented Jul 30, 2024

Closes #452
See also #845

Following the registry separations proposed in #634, this PR adds a new folder spec/registry/ and moves the current default function definitions there as default.md, as well as adding a new u: Unicode Registry as unicode.md.

Initially, this includes only the u:id, u:locale, and u:dir definitions from the design doc.

A new function context definition is added to Function Resolution, to allow for it to be affected by the u: options. This is intended to encapsulate the parts of the formatting context that are made available when calling functions.

Regarding the options, the text is almost directly as in the design doc. For u:locale, I specified that literal values matching the langtag rule from RFC 5646 are always supported, but that other tags and locale definitions MAY also be supported. This is roughly in line with what we have for e.g. :datetime operands, where we define something relatively strict that will work ~everywhere, while allowing other values beyond that definition to also be processed without error.

Note that u:locale and u:dir are ignored when used on markup.
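
As a rough illustration of the function context idea, here is a minimal Python sketch. `FunctionContext` and `apply_u_options` are hypothetical names that appear nowhere in the spec or in any implementation, and the accepted `u:dir` values (`ltr`, `rtl`, `auto`) are an assumption made for the example.

```python
from dataclasses import dataclass, replace
from typing import Mapping, Tuple


@dataclass(frozen=True)
class FunctionContext:
    """Hypothetical holder for what a function sees when it is called."""
    locales: Tuple[str, ...]  # prioritized list of locale identifiers
    dir: str                  # base directionality (assumed values below)


def apply_u_options(ctx: FunctionContext,
                    options: Mapping[str, str],
                    is_markup: bool = False) -> FunctionContext:
    """Return the context that applies to one placeholder."""
    if is_markup:
        # u:locale and u:dir are ignored when used on markup.
        return ctx
    if "u:locale" in options:
        # Comma-delimited list; tolerating whitespace is an assumption.
        tags = tuple(t.strip() for t in options["u:locale"].split(",") if t.strip())
        if tags:
            ctx = replace(ctx, locales=tags)
    if options.get("u:dir") in ("ltr", "rtl", "auto"):  # assumed value set
        ctx = replace(ctx, dir=options["u:dir"])
    return ctx
```

With a base context of `FunctionContext(("en",), "ltr")`, an expression carrying `u:locale=fr` would see `("fr",)` as its locales, while the same option on markup would leave the context unchanged.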

A new test suite unicode.json is added; this also ends up testing our Default Bidi Strategy.

Member

@aphillips aphillips left a comment

Good start. Mostly word-smithing suggestions...

@@ -16,7 +16,8 @@
 1. [Data Model Errors](errors.md#data-model-errors)
 1. [Resolution Errors](errors.md#resolution-errors)
 1. [Message Function Errors](errors.md#message-function-errors)
-1. [Default Function Registry](registry.md)
+1. [Default Function Registry](registry/default.md)
+1. [`u:` Unicode Registry](registry/unicode.md)
Member

My design document is calling this the "Unicode Reserved Namespace" and "Unicode Reserved Namespace Registry", since, technically, all of the registries belong to Unicode.

Collaborator Author

I'd be completely fine renaming this later, e.g. when the design doc is actually approved. There's also the discussion in #677 about the "registry" term we ought to return to at some point.


If the resolved mapping of _options_ includes any `u:` options
supported by the implementation,
process them as specified in the [Unicode Registry](/spec/registry/unicode.md).
Member

Unicode Registry => Unicode Reserved Namespace registry

potentially including a fallback chain of locales.
- The base directionality of the _message_ and its _text_ tokens.

If the resolved mapping of _options_ includes any `u:` options
Member

This seems like special pleading on our part. The u: namespace is "just a namespace"?

Perhaps:

Suggested change
-If the resolved mapping of _options_ includes any `u:` options
+Implementations are encouraged to support _options_ defined in
+the Unicode Reserved Namespace (`u:`).

Collaborator Author

It's not "just a namespace", because it needs special powers to affect the function context, though.

Collaborator

I think that the "function context" distinction is an implementation detail.

In most formatter implementations, the locale and the formatting options are passed to the constructor as parameters at the same time.

I would rather look at this as "universal function parameters" that might be recognized and honored by several / all functions.

Member

I think the problem here is that the current set of u: options behaves that way, but we might introduce one in the future that doesn't affect the function context? If our intention is to require that all such options be context-affecting, we should set that as a requirement in the design doc and elsewhere. We might need to contemplate an additional namespace in the future as well, although I can't think of any universal options just at the moment that we wouldn't just put in the default namespace.

Note: I am not disagreeing with doing this. Just making sure we're consistent and clear about it.

@@ -0,0 +1,63 @@
# MessageFormat 2.0 Unicode Registry
Member

Suggested change
-# MessageFormat 2.0 Unicode Registry
+# MessageFormat 2.0 Unicode Reserved Namespace Registry

or an implementation-defined list of such tags.

Replaces the _locale_ defined in the _function context_ for this _expression_.
The value is ignored when set on _markup_.
Member

Why? Why not leave it up to the implementer?

Member

(Note: this comment is about line 28)

Collaborator Author

Because that could introduce surprising differences in implementation behaviour and/or require us to define how markup open/close matching works. Consider this example, formatted as en:

French: {#span u:locale=fr}{$n :number}{/span}
English: {$n :number}

If we don't explicitly ignore u:locale and u:dir on markup, I would count it as likely that some implementations would format the first number in English, and others in French.

If that is desirable behaviour, then we would need to define open/close matching to ensure that the second number is still formatted in English.
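
To make the two readings concrete, here is a small Python sketch. The part model and the `locales_seen` function are hypothetical and greatly simplified; they only track which locale each expression would be formatted with under each interpretation.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified part model: (kind, name, options).
Part = Tuple[str, str, Dict[str, str]]

# French: {#span u:locale=fr}{$n :number}{/span}
# English: {$n :number}
PARTS: List[Part] = [
    ("open", "span", {"u:locale": "fr"}),
    ("expr", "n", {}),
    ("close", "span", {}),
    ("expr", "n", {}),
]


def locales_seen(parts: List[Part], base: str, scoped: bool) -> List[str]:
    """Return the locale used for each expression, in order.

    scoped=False: u:locale on markup is ignored (the rule proposed here).
    scoped=True:  u:locale on markup applies until the matching close,
                  which needs a stack to pair open and close elements.
    """
    stack = [base]
    seen: List[str] = []
    for kind, _name, options in parts:
        if kind == "open" and scoped:
            stack.append(options.get("u:locale", stack[-1]))
        elif kind == "close" and scoped and len(stack) > 1:
            stack.pop()
        elif kind == "expr":
            # An expression's own u:locale would still take precedence.
            seen.append(options.get("u:locale", stack[-1]))
    return seen
```

Formatted as `en`, the ignoring rule gives `["en", "en"]`, while the scoped reading gives `["fr", "en"]` and only restores English for the second number because the close element is matched against its open.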

Member

Wouldn't the u:locale only affect the markup placeholder? Why would it affect the enclosed $n placeholder?

Collaborator Author

It's something that at least @mihnita has expressed some interest in, and if we leave it up to implementations, that becomes a possible interpretation of what happens when spanning markup has u: options set on it.

Member

In your example messages, the difference between the two is some markup. The formatting function should still format using the invoking (contextual) locale. For example, the invoking locale might be fr-CA-u-nu-cans and you wouldn't want the markup's lang attribute to remove the additional formatting (the example is contrived: I don't believe that the Cans script has its own digits). If you wanted to trim the locale, you should put it on the :number invocation. Otherwise markup can cause the formatting to change, which is spooky--and unlike anything else in MF2.

Comment on lines +39 to +43
Implementations MAY support additional language tags,
such as private-use or grandfathered tags,
or tags using `_` instead of `-` as a separator.
When the value of `u:locale` is set by a _variable_,
implementations MAY support non-string values otherwise representing locales.
Member

Some problems here. Private-use and grandfathered tags are already covered by both the langtag production and BCP47 in general.

I think you're trying to do several things here and we should split them apart:

Suggested change
-Implementations MAY support additional language tags,
-such as private-use or grandfathered tags,
-or tags using `_` instead of `-` as a separator.
-When the value of `u:locale` is set by a _variable_,
-implementations MAY support non-string values otherwise representing locales.
+Implementations MAY process the list of locale identifiers,
+including interpretation of language tags or canonicalization of values
+(such as mapping grandfathered tags to modern representations).
+Implementations MAY convert or map the list to a prioritized list of
+implementation-specific values.
+For example, a JavaScript implementation might convert it into an `Intl.Locale`
+while a Java implementation might convert the list into an array of `java.util.Locale` objects.
+Implementations MAY reject unrecognized or unsupported values.
+Implementations MAY support proprietary or implementation-specific
+locale identifiers.

Collaborator Author

I updated the previous paragraph to note that as we're specifically referring to the langtag rule and not the Language-Tag rule, we're not requiring support for private-use or grandfathered tags.

The intended point of this passage is to make sure that values not matching u-locale-option may still be accepted, and to explicitly call out the extensions that are required for either BCP47 or ULI support.

Member

I don't think this is the right move. We don't need to get into the locale identifier game. I think referencing langtag instead of Language-Tag is a backwards step.

Do you not agree with the various changes I suggested?

Collaborator Author

I agree that we should not get into the locale identifier game. Much as we did with :datetime literals, I think we should identify a minimal definition that must be supported, and allow for implementations to go beyond that. And I think the langtag rule offers that minimum, i.e. something that can be supported by both a BCP47 and a ULI implementation, where both of those have features unsupported by the other.
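
As a sketch of what that minimum could look like in practice, here is a Python check against the `langtag` production. The regular expression is my transcription of the RFC 5646 ABNF and should be verified against the RFC before being relied on; `is_langtag` and `split_u_locale` are hypothetical names, and tolerating whitespace around the commas is an assumption.

```python
import re
from typing import List, Tuple

# langtag only: grandfathered tags and private-use-only tags are excluded.
_LANGTAG = re.compile(
    r"""
    (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4}|[a-z]{5,8})  # language, extlang
    (?:-[a-z]{4})?                                        # script
    (?:-(?:[a-z]{2}|[0-9]{3}))?                           # region
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*              # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*                  # extensions
    (?:-x(?:-[a-z0-9]{1,8})+)?                            # private-use suffix
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_langtag(tag: str) -> bool:
    """True if the whole string matches the langtag production."""
    return _LANGTAG.fullmatch(tag) is not None


def split_u_locale(value: str) -> Tuple[List[str], List[str]]:
    """Split a comma-delimited u:locale literal into two lists.

    The first list holds tags matching langtag (always supported).
    The second holds everything else, which an implementation MAY
    still choose to accept.
    """
    always: List[str] = []
    other: List[str] = []
    for item in value.split(","):
        tag = item.strip()
        if tag:
            (always if is_langtag(tag) else other).append(tag)
    return always, other
```

For instance, `fr-CA-u-nu-cans` and `zh-Hant-TW` fall in the first list, while `en_US` and `i-klingon` fall in the second, where handling is left to the implementation.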


Replaces the base directionality defined in
the _function context_ for this _expression_.
The value is ignored when set on _markup_.
Member

Why?

Collaborator Author

See thread above on u:locale.

Member

Shouldn't I be able to write:

I like {#html:strong u:dir=ltr}ASCII{/html:strong} as long as it's UTF-8.

And have it produce:

I like <strong dir=ltr>ASCII</strong> as long as it's UTF-8.

Collaborator Author

You should write instead:

I like {#html:strong dir=ltr}ASCII{/html:strong} as long as it's UTF-8.

as the direction you're seeking to control is the direction of the contents of the <strong> rather than the tag itself. This is rather similar to my example above in #846 (comment):

{#span u:locale=fr}{$n :number}{/span}

If we allow u:dir and u:locale to be used as stand-ins for the HTML dir and lang attributes, we introduce unnecessary confusion about whether or not these options apply to the span between the open and close elements.

Member

@macchiati macchiati left a comment

We should be referencing the Unicode BCP 47 locale identifier, not RFC 5646. (In particular, we don't want to force people to support extlang.)

@aphillips
Member

@macchiati Can you be specific? I think the 5646 reference is to the langtag grammar. We can (and should) refer to Unicode Locale for locale stuff.

@macchiati
Member

Here is a reference. https://cldr-smoke.unicode.org/spec/main/ldml/tr35.html#BCP_47_Language_Tag_Conversion

I've felt for a while that we need to hoist the Unicode BCP 47 locale identifier definition, I think we could do that in this release.

@aphillips
Member

That's not what I mean. Where in the text does it need to change?

Comment on lines +24 to +25
A comma-delimited list of BCP 47 language tags,
or an implementation-defined list of such tags.
Member

@macchiati macchiati Sep 17, 2024

(just edited for #846 (comment))
It needs stronger language than "which are expected to be".

Suggested change
-A comma-delimited list of BCP 47 language tags,
-or an implementation-defined list of such tags.
+A comma-delimited prioritized list of [Unicode CLDR locale identifiers](https://www.unicode.org/reports/tr35/#BCP_47_Conformance).
+Unicode CLDR locale identifiers MUST be well-formed and SHOULD be valid.
+Note that **well-formed** Unicode CLDR locale identifiers are also **well-formed** BCP47 language tags,
+and **valid** Unicode CLDR locale identifiers are also **valid** BCP47 language tags.

Addison, you might also add what you want to say about the implementation-defined way. I'm guessing you mean that with a non-literal value like u:locale=$X, $X could be represented internally in different ways depending on the MessageFormat implementation.

@aphillips
Member

@macchiati noted:

I think allowing "implementation-defined locale identifiers" is dangerous,

The text doesn't allow that. It allows the list to be formatted in an implementation-defined way, e.g. a List<String> or Intl.Locale[]. This might be phrased better to make it clear, though.

I think the are-also formulation is overkill. We can just say "valid CLDR locale identifiers", although that places some burden on implementations that can guarantee langtag but don't have a local implementation of the CLDR rules. Well-formed tags might be acceptable in certain cases.

Successfully merging this pull request may close these issues.

Constraints on @locale values
4 participants