The ECMAScript Globalization API

November 10, 2011

Why a Globalization API?
Functionality of the API
Identification of Locales, Currencies, and Time Zones
The Globalization Object
Locale Lists
Locale and Parameter Negotiation
Collation
Number Formatting
Date and Time Formatting
Value Errors

A draft of an API specification to enable the globalization of JavaScript applications is now available for review by Ecma TC 39, the standards body that develops ECMAScript, the standard underlying JavaScript. The ECMAScript Globalization API supports collation (string comparison), number formatting, and date and time formatting, and lets applications choose the language and tailor the functionality to their needs. The API was developed by a working group with members from Google, Microsoft, Mozilla, and Amazon; I joined as an invited expert. It’s developed as a separate standard so that the API can be added to JavaScript runtimes that implement the fifth edition of the ECMAScript Language Specification, without waiting for the sixth edition to be completed and implemented.

Like its big brother, the ECMAScript Language Specification, the Globalization API Specification is written largely in the form of pseudocode, which makes it somewhat difficult to understand. This article provides background and a guide to the specification. Keep in mind though that the specification at this point is just a draft; it likely will change significantly before becoming a standard. You can help shape it by sending comments to the ECMAScript discussion mailing list.

Why a Globalization API?

The ECMAScript Language Specification offers little internationalization support. Strings are based on Unicode, which is a good start. There are a few locale sensitive functions, in particular String.prototype.localeCompare, Number.prototype.toLocaleString, and Date.prototype.toLocaleString with its date- and time-only variants. However, none of these functions let applications specify the language or control details of their behavior, so they’re pretty useless in practice. And that’s pretty much it.

Some JavaScript libraries, such as Dojo, Closure, or YUI, have filled in some of the gaps by providing their own number and date formatting as well as mechanisms for loading localized resources. Collation, however, requires large tables and complicated algorithms, and so the typical solution has been to send string lists back to a server with a good internationalization library and have them sorted there. This introduces delays, and doesn’t work at all for code that’s not connected to a server, such as JavaScript-based user interface extensions.

At the same time, ECMAScript implementations, whether they’re part of browsers, servers, or other systems, run on top of operating systems that already include comprehensive internationalization libraries. The goal of the ECMAScript Globalization API is to provide JavaScript applications with a standard interface to these internationalization libraries.

Functionality of the API

The functionality of the first version of the API is fairly limited and mirrors the functionality of the core language: collation (string comparison), number formatting, and date and time formatting. However, it lets applications specify languages, assists with language negotiation, and lets them control details of the behavior. One major omission even in the provided functionality is comprehensive time zone support: not all internationalization libraries offer this yet, so for now only UTC and the “host environment’s current time zone” (in a browser, that’s usually the time zone the user has set for the operating system) are supported.

The general usage pattern for the Globalization API is to create objects for the language sensitive services, requesting locales and other parameters, and then use a compare or format method of the constructed object. This pattern allows for efficient reuse of the completely configured objects, and also allows applications to query the objects for details of their configuration.

Retrofitting the existing locale sensitive methods on String, Number, and Date to take locale and other parameters would allow for a simpler usage pattern, but has to be done in the ECMAScript Language Specification. A proposal exists.

Identification of Locales, Currencies, and Time Zones

The API relies on widely used standards to identify locales and currencies: IETF Best Current Practice (BCP) 47, and ISO 4217.

BCP 47 consists of two RFCs, currently 5646 and 4647, and the IANA Language Subtag Registry. The language tags it defines are used in HTML, CSS, XML, HTTP, and other standard formats and protocols. In most cases, language tags are straightforward: a language code, possibly a script code, possibly a country code, all separated by hyphens: "de" for German, "zh-Hans" for simplified Chinese, "zh-Hans-SG" for simplified Chinese as used in Singapore. However, BCP 47 allows for extensions, and so far one has been defined: The "u" or Unicode extension, defined in RFC 6067 and Unicode Technical Standard 35. This extension allows the specification of additional parameters for collation, number formatting, and date and time formatting as part of a language tag, and so it has to be taken into consideration in the design of language and parameter negotiation in the API, as discussed below. The possibility of parameters that are not related to languages is the reason why the more generic term “locale” is used.

ISO 4217 defines both numeric and alphabetic codes for currencies, of which the Globalization API uses only the alphabetic 3-letter codes. The code list also provides information on the number of digits typically used to represent the minor units of a currency, e.g. 2 for the euro (which has 100 cents), or 0 for the yen (whose minor unit sen isn’t used anymore).
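
To make the role of the minor-unit information concrete, here is a minimal sketch of how a formatter might pick a default number of fraction digits from the currency code. The table holds only a handful of entries, and both the function name defaultFractionDigits and the fallback of 2 for unlisted codes are assumptions made for this illustration, not something the draft specifies.

```javascript
// A few minor-unit digit counts taken from the ISO 4217 code list.
// This is an illustrative excerpt, not the complete list.
var currencyDigits = {
    EUR: 2,
    USD: 2,
    JPY: 0,
    KRW: 0
};

// Hypothetical helper: returns the default number of fraction digits
// for an alphabetic currency code (assumed fallback: 2).
function defaultFractionDigits(currency) {
    var code = String(currency).toUpperCase();
    if (Object.prototype.hasOwnProperty.call(currencyDigits, code)) {
        return currencyDigits[code];
    }
    return 2;
}
```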

For time zones, there isn’t much to identify: This version of the API only supports UTC and the host environment’s current time zone. The first is identified by "UTC". There is no identifier for the second; it’s used when the time zone is left undefined.

The Globalization Object

The Globalization object provides a namespace for the constructors of the Globalization API – we expect that over time the API will add more constructors, and want to minimize the risk of name collisions. Its name started out as Internationalization or its uglified form i18n; we switched to Globalization because it’s shorter. Other candidates were Text, Presentation, Format, World, or 国际化 (nothing beats Chinese for brevity!).

The object is not extensible so that it can become a module once ECMAScript starts supporting those (as currently proposed, modules cannot be extended). Libraries that enhance globalization support therefore have to export the enhanced API themselves rather than patching the Globalization object.

Locale Lists

LocaleList objects contain lists of language tags. The language tags are stored as indexed elements, and a length property provides the number of elements, so that locale lists can be passed to functions that accept generic arrays, such as the functions in the Array prototype object. The constructor verifies that all language tags provided to it are well-formed (i.e., match the grammar of BCP 47 language tags), converts them into their canonical form so that they’re easy to compare and process, and removes duplicates. To ensure consistency, locale lists are immutable.
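
As a rough illustration of what the constructor does, here is a minimal sketch in plain JavaScript. The names makeLocaleList and canonicalizeCase are hypothetical, and the work is deliberately simplified: the well-formedness check covers only a loose subset of the BCP 47 grammar, and canonicalization only normalizes letter case (lowercase language, title case script, uppercase region, lowercase extensions). A real implementation would follow the full grammar and the subtag registry. The sketch also doesn’t fill in the host environment’s locale.

```javascript
// Normalizes the letter case of a language tag following the usual
// BCP 47 conventions. Simplified: no registry-based replacements.
function canonicalizeCase(tag) {
    var afterSingleton = false;
    return tag.split("-").map(function (subtag, index) {
        var lower = subtag.toLowerCase();
        if (index === 0 || afterSingleton) {
            return lower;
        }
        if (lower.length === 1) {
            afterSingleton = true; // extension or private use starts here
            return lower;
        }
        if (lower.length === 4) {
            return lower.charAt(0).toUpperCase() + lower.slice(1); // script
        }
        if (lower.length === 2) {
            return lower.toUpperCase(); // region
        }
        return lower;
    }).join("-");
}

// Hypothetical stand-in for the LocaleList constructor: returns a frozen,
// array-like object with canonicalized tags and duplicates removed.
function makeLocaleList(tags) {
    var wellFormed = /^[a-z]{2,8}(-[a-z0-9]{1,8})*$/i; // loose check only
    var list = {};
    var seen = {};
    var length = 0;
    tags.forEach(function (tag) {
        if (typeof tag !== "string" || !wellFormed.test(tag)) {
            throw new Error("Malformed language tag: " + tag);
        }
        var canonical = canonicalizeCase(tag);
        if (!Object.prototype.hasOwnProperty.call(seen, canonical)) {
            seen[canonical] = true;
            list[length] = canonical;
            length += 1;
        }
    });
    list.length = length;
    return Object.freeze(list);
}
```

Because the result has indexed elements and a length property, generic array functions work on it, for example Array.prototype.join.call(list, ", ").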

Applications do not have to use LocaleList objects; functions that accept them also accept generic arrays with language tags. However, using LocaleList objects can avoid the overhead of verifying and canonicalizing language tags repeatedly.

Applications can obtain the language tag for the host environment’s current locale by constructing a LocaleList object with no argument – the constructor then fills in the host environment’s current locale at index 0.

Locale and Parameter Negotiation

The constructors for the three locale sensitive services, Collator, NumberFormat, and DateTimeFormat, each take a locale list and an options argument. Together, these arguments form a request, which the constructors compare against the capabilities of their implementations to determine the actual locale and parameter settings to be used (a somewhat one-sided “negotiation”). Applications can find out about the result of the negotiation through the resolvedOptions accessor property of the constructed object. The request passed to the API may be fixed by the application (an application that only supports one language will request exactly that language), or may be the result of a higher-level negotiation between user preferences and application capabilities. The API does not deal with such higher-level negotiations.

For language negotiation, BCP 47 provides a simple “Lookup” algorithm, which compares a request consisting of a prioritized list of language tags with a set of available languages. It takes the language tags of the request in sequence and checks for each one whether it can be matched with one of the available languages either directly or with a fallback, where the fallback simply strips off subtags from the end, e.g., from "zh-Hans-SG" to "zh-Hans". The first match wins.
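
The Lookup algorithm is short enough to sketch in a few lines. The function below is a simplified illustration, not the text of the specification: the name lookup is hypothetical, both inputs are assumed to be canonicalized already (so plain string comparison suffices), and there’s no default value when nothing matches.

```javascript
// Simplified sketch of the BCP 47 Lookup algorithm.
// requested: prioritized array of language tags.
// available: array of language tags the implementation supports.
// Returns the first available tag that matches, or undefined.
function lookup(requested, available) {
    for (var i = 0; i < requested.length; i++) {
        var candidate = requested[i];
        while (candidate.length > 0) {
            if (available.indexOf(candidate) !== -1) {
                return candidate;
            }
            // Fallback: strip the last subtag.
            var pos = candidate.lastIndexOf("-");
            if (pos === -1) {
                break; // nothing left to strip, try the next request
            }
            candidate = candidate.substring(0, pos);
            // If a single-character subtag is now last, strip it as well.
            if (pos >= 2 && candidate.charAt(pos - 2) === "-") {
                candidate = candidate.substring(0, pos - 2);
            }
        }
    }
    return undefined;
}
```

With available languages "de", "zh-Hans", and "en", a request for "zh-Hans-SG" followed by "de" resolves to "zh-Hans": the first requested tag matches after one fallback step, so "de" is never considered.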

As mentioned earlier, the Unicode extension of BCP 47 allows a number of parameters to be set as part of a language tag. For example, the use of the euro with a simplified Chinese currency format can be specified as "zh-Hans-u-cu-eur", the use of the sort order for German phone books as "de-u-co-phonebk". Multiple parameters can be combined in one language tag, e.g., "de-u-co-phonebk-cu-usd" for German with phone book sorting and the U.S. dollar as currency.
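
To show how such tags break down, here is a small sketch that separates the Unicode extension from the rest of a language tag and collects its key and value pairs. The function name splitUnicodeExtension and the shape of the returned object are made up for this example, and the parsing is simplified: two-character subtags after "u" are treated as keys, longer subtags as their values, and attributes that precede the first key are ignored.

```javascript
// Hypothetical helper: splits a language tag into the tag without its
// Unicode extension and a map of the extension's key/value pairs.
function splitUnicodeExtension(tag) {
    var subtags = tag.split("-");
    var start = -1;
    var end = subtags.length;
    for (var i = 1; i < subtags.length; i++) {
        if (subtags[i].length !== 1) {
            continue;
        }
        if (start !== -1) {
            end = i; // the next singleton ends the Unicode extension
            break;
        }
        var singleton = subtags[i].toLowerCase();
        if (singleton === "x") {
            break; // private use: nothing after this is an extension
        }
        if (singleton === "u") {
            start = i;
        }
    }

    var parameters = {};
    var rest = subtags;
    if (start !== -1) {
        var key = null;
        for (var j = start + 1; j < end; j++) {
            var subtag = subtags[j].toLowerCase();
            if (subtag.length === 2) {
                key = subtag;
                parameters[key] = "";
            } else if (key !== null) {
                parameters[key] = parameters[key] ?
                    parameters[key] + "-" + subtag : subtag;
            }
        }
        rest = subtags.slice(0, start).concat(subtags.slice(end));
    }
    return { locale: rest.join("-"), parameters: parameters };
}
```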

The Unicode extension creates two issues:

The simple fallback used by the Lookup algorithm doesn’t work anymore because the parameter settings are largely independent of each other, not specializations of each other. The parameters have to be interpreted separately from the rest of the language tag.
Some of the parameters don’t really have anything to do with a language or locale; they’re orthogonal and applications should be able to fully control them (a language tag typically is a user setting). Currencies in particular depend on business requirements and should never be derived from a locale.

The algorithms involved in locale and parameter negotiation solve these issues in two ways:

Unicode extension subtag sequences are separated from the rest of a language tag. The Lookup algorithm is then applied to the remaining language tag to determine the language to be used, and the parameters set in the Unicode extension subtag sequence are negotiated separately.
The API distinguishes between three groups of parameters: those that are related to the locale and are always derived from the language tag, those that should be fully under application control and are solely obtained from the options object, and those that can be derived from the language tag, but also overridden by the application. Tables in the sections on Collator, NumberFormat, and DateTimeFormat below show for each parameter how it can be set.
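
The three groups can be pictured with a small sketch. Everything in it is illustrative: the function resolveParameters, the layout of the definition objects, and the fallback values are assumptions made for this example. The three entries are a subset of the Collator parameters, one from each group. The sketch also skips the step where requested values are checked against what the implementation supports.

```javascript
// One Collator parameter from each group. The fallback values are
// assumptions for this sketch, not defaults taken from the draft.
var collatorParameters = [
    // fully under application control: options only
    { name: "usage", key: null, fromTag: false, fromOptions: true, fallback: "sort" },
    // tied to the locale: language tag only
    { name: "collation", key: "co", fromTag: true, fromOptions: false, fallback: "default" },
    // derived from the language tag, but can be overridden by options
    { name: "caseFirst", key: "kf", fromTag: true, fromOptions: true, fallback: "false" }
];

// tagParameters: key/value pairs from the Unicode extension of the tag.
// options: the options object passed by the application.
function resolveParameters(definitions, tagParameters, options) {
    var resolved = {};
    definitions.forEach(function (def) {
        var value = def.fallback;
        if (def.fromTag &&
                Object.prototype.hasOwnProperty.call(tagParameters, def.key)) {
            value = tagParameters[def.key];
        }
        if (def.fromOptions && options[def.name] !== undefined) {
            value = options[def.name]; // options win over the language tag
        }
        resolved[def.name] = value;
    });
    return resolved;
}
```

For a request with the extension parameters co set to "phonebk" and kf set to "upper", and options {usage: "search", caseFirst: "lower"}, the sketch resolves usage to "search", collation to "phonebk", and caseFirst to "lower".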

The need to decide for each key in the Unicode extension how it should be treated in the API unfortunately means that the specification cannot allow implementations to support newly added keys.

The options object can contain not only properties corresponding to some of the keys of the Unicode extension, but also properties for other parameters that let the application control the behavior of the constructed object. Most parameters have default values (either provided by the specification or locale and implementation dependent), so that for the most common use cases the options object can be omitted.

Applications can get information about the results of locale and parameter negotiation through the resolvedOptions accessor property, which has properties for all parameters. The locale property contains a language tag with the locale that was selected among the implementation’s available locales, plus those Unicode extension parameters that were requested and are supported by the implementation. Applications can also use the supportedLocalesOf functions to determine which subset of a list of locales is supported by an implementation, possibly through fallbacks. There’s no function to get a list of all locales supported by an implementation as this list could be huge.

Collation

Collator objects support two usage scenarios: Sorting the strings in a list, and searching for matching strings in a set of strings. Sorting generally needs to be sensitive to minor differences in strings, such as diacritical marks or casing, so that it is clear whether pêche sorts before or after péché. In searching, on the other hand, such minor differences are often ignored. (Some languages, however, treat certain characters with diacritical marks as separate characters, thus considering the marks major differences.) For some languages, collation may define several sort orders, such as the dictionary and phone book sort orders for German, which differ in their handling of umlauted characters. In some applications, punctuation should be ignored in comparison.

The Unicode extension defines a number of additional parameters. Some of these parameters require special handling:

The "co" key can be used in language tags to select both usage and specific collations. In the Globalization API, these are separated: Usage is specified through the options object, so the values "standard" and "search" are ignored if used in language tags.
Implementations are not required to support the parameters backwards, caseLevel, numeric, hiraganaQuaternary, normalization, and caseFirst. If they don’t, they still have to check the values of their options properties so that erroneous values result in the same exceptions on all implementations, but their resolvedOptions properties do not report back the settings of unsupported parameters.
The variableTop parameter defined in the Unicode extension is not supported by Collator.

Collator uses the following parameters:

Parameter	Language Tag	Options	Values
`locale`	✔	—	BCP 47 language tag
`usage` (collation)	`co` —	✔	`"sort"`, `"search"`
`sensitivity` (strength)	`ks` —	✔	`"base"`, `"accent"`, `"case"`, `"variant"`
`ignorePunctuation` (alternate handling)	`ka` —	✔	`true`, `false`
`collation`	`co` ✔	—	see UTS 35, except `"standard"`, `"search"`
`backwards`	`kb` ✔	✔	`true`, `false`
`caseLevel`	`kc` ✔	✔	`true`, `false`
`numeric`	`kn` ✔	✔	`true`, `false`
`hiraganaQuaternary`	`kh` ✔	✔	`true`, `false`
`normalization`	`kk` ✔	✔	`true`, `false`
`caseFirst`	`kf` ✔	✔	`"upper"`, `"lower"`, `"false"`

Usage Examples

All examples assume the shortened constructor name:

var Coll = Globalization.Collator;

Sort array a according to the rules of the host environment’s current locale:

var collator = new Coll();

a.sort(function (x, y) {
    return collator.compare(x, y);
});

Sort array a according to the rules for German phone books:

var collator = new Coll(["de-u-co-phonebk"]);

a.sort(function (x, y) {
    return collator.compare(x, y);
});

Extract those elements from array a that are similar to string s according to the rules of language lang:

var collator = new Coll([lang], {usage: "search"});

var matches = a.filter(function (v) {
    return collator.compare(v, s) === 0;
});

Number Formatting

NumberFormat objects support three format styles: plain decimal formatting, currency formatting, and percent formatting. For currency formatting the currency must be specified by the application because currency use depends on business requirements that NumberFormat cannot know about. Different numbering systems can be used, depending on implementation and locale: Western (“ASCII”) digits, (real) Arabic digits, Thai digits, Roman numerals, and more. The number of digits used in representing a number can be constrained, subject to the same limitations as for Number.prototype.toFixed and Number.prototype.toPrecision; defaults depend on the format style and, in the case of currency formatting, the currency being used. The use of grouping separators can be disabled.

NumberFormat uses the following parameters:

Parameter	Language Tag	Options	Values
`locale`	✔	—	BCP 47 language tag
`numberingSystem`	`nu` ✔	—	see UTS 35
`style`	—	✔	`"decimal"`, `"currency"`, `"percent"`
`currency`	`cu` —	✔	ISO 4217 alphabetic code
`currencyDisplay`	—	✔	`"code"`, `"symbol"`, `"name"`
`minimumIntegerDigits`	—	✔	`0..21`
`minimumFractionDigits`	—	✔	`0..20`
`maximumFractionDigits`	—	✔	`0..20`
`minimumSignificantDigits`	—	✔	`1..21`
`maximumSignificantDigits`	—	✔	`1..21`
`useGrouping`	—	✔	`true`, `false`

Usage Examples

All examples assume the shortened constructor name:

var NF = Globalization.NumberFormat;

Format number n in decimal style with grouping separators according to the rules of the host environment’s current locale:

var format = new NF();

var result = format.format(n);

Format number n in decimal style with grouping separators according to the rules of language lang:

var format = new NF([lang]);

var result = format.format(n);

Format number n in currency style with grouping separators, with the localized currency symbol for Korean won, and with no fraction digits (the default for Korean won), according to the rules of language lang:

var format = new NF([lang], {style: "currency", currency: "KRW"});

var result = format.format(n);

Format number n in percent style with grouping separators, with at least 4 significant digits, according to Thai rules and with Thai digits:

var format = new NF(["th-u-nu-thai"], {style: "percent", minimumSignificantDigits: 4});

var result = format.format(n);

Date and Time Formatting

DateTimeFormat objects format a time value into a string using a subset of the following date and time components: weekday, era, year, month, day, hour, minute, second, and time zone name. Different representations of these components are available: unconstrained or 2-digit numeric; narrow, short, or long text. Implementations are required to support at least the following subsets:

weekday, year, month, day, hour, minute, second
weekday, year, month, day
year, month, day
year, month
month, day
hour, minute, second
hour, minute

Implementations may support other subsets, and requests will be negotiated against all available subset-representation combinations to find the best match. Different numbering systems can be used, as with NumberFormat. Hour representations can be forced from a locale-dependent default to 12-hour or 24-hour format. Implementations may support multiple calendars per locale; for time zones, they’re limited to UTC and the host environment’s current time zone.

DateTimeFormat uses the following parameters:

Parameter	Language Tag	Options	Values
`locale`	✔	—	BCP 47 language tag
`calendar`	`ca` ✔	—	see UTS 35
`numberingSystem`	`nu` ✔	—	see UTS 35
`timeZone`	`tz` —	✔	`"UTC"`
`hour12`	—	✔	`true`, `false`
`weekday`	—	✔	`"narrow"`, `"short"`, `"long"`
`era`	—	✔	`"narrow"`, `"short"`, `"long"`
`year`	—	✔	`"2-digit"`, `"numeric"`
`month`	—	✔	`"2-digit"`, `"numeric"`, `"narrow"`, `"short"`, `"long"`
`day`	—	✔	`"2-digit"`, `"numeric"`
`hour`	—	✔	`"2-digit"`, `"numeric"`
`minute`	—	✔	`"2-digit"`, `"numeric"`
`second`	—	✔	`"2-digit"`, `"numeric"`
`timeZoneName`	—	✔	`"short"`, `"long"`

Usage Examples

All examples assume the shortened constructor name:

var DTF = Globalization.DateTimeFormat;

Format the current date with year, month, and day components in numeric format for the host environment’s current locale and time zone:

var format = new DTF();

var result = format.format();

Format time t with weekday and month components in long format and year and day components in numeric format, for the host environment’s current time zone, according to Thai conventions with the Thai Buddhist calendar and Thai digits:

var format = new DTF(["th-u-ca-buddhist-nu-thai"], {weekday: "long", year: "numeric", month: "long", day: "numeric"});

var result = format.format(t);

Format the current time with hour, minute, and second components in 2-digit format and 24-hour time for the UTC time zone and the host environment’s current locale:

var format = new DTF(undefined, {hour: "2-digit", minute: "2-digit", second: "2-digit", hour12: false, timeZone: "UTC"});

var result = format.format();

Value Errors

The Globalization API accepts strings as specifications of locales and options, and needs the ability to report invalid values. Edition 5.1 of the ECMAScript Language Specification doesn’t offer error objects for this situation – SyntaxError objects are intended for errors in programming language source text, RangeError objects for numeric values, and TypeError objects for errors in the type of values or missing properties. The Globalization API specification therefore proposes a new error constructor ValueError, with the expectation that it will eventually migrate into the language specification.
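
To give an idea of how such an error might be used, here is a hypothetical sketch. The ValueError constructor below is only a stand-in written in plain JavaScript, since the real one would be supplied by the implementation. The helper getOption, which validates a string option against a list of allowed values, is likewise invented for this example.

```javascript
// Stand-in for the proposed ValueError constructor (an assumption for
// this sketch; an implementation would provide it natively).
function ValueError(message) {
    this.name = "ValueError";
    this.message = message || "";
}
ValueError.prototype = Object.create(Error.prototype);
ValueError.prototype.constructor = ValueError;

// Hypothetical helper: returns the value of an option if it is one of
// the allowed strings, the fallback if the option is undefined, and
// throws a ValueError otherwise.
function getOption(options, property, allowed, fallback) {
    var value = options[property];
    if (value === undefined) {
        return fallback;
    }
    value = String(value);
    if (allowed.indexOf(value) === -1) {
        throw new ValueError("Invalid value \"" + value +
            "\" for option " + property);
    }
    return value;
}
```

A call such as getOption({usage: "find"}, "usage", ["sort", "search"], "sort") throws a ValueError, while an empty options object yields the fallback "sort".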