Norbert’s Corner

The ECMAScript Internationalization API

February 26, 2012

A draft of an API specification to enable the internationalization of JavaScript applications is now available for review by Ecma TC 39, the standards body that develops ECMAScript, the standard underlying JavaScript. The ECMAScript Internationalization API Specification supports collation (string comparison), number formatting, and date and time formatting, and lets applications choose the language and tailor the functionality to their needs. The API was developed by a working group with members from Google, Microsoft, Mozilla, and Amazon; I joined as an invited expert. It’s developed as a separate standard so that the API can be added to JavaScript runtimes that implement the fifth edition of the ECMAScript Language Specification, without waiting for the sixth edition to be completed and implemented.

Like its big brother, the ECMAScript Language Specification, the Internationalization API Specification is written largely in the form of pseudocode, which makes it somewhat difficult to understand. This article provides background and a guide to the specification. Keep in mind, though, that the specification is still a draft at this point and will likely change before becoming a standard. You can help shape it by sending comments to the ECMAScript discussion mailing list.

Why an Internationalization API?

The ECMAScript Language Specification offers little internationalization support. Strings are based on Unicode, which is a good start. There are a few locale-sensitive functions, in particular String.prototype.localeCompare, Number.prototype.toLocaleString, and Date.prototype.toLocaleString with its date- and time-only variants, and that’s pretty much it. None of these functions lets applications specify the language or control details of its behavior, so they’re of little use in practice.

Some JavaScript libraries, such as Dojo, Closure, Globalize, or YUI, have filled in some of the gaps by providing their own number and date formatting as well as mechanisms for loading localized resources. Collation, however, requires large tables and complicated algorithms, and so the typical solution has been to send string lists back to a server with a good internationalization library and have them sorted there. This introduces delays, and doesn’t work at all for code that’s not connected to a server, such as JavaScript-based user interface extensions.

At the same time, ECMAScript implementations, whether they’re part of browsers, servers, or other systems, run on top of operating systems that already include comprehensive internationalization libraries. The goal of the ECMAScript Internationalization API is to provide JavaScript applications with a standard interface to these internationalization libraries.

Functionality of the API

The functionality of the first version of the API is fairly limited and mirrors the functionality of the core language: Collation (string comparison), number formatting, and date and time formatting. However, it lets applications specify languages, assisting with language negotiation, as well as control details of the behavior. One major omission even in the provided functionality is comprehensive time zone support: Not all internationalization libraries offer this yet, so for now only UTC and the “host environment’s current time zone” (in a browser, that’s usually the time zone the user has set for the operating system) are supported. Of other common internationalization functionality, resource loading is missing because ECMAScript doesn’t have an I/O system to build on, message construction is missing because the working group couldn’t converge on a solution fast enough, and normalization and other functionality are missing because their priority is lower than that of the chosen three.

The existing locale-sensitive methods String.prototype.localeCompare, Number.prototype.toLocaleString, and Date.prototype.toLocaleString with its date- and time-only variants were respecified to take locale and other parameters and interpret them in the same way as the new API introduced by the Internationalization API Specification.

The API can be used in two ways:

Identification of Locales, Currencies, and Time Zones

The API relies on widely used standards to identify locales and currencies: IETF Best Current Practice (BCP) 47, and ISO 4217.

BCP 47 consists of two RFCs, currently 5646 and 4647, and the IANA Language Subtag Registry. The language tags it defines are used in HTML, CSS, XML, HTTP, and other standard formats and protocols. In most cases, language tags are straightforward: A language code, possibly a script code, possibly a country code, all separated by hyphens: "de" for German, "zh-Hans" for simplified Chinese, "zh-Hans-SG" for simplified Chinese as used in Singapore. However, BCP 47 allows for extensions, and so far two have been defined. The newer one of these, the "t" or Transformed Content extension defined in RFC 6497, can be safely ignored by the Internationalization API. On the other hand, the "u" or Unicode extension, defined in RFC 6067 and Unicode Technical Standard 35, matters a lot. This extension allows the specification of additional parameters for collation, number formatting, and date and time formatting as part of a language tag, and so it has to be taken into consideration in the design of language and parameter negotiation in the API, as discussed below. The possibility of parameters that are not related to languages is the reason why the more generic term “locale” is used.

ISO 4217 defines both numeric and alphabetic codes for currencies, of which the Internationalization API uses only the alphabetic 3-letter codes. The code list also provides information on the number of digits typically used to represent the minor units of a currency, e.g. 2 for the euro (which has 100 cents), or 0 for the yen (whose minor unit sen isn’t used anymore).
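To make the role of the minor-unit information concrete, here is a minimal sketch of how a formatter might pick its default number of fraction digits from the currency code. The names minorUnitDigits and formatAmount are made up for this illustration and are not part of the API; the table contains only the two currencies mentioned above.

```javascript
// Hypothetical lookup table: minor-unit digits for the two currencies
// mentioned in the text. A real implementation would cover all of ISO 4217.
var minorUnitDigits = {EUR: 2, JPY: 0};

// Hypothetical helper: formats an amount with the number of fraction
// digits that is typical for the given currency.
function formatAmount(amount, currency) {
    var code = currency.toUpperCase();
    if (!Object.prototype.hasOwnProperty.call(minorUnitDigits, code)) {
        throw new RangeError("unknown currency code: " + currency);
    }
    return amount.toFixed(minorUnitDigits[code]) + " " + code;
}

var euros = formatAmount(1234.5, "EUR"); // two fraction digits
var yen = formatAmount(1234.5, "jpy");   // no fraction digits
```

The sketch does no locale-sensitive formatting at all; it only shows where the digit count comes from.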

For time zones, there isn’t much to identify: This version of the API only supports UTC and the host environment’s current time zone. The first is identified by "UTC". There is no identifier for the second; it’s used when the time zone is left undefined.

The Intl Object

The Intl object provides a namespace for the constructors of the Internationalization API – we expect that over time the API will add more constructors, and want to minimize the risk of name collisions. Its name started out as Internationalization or its uglified form i18n; for a while we used Globalization; finally we switched to Intl because it’s shorter. Other candidates were Text, Presentation, Format, World, or 国际化 (nothing beats Chinese for brevity!).

Locale Lists

LocaleList objects contain lists of language tags. The language tags are stored as indexed elements, and a length property provides the number of elements, so that locale lists can be passed to functions that accept generic arrays, such as the functions in the Array prototype object. The constructor verifies that all language tags provided to it are well-formed (i.e., match the grammar of BCP 47 language tags), converts them into their canonical form so that they’re easy to compare and process, and removes duplicates. To ensure consistency, locale lists are immutable.
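The three steps described above (verify, canonicalize, remove duplicates) plus immutability can be sketched in plain JavaScript. This is only an illustration under simplifying assumptions: makeLocaleList is a made-up name, the well-formedness check accepts only language, script, and region subtags, and throwing a RangeError for other input is a choice made for this sketch.

```javascript
// Hypothetical helper, not part of the draft API: mimics the steps the
// LocaleList constructor is described as performing.
function makeLocaleList(tags) {
    // Simplified grammar: language, optional script, optional region.
    // The real BCP 47 grammar also allows variants, extensions, and more.
    var pattern = /^([a-z]{2,3})(-[a-z]{4})?(-(?:[a-z]{2}|[0-9]{3}))?$/i;
    var seen = {};
    var result = [];
    tags.forEach(function (tag) {
        var match = pattern.exec(tag);
        if (match === null) {
            throw new RangeError("not a well-formed language tag: " + tag);
        }
        // Canonical case: language lower case, script title case, region upper case.
        var canonical = match[1].toLowerCase();
        if (match[2]) {
            canonical += "-" + match[2].charAt(1).toUpperCase() + match[2].slice(2).toLowerCase();
        }
        if (match[3]) {
            canonical += match[3].toUpperCase();
        }
        // Keep only the first occurrence of each canonical tag.
        if (!Object.prototype.hasOwnProperty.call(seen, canonical)) {
            seen[canonical] = true;
            result.push(canonical);
        }
    });
    // Freezing the array makes the list immutable.
    return Object.freeze(result);
}

var list = makeLocaleList(["ZH-hans-sg", "zh-Hans-SG", "de"]);
```

Here the first two tags differ only in case, so the resulting list has two elements.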

Applications do not have to use LocaleList objects; functions that accept them also accept generic arrays with language tags. However, using LocaleList objects can avoid the overhead of verifying and canonicalizing language tags repeatedly.

Applications can obtain the language tag for the host environment’s current locale by constructing a LocaleList object with no argument – the constructor then fills in the host environment’s current locale at index 0.

The working group discussed extensively whether there should be an API to set a default locale list that would then be used throughout the Internationalization API. Two issues prevent this: First, a settable default locale list would create a global communication channel between different scripts running within the same environment, which is considered a security risk. Second, an application may include different components, such as embedded apps, that need different default locales. ECMAScript has no knowledge of these components and no way to manage appropriate contexts for them. We decided therefore that a default locale is better left to higher-level systems. For example, the YUI library already includes an Intl module which manages a list of requested locales that is scoped to the containing YUI object and used for loading resource bundles. This module could easily be modified to keep a locale list object so that it can be used as a default within the scope of the containing YUI object.

Usage Examples

Create a locale list with the single language tag for the language of a monolingual application, in this example Indonesian for Indonesia:

var locales = new Intl.LocaleList(["id-ID"]);

Create a locale list from the value of an HTTP Accept-Language header, one of the possible sources of information about the user’s language preference:

// Wrapper function added so that the final return statement is valid; its name is
// illustrative. value is the raw value of the Accept-Language header.
function localeListFromAcceptLanguage(value) {
    var sections = value.trim().split(/\s*,\s*/);

    function filter(section) {
        // throw out empty strings, special range "*", and unacceptable language ranges (q=0)
        return section !== "" && section[0] !== "*" && section.match(/;\s*q\s*=\s*(0(\.0{0,3})?)$/) === null;
    }
    function toRecord(section, index) {
        // break out language range and qvalue for sorting. Include index for stable sorting.
        // Note: for a malformed section match is null, and the next line throws a TypeError.
        var match = section.match(/^([a-zA-Z0-9\-]+)\s*(;\s*q\s*=\s*(0(\.[0-9]{0,3})?|1(\.0{0,3})?))?$/);
        return {l: match[1], q: match[3] ? +match[3] : 1, i: index};
    }
    function compareRecord(a, b) {
        // higher qvalues first; include index in comparison to ensure stable sorting
        return (a.q < b.q) ? 1 : (a.q > b.q) ? -1 : (a.i - b.i);
    }
    function toLang(r) {
        return r.l;
    }

    return new Intl.LocaleList(sections.filter(filter).map(toRecord).sort(compareRecord).map(toLang));
}

Locale and Parameter Negotiation

The constructors for the three locale-sensitive services, Collator, NumberFormat, and DateTimeFormat, as well as the respecified locale-sensitive functions in String, Number, and Date each take a locale list and an options argument. Together, these arguments form a request, which the constructors compare against the capabilities of their implementations to determine the actual locale and parameter settings to be used (a somewhat one-sided “negotiation”). Applications can find out about the result of the negotiation for constructed objects through the resolvedOptions accessor property. The request passed to the API may be fixed by the application (an application that only supports one language will request exactly that language), or may be the result of a higher-level negotiation between user preferences and application capabilities. The API does not deal with such higher-level negotiations.

For language negotiation, BCP 47 provides a simple “lookup” algorithm, which compares a request consisting of a prioritized list of language tags with a set of available languages. It takes the language tags of the request in sequence and checks for each one whether it can be matched with one of the available languages either directly or with a fallback, where the fallback simply strips off subtags from the end, e.g., from "zh-Hans-SG" to "zh-Hans". The first match wins.
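The fallback described above is easy to sketch in plain JavaScript. The function name lookupLocale and the plain-array representation of the available languages are assumptions made for this illustration; the sketch compares tags case-sensitively, so it expects canonicalized input, and it also drops a single-letter subtag (such as the "u" of an extension) when truncation leaves it at the end.

```javascript
// Hypothetical helper illustrating lookup-style matching.
// requested: prioritized array of language tags
// available: array of language tags that are supported
function lookupLocale(requested, available) {
    for (var i = 0; i < requested.length; i++) {
        var candidate = requested[i];
        while (candidate !== "") {
            if (available.indexOf(candidate) !== -1) {
                return candidate; // the first match wins
            }
            // Fallback: strip the last subtag, e.g. "zh-Hans-SG" -> "zh-Hans".
            var pos = candidate.lastIndexOf("-");
            // Do not leave a dangling single-letter subtag such as "-u".
            if (pos >= 2 && candidate.charAt(pos - 2) === "-") {
                pos -= 2;
            }
            candidate = pos === -1 ? "" : candidate.substring(0, pos);
        }
    }
    return undefined; // nothing matched, not even through fallback
}

var chosen = lookupLocale(["zh-Hans-SG", "de"], ["de", "zh-Hans", "en"]);
```

In this call the first requested tag matches "zh-Hans" through fallback, so "de" is never considered even though it is available directly.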

The lookup algorithm doesn’t always provide the best possible results. For example, if "es-GT" (Spanish for Guatemala) is requested, but not available, it falls back to "es", which is typically implemented as Spanish for Spain. A better choice might be the Spanish variant used in Guatemala’s neighbor Mexico, "es-MX". The API specification therefore allows implementations to provide a better “best fit” algorithm, and makes this algorithm the default.

As mentioned earlier, the Unicode extension of BCP 47 allows a number of parameters to be set as part of a language tag. For example, the use of the euro with a simplified Chinese currency format can be specified as "zh-Hans-u-cu-eur", the use of the sort order for German phone books as "de-u-co-phonebk". Multiple parameters can be combined in one language tag, e.g., "de-u-co-phonebk-cu-usd" for German with phone book sorting and the U.S. dollar as currency.
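To see how such a tag breaks down, here is a small sketch that separates the base language tag from the key/value pairs of the Unicode extension. The function name parseUnicodeExtension and the shape of its result are made up for this illustration; the sketch assumes a well-formed, lower-case extension and ignores any attributes that precede the first key.

```javascript
// Hypothetical helper: splits a language tag into the part before the
// "u" extension and the extension's key/value pairs.
function parseUnicodeExtension(tag) {
    var subtags = tag.split("-");
    var start = subtags.indexOf("u");
    var result = {base: tag, keywords: {}};
    if (start === -1) {
        return result; // no Unicode extension present
    }
    result.base = subtags.slice(0, start).join("-");
    var key = null;
    for (var i = start + 1; i < subtags.length; i++) {
        var subtag = subtags[i];
        if (subtag.length === 1) {
            break; // a single letter starts the next extension
        }
        if (subtag.length === 2) {
            // Keys are two characters long.
            key = subtag;
            result.keywords[key] = "";
        } else if (key !== null) {
            // Longer subtags are (parts of) the value of the current key.
            result.keywords[key] += (result.keywords[key] ? "-" : "") + subtag;
        }
    }
    return result;
}

var parsed = parseUnicodeExtension("de-u-co-phonebk-cu-usd");
// parsed.base is "de"; parsed.keywords has the keys "co" and "cu"
```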

The Unicode extension creates two issues:

The algorithms involved in locale and parameter negotiation solve these issues in two ways:

The need to decide for each key in the Unicode extension how it should be treated in the API unfortunately means that the specification cannot allow implementations to support newly added keys.

The options object can contain not only properties corresponding to some of the keys of the Unicode extension, but also properties for other parameters that let the application control the behavior of the constructed object. Most parameters have default values (either provided by the specification or locale and implementation dependent), so that for the most common use cases the options object can be omitted.

Applications can get information about the results of locale and parameter negotiation for constructed objects through the resolvedOptions accessor property, which has properties for all parameters except the matcher parameters. The locale property contains a language tag with the locale that was selected among the implementation’s available locales, plus those Unicode extension parameters that were requested and are supported by the implementation. Applications can also use the supportedLocalesOf functions to determine which subset of a list of locales is supported by an implementation, possibly through fallbacks. There’s no function to get a list of all locales supported by an implementation as this list could be huge.

Collation

Collator objects and String.prototype.localeCompare support two usage scenarios: Sorting the strings in a list, and searching for matching strings in a set of strings. Sorting generally needs to be sensitive to minor differences in strings, such as diacritical marks or casing, so that it is clear whether pêche sorts before or after péché. In searching, on the other hand, such minor differences are often ignored. (Some languages, however, treat certain characters with diacritical marks as separate characters, thus considering the marks major differences.) For some languages, collation may define several sort orders, such as the dictionary and phone book sort orders for German, which differ in their handling of umlauted characters. In some applications, punctuation should be ignored in comparison.

The Unicode extension defines a number of additional parameters. Some of these parameters require special handling:

Collator and String.prototype.localeCompare use the following parameters:

Parameter                                Language Tag   Values
locale                                                  BCP 47 language tag
localeMatcher                                           "best fit", "lookup"
usage (collation)                        co             "sort", "search"
sensitivity (strength)                   ks             "base", "accent", "case", "variant"
ignorePunctuation (alternate handling)   ka             true, false
collation                                co             see UTS 35, except "standard", "search"
backwards                                kb             true, false
caseLevel                                kc             true, false
numeric                                  kn             true, false
hiraganaQuaternary                       kh             true, false
normalization                            kk             true, false
caseFirst                                kf             "upper", "lower", "false"

Usage Examples

Sort array a according to the rules of the host environment’s current locale:

var collator = new Intl.Collator();
a.sort(collator.compare);

Sort array a according to the rules for German phone books:

var collator = new Intl.Collator(["de-u-co-phonebk"]);
a.sort(collator.compare);

The same could be done with String.prototype.localeCompare, but that doesn’t actually simplify the code because localeCompare cannot be passed directly to sort, and is less efficient because locale negotiation has to be repeated for each string comparison:

a.sort(function (x, y) {
    return x.localeCompare(y, ["de-u-co-phonebk"]);
});

Extract those elements from array a that are similar to string s according to the rules of language lang:

var collator = new Intl.Collator([lang], {usage: "search"});
var matches = a.filter(function (v) {
    return collator.compare(v, s) === 0;
});
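What counts as “similar” depends on the sensitivity parameter from the table above. The following sketch contrasts two of its values. It assumes a runtime whose Intl.Collator accepts these options and treats "base" as ignoring accent and case differences; the sample strings and the helper name findSimilar are made up for this illustration.

```javascript
// Illustrative data; the escapes keep the example independent of the
// file encoding ("p\u00EAche" is "pêche").
var a = ["peche", "p\u00EAche", "P\u00EAche", "poche"];
var s = "peche";

// Hypothetical helper: returns the elements of a that compare as equal
// to s under the given sensitivity.
function findSimilar(sensitivity) {
    var collator = new Intl.Collator(["fr"], {usage: "search", sensitivity: sensitivity});
    return a.filter(function (v) {
        return collator.compare(v, s) === 0;
    });
}

// "base": only differences in base letters count.
var looseMatches = findSimilar("base");
// "variant": accent and case differences count as well.
var strictMatches = findSimilar("variant");
```

With "base", the three spellings of the same base letters should all match while "poche" does not; with "variant", only the exact string should remain.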

Check whether a newly constructed Collator object supports numeric sorting; if not, use a function that enhances the collator’s behavior with this feature:

var collator = new Intl.Collator(locales, {numeric: true});
var f;
if (collator.resolvedOptions.numeric) {
    f = collator.compare;
} else {
    f = makeNumericCompare(collator);
}
a.sort(f);
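The makeNumericCompare function is not part of the API and is left undefined above. One possible sketch, assuming only that the object passed in has a compare function, splits each string into runs of ASCII digits and other characters, compares digit runs by numeric value, and leaves everything else to the collator. Very long digit runs would lose precision when converted to numbers, which this sketch ignores.

```javascript
// One possible implementation of the hypothetical makeNumericCompare helper.
// collator: any object with a compare(x, y) function
function makeNumericCompare(collator) {
    function split(s) {
        // Break a string into alternating runs of ASCII digits and non-digits.
        return String(s).match(/[0-9]+|[^0-9]+/g) || [];
    }
    return function (x, y) {
        var xs = split(x), ys = split(y);
        var n = Math.min(xs.length, ys.length);
        for (var i = 0; i < n; i++) {
            var bothNumeric = /^[0-9]/.test(xs[i]) && /^[0-9]/.test(ys[i]);
            var result = bothNumeric
                ? Number(xs[i]) - Number(ys[i])     // compare by numeric value
                : collator.compare(xs[i], ys[i]);   // delegate to the collator
            if (result !== 0) {
                return result;
            }
        }
        // All shared runs are equal: the string with fewer runs sorts first.
        return xs.length - ys.length;
    };
}

// Stand-in for a Collator object, so that the sketch runs anywhere.
var plainCollator = {
    compare: function (x, y) { return x < y ? -1 : x > y ? 1 : 0; }
};
var sorted = ["file10", "file2", "file1"].sort(makeNumericCompare(plainCollator));
```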

Number Formatting

NumberFormat objects and Number.prototype.toLocaleString support three format styles: plain decimal formatting, currency formatting, and percent formatting. For currency formatting the currency must be specified by the application because currency use depends on business requirements that NumberFormat cannot know about. Different numbering systems can be used, depending on implementation and locale: Western (“ASCII”) digits, (real) Arabic digits, Thai digits, Roman numerals, and more. The number of digits used to represent a number can be constrained, subject to the same limitations as for Number.prototype.toFixed and Number.prototype.toPrecision; defaults depend on the format style and, in the case of currency formatting, the currency being used. The use of grouping separators can be disabled.

NumberFormat and Number.prototype.toLocaleString use the following parameters:

Parameter                  Language Tag   Values
locale                                    BCP 47 language tag
localeMatcher                             "best fit", "lookup"
numberingSystem            nu             see UTS 35
style                                     "decimal", "currency", "percent"
currency                   cu             ISO 4217 alphabetic code
currencyDisplay                           true, false
minimumIntegerDigits                      1..21
minimumFractionDigits                     0..20
maximumFractionDigits                     0..20
minimumSignificantDigits                  1..21
maximumSignificantDigits                  1..21
useGrouping                               true, false

Usage Examples

Format number n in decimal style with grouping separators according to the rules of the host environment’s current locale:

var result = n.toLocaleString();

Format number n in decimal style with grouping separators according to the rules of language lang:

var result = n.toLocaleString([lang]);

Format number n in currency style with grouping separators, with the localized currency symbol for Korean won, but with no fraction digits (the default for Korean won), according to the rules of language lang:

var result = n.toLocaleString([lang], {style: "currency", currency: "KRW"});

Apply the same format to a large array of numbers n, where reusing a NumberFormat object is likely to yield better performance:

var format = new Intl.NumberFormat([lang], {style: "currency", currency: "KRW"});
var i, len = n.length, result = new Array(len);
for (i = 0; i < len; i++) {
    result[i] = format.format(n[i]);
}

Format number n in percent style with grouping separators, with at least 4 significant digits, according to Thai rules and with Thai digits:

var result = n.toLocaleString(["th-u-nu-thai"], {style: "percent", minimumSignificantDigits: 4});

Check whether a newly constructed NumberFormat object supports Tamil digits; if not, use a post-processing function:

var format = new Intl.NumberFormat(["ta-u-nu-tamldec"]);
var result = format.format(n);
if (format.resolvedOptions.numberingSystem !== "tamldec") {
    result = convertToTamilDigits(result);
}
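The convertToTamilDigits function is not part of the API. A minimal sketch, assuming that the formatted string contains ASCII digits and that everything else (separators, signs) should be left alone, maps each digit to the corresponding Tamil digit in the range U+0BE6 to U+0BEF:

```javascript
// One possible implementation of the hypothetical convertToTamilDigits helper.
function convertToTamilDigits(s) {
    return s.replace(/[0-9]/g, function (digit) {
        // U+0BE6 is TAMIL DIGIT ZERO; the digits one to nine follow in order.
        return String.fromCharCode(0x0BE6 + (digit.charCodeAt(0) - 0x30));
    });
}

var converted = convertToTamilDigits("1,204.5");
```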

Date and Time Formatting

DateTimeFormat objects and Date.prototype.toLocaleString with its date- and time-only variants format a time value into a string using a subset of the following date and time components: weekday, era, year, month, day, hour, minute, second, and time zone name. Different representations of these components are available: unconstrained or 2-digit numeric; narrow, short, or long text. Implementations are required to support at least the following subsets:

Implementations may support other subsets, and requests will be negotiated against all available subset-representation combinations to find the best match. Two algorithms are available for this negotiation: A fully specified “basic” algorithm and an implementation dependent “best fit” algorithm (the latter is the default). Different numbering systems can be used, as with NumberFormat. Hour representations can be forced from a locale-dependent default to 12-hour or 24-hour format. Implementations may support multiple calendars per locale; for time zones, they’re limited to UTC and the host environment’s current time zone.

DateTimeFormat and Date.prototype.toLocaleString with its date- and time-only variants use the following parameters:

Parameter         Language Tag   Values
locale                           BCP 47 language tag
localeMatcher                    "best fit", "lookup"
formatMatcher                    "best fit", "basic"
calendar          ca             see UTS 35
numberingSystem   nu             see UTS 35
timeZone          tz             "UTC"
hour12                           true, false
weekday                          "narrow", "short", "long"
era                              "narrow", "short", "long"
year                             "2-digit", "numeric"
month                            "2-digit", "numeric", "narrow", "short", "long"
day                              "2-digit", "numeric"
hour                             "2-digit", "numeric"
minute                           "2-digit", "numeric"
second                           "2-digit", "numeric"
timeZoneName                     "short", "long"

Usage Examples

Format the current date with year, month, and day components in numeric format for the host environment’s current locale and time zone:

var result = (new Date()).toLocaleDateString();

Format time t with weekday and month components in long format and year and day components in numeric format (the table above allows only "2-digit" and "numeric" for year and day), for the host environment’s current time zone, according to Thai conventions with the Thai Buddhist calendar and Thai digits:

var result = (new Date(t)).toLocaleDateString(["th-u-ca-buddhist-nu-thai"], {weekday: "long", year: "numeric", month: "long", day: "numeric"});

Format the current time with hour, minute, and second components in 2-digit format and 24-hour time for the UTC time zone and the best available locale from list locales:

var result = (new Date()).toLocaleTimeString(locales, {hour: "2-digit", minute: "2-digit", second: "2-digit", hour12: false, timeZone: "UTC"});

Check whether a newly constructed DateTimeFormat object supports the Islamic calendar; if not, use a function that does:

var format = new Intl.DateTimeFormat(["ar-EG-u-ca-islamicc"]);
var result;
if (format.resolvedOptions.calendar === "islamicc") {
    result = format.format(new Date());
} else {
    result = formatWithIslamicCalendar(new Date());
}

Initializing Subclass Objects

ECMAScript is not a class-based language, but developers have invented several ways of simulating class hierarchies. In this context it’s necessary to call the superclass constructor while actually constructing a subclass instance, and so use the superclass constructor as a function, not within a new expression. Unlike the constructors in the ECMAScript Language Specification, the constructors in the Internationalization API are designed to allow this usage:

function MyCollator(localeList, options) {
    Intl.Collator.call(this, localeList, options);
    // initialize MyCollator properties
}
MyCollator.prototype = Object.create(Intl.Collator.prototype);
MyCollator.prototype.constructor = MyCollator;
// add methods to MyCollator.prototype

var collator = new MyCollator(["de-u-co-phonebk"]);
a.sort(collator.compare);

Changes from the Previous Version

This article is an updated version of The ECMAScript Globalization API, which I published in November 2011, around the time the working group submitted the first draft of the specification to TC39. The following significant changes were made in the meantime: