The ECMAScript Internationalization API

December 18, 2012

The ECMAScript Internationalization API Specification provides collation (string comparison), number formatting, and date and time formatting for JavaScript applications, and lets applications choose the language and tailor the functionality to their needs. It is designed as an extension of the ECMAScript Language Specification, the standard underlying JavaScript, and can be added to JavaScript runtimes that implement the fifth edition of the ECMAScript Language Specification, without waiting for the sixth edition to be completed and implemented. The API was developed by a working group with members from Google, Microsoft, Mozilla, Amazon, and IBM; I joined first as an invited expert and then on behalf of Mozilla. The Ecma General Assembly approved the specification in December 2012 as an official standard, ECMA-402. Implementations for several browsers are under way.

Like its big brother, the ECMAScript Language Specification, the Internationalization API Specification is written largely in the form of pseudocode, which makes it somewhat difficult to understand. This article provides background and a guide to the specification.

Why an Internationalization API?

The ECMAScript Language Specification offers little internationalization support. Strings are based on Unicode, which is a good start. There are a few locale sensitive functions, in particular String.prototype.localeCompare, Number.prototype.toLocaleString, and Date.prototype.toLocaleString with its date- and time-only variants. However, none of these functions let applications specify the language or control details of their behavior, so they’re pretty useless in practice. And that’s pretty much it.

Some JavaScript libraries, such as Dojo, Closure, Globalize, YUI, or Moment, have filled in some of the gaps by providing their own number and date formatting as well as mechanisms for loading localized resources. Collation, however, requires large tables and complicated algorithms, and so the typical solution has been to send string lists back to a server with a good internationalization library and have them sorted there. This introduces delays, and doesn’t work at all for code that’s not connected to a server, such as JavaScript-based user interface extensions.

At the same time, ECMAScript implementations, whether they’re part of browsers, servers, or other systems, run on top of operating systems that already include comprehensive internationalization libraries. The goal of the ECMAScript Internationalization API is to provide JavaScript applications with a standard interface to these internationalization libraries.

Functionality of the API

The functionality of the first version of the API is fairly limited and mirrors the functionality of the core language: Collation (string comparison), number formatting, and date and time formatting. However, it lets applications specify languages, assisting with language negotiation, as well as control details of the behavior. One major omission even in the provided functionality is comprehensive time zone support: Not all internationalization libraries offer this yet, so for now only UTC and the “host environment’s current time zone” (in a browser, that’s usually the time zone the user has set for the operating system) are supported. Of other common internationalization functionality, resource loading is missing because ECMAScript doesn’t have an I/O system to build on, message construction is missing because the working group couldn’t converge on a solution fast enough, and normalization and other functionality are missing because their priority is lower than that of the chosen three.

The existing locale sensitive methods String.prototype.localeCompare, Number.prototype.toLocaleString, and Date.prototype.toLocaleString with its date- and time-only variants were respecified to take locale and other parameters and interpret them in the same way as the new API introduced by the Internationalization API Specification.

The API can be used in two ways:

  • Applications can use the existing methods on String, Number, and Date with the new parameters. This is simple, but does not let applications find out whether their requests can be satisfied exactly, and may result in slightly lower performance.
  • Applications can create objects for the language sensitive services, requesting locales and other parameters, and then use a compare or format method of the constructed object. This pattern allows for efficient reuse of the completely configured objects, and also allows applications to query the objects for details of their configuration.

Identification of Locales, Currencies, and Time Zones

The API relies on widely used standards to identify locales and currencies: IETF Best Current Practice (BCP) 47, and ISO 4217.

BCP 47 consists of two RFCs, currently 5646 and 4647, and the IANA Language Subtag Registry. The language tags it defines are used in HTML, CSS, XML, HTTP, and other standard formats and protocols. In most cases, language tags are straightforward: A language code, possibly a script code, possibly a country code, all separated by hyphens: "de" for German, "zh-Hans" for simplified Chinese, "zh-Hans-SG" for simplified Chinese as used in Singapore. However, BCP 47 allows for extensions, and so far two have been defined. The newer one of these, the "t" or Transformed Content extension defined in RFC 6497, can be safely ignored by the Internationalization API. On the other hand, the "u" or Unicode extension, defined in RFC 6067 and Unicode Technical Standard 35, matters a lot. This extension allows the specification of additional parameters for collation, number formatting, and date and time formatting as part of a language tag, and so it has to be taken into consideration in the design of language and parameter negotiation in the API, as discussed below. The possibility of parameters that are not related to languages is the reason why the more generic term “locale” is used.

ISO 4217 defines both numeric and alphabetic codes for currencies, of which the Internationalization API uses only the alphabetic 3-letter codes. The code list also provides information on the number of digits typically used to represent the minor units of a currency, e.g. 2 for the euro (which has 100 cents), or 0 for the yen (whose minor unit sen isn’t used anymore).
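The effect of the minor-unit information can be observed on constructed NumberFormat objects. The following is a minimal sketch, assuming an implementation of the API is available; the locale "en" and the variable names are chosen only for illustration.

```javascript
// The ISO 4217 minor-unit digits become the default number of fraction
// digits for currency formatting.
var euro = new Intl.NumberFormat("en", {style: "currency", currency: "EUR"});
var yen = new Intl.NumberFormat("en", {style: "currency", currency: "JPY"});

var euroDigits = euro.resolvedOptions().minimumFractionDigits; // 2
var yenDigits = yen.resolvedOptions().minimumFractionDigits;   // 0
```

Applications can still override these defaults through the fraction digit options described in the section on number formatting.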

For time zones, there isn’t much to identify: This version of the API only supports UTC and the host environment’s current time zone. The first is identified by "UTC". There is no identifier for the second; it’s used when the time zone is left undefined.
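As a small illustration of the two choices, the sketch below formats the same time value once for UTC and once for the host environment's current time zone. The locale "en-US", the sample time, and the variable names are assumptions made for the example.

```javascript
var t = new Date(Date.UTC(2012, 11, 18, 12, 0)); // noon UTC on December 18, 2012

// Request UTC explicitly through the timeZone option.
var utcFormat = new Intl.DateTimeFormat("en-US",
    {hour: "2-digit", minute: "2-digit", timeZone: "UTC"});

// Leave timeZone undefined to use the host environment's current time zone.
var hostFormat = new Intl.DateTimeFormat("en-US",
    {hour: "2-digit", minute: "2-digit"});

var utcTime = utcFormat.format(t);
var hostTime = hostFormat.format(t); // depends on the host's time zone setting
var resolvedZone = utcFormat.resolvedOptions().timeZone; // "UTC"
```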

The Intl Object

The Intl object provides a namespace for the constructors of the Internationalization API – we expect that over time the API will add more constructors, and want to minimize the risk of name collisions. Its name started out as Internationalization or its uglified form i18n; for a while we used Globalization; finally we switched to Intl because it’s shorter. Other candidates were Text, Presentation, Format, World, or 国际化 (nothing beats Chinese for brevity!).

Locale Lists

The constructors in the Internationalization API as well as the updated locale sensitive methods on String, Number, and Date accept either single language tags or locale lists, which should reflect the user’s language preferences. Locale lists are represented as simple arrays of language tags. For a while, we were planning to have specialized LocaleList objects in the API, array-like objects that verified during their construction that all their elements were well-formed language tags, and canonicalized them for easier processing by other parts of the API. However, there wasn’t enough evidence that they really make processing that much easier and faster, and so the API now uses normal arrays throughout. Language tags are now verified and canonicalized by the constructors and methods that accept arrays of language tags.
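Verification and canonicalization can be observed through any function that accepts a locale list. The following sketch uses supportedLocalesOf and assumes that the implementation supports "en-US"; the deliberately ill-formed tag is made up for the example.

```javascript
// Tags are canonicalized: "EN-us" is reported back as "en-US".
var supported = Intl.Collator.supportedLocalesOf(["EN-us"]);

// Ill-formed tags are rejected with a RangeError.
var rejected = false;
try {
    Intl.Collator.supportedLocalesOf(["en", "not a tag"]);
} catch (e) {
    rejected = e instanceof RangeError;
}
```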

The working group discussed extensively whether there should be an API to set a default locale list that would then be used throughout the Internationalization API. Two issues prevent this: First, a settable default locale list would create a global communication channel between different scripts running within the same environment, which is considered a security risk. Second, an application may include different components, such as embedded apps, that need different default locales. ECMAScript has no knowledge of these components and no way to manage appropriate contexts for them. We decided therefore that a default locale is better left to higher-level systems. For example, the YUI library already includes an Intl module which manages a list of requested locales that is scoped to the containing YUI object and used for loading resource bundles. This module could easily be modified to keep a locale list object so that it can be used as a default within the scope of the containing YUI object.

Locale and Parameter Negotiation

The constructors for the three locale sensitive services, Collator, NumberFormat, and DateTimeFormat, as well as the respecified locale sensitive functions in String, Number, and Date each take a language tag or locale list and an options argument. Together, these arguments form a request, which the constructors compare against the capabilities of their implementations to determine the actual locale and parameter settings to be used (a somewhat one-sided “negotiation”). Applications can find out about the result of the negotiation for constructed objects through the resolvedOptions method. The request passed to the API may be fixed by the application (an application that only supports one language will request exactly that language), or may be the result of a higher-level negotiation between user preferences and application capabilities. The API does not deal with such higher-level negotiations.

For language negotiation, BCP 47 provides a simple “lookup” algorithm, which compares a request consisting of a prioritized list of language tags with a set of available languages. It takes the language tags of the request in sequence and checks for each one whether it can be matched with one of the available languages either directly or with a fallback, where the fallback simply strips off subtags from the end, e.g., from "zh-Hans-SG" to "zh-Hans". The first match wins.
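The fallback described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the specification's algorithm: the function name lookup is hypothetical, tags are assumed to be canonicalized already, and Unicode extension sequences are assumed to have been removed.

```javascript
// Returns the first available tag that matches a requested tag directly or
// through fallback, or undefined if nothing matches.
function lookup(requested, available) {
    for (var i = 0; i < requested.length; i++) {
        var candidate = requested[i];
        while (candidate) {
            if (available.indexOf(candidate) !== -1) {
                return candidate;
            }
            var pos = candidate.lastIndexOf("-");
            if (pos === -1) {
                break; // nothing left to strip; try the next requested tag
            }
            // Also drop a single-character subtag that would otherwise be
            // left dangling at the end of the shortened tag.
            if (pos >= 2 && candidate.charAt(pos - 2) === "-") {
                pos -= 2;
            }
            candidate = candidate.substring(0, pos);
        }
    }
    return undefined;
}

var match = lookup(["zh-Hans-SG", "en"], ["en", "zh-Hans"]); // "zh-Hans"
```

Because the first match wins, "zh-Hans" is returned here even though "en" is available as an exact match for the second requested tag.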

The lookup algorithm doesn’t always provide the best possible results. For example, if "es-GT" (Spanish for Guatemala) is requested, but not available, it falls back to "es", which is typically implemented as Spanish for Spain. A better choice might be the Spanish variant used in Guatemala’s neighbor Mexico, "es-MX". The API specification therefore allows implementations to provide a better “best fit” algorithm, and makes this algorithm the default.

As mentioned earlier, the Unicode extension of BCP 47 allows a number of parameters to be set as part of a language tag. For example, the use of the euro with a simplified Chinese currency format can be specified as "zh-Hans-u-cu-eur", the use of the sort order for German phone books as "de-u-co-phonebk". Multiple parameters can be combined in one language tag, e.g., "de-u-co-phonebk-cu-usd" for German with phone book sorting and the U.S. dollar as currency.

The Unicode extension creates two issues:

  • The simple fallback used by the lookup algorithm doesn’t work anymore because the parameter settings are largely independent of each other, not specializations of each other. The parameters have to be interpreted separately from the rest of the language tag.
  • Some of the parameters don’t really have anything to do with a language or locale; they’re orthogonal and applications should be able to fully control them (a language tag typically is a user setting). Currencies in particular depend on business requirements and should never be derived from a locale.

The algorithms involved in locale and parameter negotiation solve these issues in two ways:

  • Unicode extension subtag sequences are separated from the rest of a language tag. Either the lookup algorithm or the best fit algorithm is then applied to the remaining language tag to determine the language to be used, and the parameters set in the Unicode extension subtag sequence are negotiated separately.
  • The API distinguishes between three groups of parameters: those that are related to the locale and are always derived from the language tag, those that should be fully under application control and are solely obtained from the options object, and those that can be derived from the language tag, but also overridden by the application. Tables in the sections on Collator, NumberFormat, and DateTimeFormat below show for each parameter how it can be set.
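The difference between the parameter groups can be seen in a short sketch. It assumes an implementation that supports the optional numeric parameter of Collator (implementations without it report no numeric property at all); the variable names are chosen for the example.

```javascript
// Can be derived from the language tag, but overridden by the application:
// numeric sorting (Unicode extension key "kn").
var fromTag = new Intl.Collator("en-u-kn-true");
var overridden = new Intl.Collator("en-u-kn-true", {numeric: false});

var numericFromTag = fromTag.resolvedOptions().numeric;       // true if supported
var numericOverridden = overridden.resolvedOptions().numeric; // false if supported

// Solely obtained from the options object: the currency. The "cu" key in
// the language tag is not used, so requesting currency style without a
// currency option is an error.
var currencyError;
try {
    new Intl.NumberFormat("zh-Hans-u-cu-eur", {style: "currency"});
} catch (e) {
    currencyError = e; // a TypeError
}
```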

The need to decide for each key in the Unicode extension how it should be treated in the API unfortunately means that the specification cannot allow implementations to support newly added keys.

The options object can contain not only properties corresponding to some of the keys of the Unicode extension, but also properties for other parameters that let the application control the behavior of the constructed object. Most parameters have default values (either provided by the specification or locale and implementation dependent), so that for the most common use cases the options object can be omitted.

Applications can get information about the results of locale and parameter negotiation for constructed objects through the resolvedOptions method, which returns an object with properties for all parameters except the matcher parameters. The locale property contains a language tag with the locale that was selected among the implementation’s available locales, plus those Unicode extension parameters that were requested and are supported by the implementation. Applications can also use the supportedLocalesOf functions to determine which subset of a list of locales is supported by an implementation, possibly through fallbacks. There’s no function to get a list of all locales supported by an implementation as this list could be huge.
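A brief sketch of both mechanisms follows, assuming an implementation that supports Thai digits and the locale "en-US"; the requested locales are examples only.

```javascript
// Query the result of the negotiation for a constructed object.
var format = new Intl.NumberFormat("th-u-nu-thai");
var resolved = format.resolvedOptions();
// resolved.locale is the selected locale plus the supported extension keys;
// resolved.numberingSystem is the numbering system actually in use.
// There is no localeMatcher property in the result.

// Ask which of the requested locales are supported, possibly through
// fallbacks. The result preserves the order of the request.
var requested = ["th", "en-US"];
var supported = Intl.NumberFormat.supportedLocalesOf(requested,
    {localeMatcher: "lookup"});
```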

Collation

Collator objects and String.prototype.localeCompare support two usage scenarios: Sorting the strings in a list, and searching for matching strings in a set of strings. Sorting generally needs to be sensitive to minor differences in strings, such as diacritical marks or casing, so that it is clear whether pêche sorts before or after péché. In searching, on the other hand, such minor differences are often ignored. (Some languages, however, treat certain characters with diacritical marks as separate characters, thus considering the marks major differences.) For some languages, collation may define several sort orders, such as the dictionary and phone book sort orders for German, which differ in their handling of umlauted characters. In some applications, punctuation should be ignored in comparison.

The Unicode extension defines a number of additional parameters. Some of these parameters require special handling:

  • The "co" key can be used in language tags to select both usage and specific collations. In the Internationalization API, these are separated: Usage is specified through the options object, so the values "standard" and "search" are ignored if used in language tags.
  • Implementations are not required to support the parameters numeric and caseFirst. If they don’t, they still have to check the values of their options properties so that erroneous values result in the same exceptions on all implementations, but their resolvedOptions methods do not report back the settings of unsupported parameters.
  • The backwards, caseLevel, hiraganaQuaternary, normalization, variableTop, and reorder parameters defined in the Unicode extension are not supported by Collator.

Collator and String.prototype.localeCompare use the following parameters:

Parameter                                Unicode Extension Key   Values
locale                                   -                       BCP 47 language tag
localeMatcher                            -                       "best fit", "lookup"
usage (collation)                        co                      "sort", "search"
sensitivity (strength)                   ks                      "base", "accent", "case", "variant"
ignorePunctuation (alternate handling)   ka                      true, false
collation                                co                      see UTS 35, except "standard", "search"
numeric                                  kn                      true, false
caseFirst                                kf                      "upper", "lower", "false"
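The interplay of usage and sensitivity can be illustrated with a short sketch. The locale "en" and the sample strings are assumptions made for the example; the exact sign of a non-zero result depends on the locale's collation rules.

```javascript
// Sorting: the default usage is "sort", which distinguishes case and accents.
var sortCollator = new Intl.Collator("en");

// Searching: compare base letters only, ignoring case and accents.
var searchCollator = new Intl.Collator("en",
    {usage: "search", sensitivity: "base"});

var sortCase = sortCollator.compare("a", "A");       // non-zero
var searchCase = searchCollator.compare("a", "A");   // 0
var searchAccent = searchCollator.compare("a", "á"); // 0
var searchBase = searchCollator.compare("a", "b");   // negative
```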

Usage Examples

Sort array a according to the rules of the host environment’s current locale (the compare method is bound to its Collator so it can be passed directly to functions such as Array.prototype.sort):

    var collator = new Intl.Collator();
    a.sort(collator.compare);

Sort array a according to the rules for German phone books:

    var collator = new Intl.Collator("de-u-co-phonebk");
    a.sort(collator.compare);

The same could be done with String.prototype.localeCompare, but that doesn’t actually simplify the code because localeCompare cannot be passed directly to sort, and is less efficient because locale negotiation has to be repeated for each string comparison:

    a.sort(function (x, y) {
        return x.localeCompare(y, "de-u-co-phonebk");
    });

Extract those elements from array a that are similar to string s according to the rules of language lang:

    var collator = new Intl.Collator(lang, {usage: "search"});
    var matches = a.filter(function (v) {
        return collator.compare(v, s) === 0;
    });

Check whether a newly constructed Collator object supports numeric sorting; if not, use a function that enhances the collator’s behavior with this feature:

    var collator = new Intl.Collator(locales, {numeric: true});
    var f;
    if (collator.resolvedOptions().numeric) {
        f = collator.compare;
    } else {
        f = makeNumericCompare(collator);
    }
    a.sort(f);
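The makeNumericCompare function used above is left to the application. One possible sketch is shown below; it is an assumption about how such a helper might look, not part of the API. It only recognizes ASCII digits, treats "007" and "7" as equal, and loses precision for very long digit sequences.

```javascript
// Wraps a collator's compare function so that runs of ASCII digits are
// compared by numeric value and all other text by the collator.
function makeNumericCompare(collator) {
    function chunks(s) {
        // Split into alternating runs of digits and non-digits.
        return s.match(/\d+|\D+/g) || [];
    }
    return function (x, y) {
        var cx = chunks(x);
        var cy = chunks(y);
        var n = Math.min(cx.length, cy.length);
        for (var i = 0; i < n; i++) {
            var bothDigits = /^\d/.test(cx[i]) && /^\d/.test(cy[i]);
            var result = bothDigits
                ? Number(cx[i]) - Number(cy[i])
                : collator.compare(cx[i], cy[i]);
            if (result !== 0) {
                return result;
            }
        }
        // All shared chunks are equal: the string with fewer chunks sorts first.
        return cx.length - cy.length;
    };
}

var sorted = ["item10", "item9", "item1"].sort(
    makeNumericCompare(new Intl.Collator("en")));
// sorted is ["item1", "item9", "item10"]
```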

Number Formatting

NumberFormat objects and Number.prototype.toLocaleString support three format styles: plain decimal formatting, currency formatting, and percent formatting. For currency formatting the currency must be specified by the application because currency use depends on business requirements that NumberFormat cannot know about. Different numbering systems can be used, depending on implementation and locale: Western (“ASCII”) digits, (real) Arabic digits, Thai digits, Roman numerals, and more. The number of digits used to represent a number can be constrained, subject to the same limitations as for Number.prototype.toFixed and Number.prototype.toPrecision; defaults depend on the format style and, in the case of currency formatting, the currency being used. The use of grouping separators can be disabled.

NumberFormat and Number.prototype.toLocaleString use the following parameters:

Parameter                  Unicode Extension Key   Values
locale                     -                       BCP 47 language tag
localeMatcher              -                       "best fit", "lookup"
numberingSystem            nu                      see UTS 35
style                      -                       "decimal", "currency", "percent"
currency                   cu                      ISO 4217 alphabetic code
currencyDisplay            -                       "code", "symbol", "name"
minimumIntegerDigits       -                       1..21
minimumFractionDigits      -                       0..20
maximumFractionDigits      -                       0..20
minimumSignificantDigits   -                       1..21
maximumSignificantDigits   -                       1..21
useGrouping                -                       true, false
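A few of these parameters in action, as a hedged sketch: the locale "en-US" and the sample numbers are assumptions chosen so that the results are easy to read.

```javascript
var fixed = new Intl.NumberFormat("en-US",
    {minimumFractionDigits: 2, maximumFractionDigits: 2});
var ungrouped = new Intl.NumberFormat("en-US", {useGrouping: false});
var percent = new Intl.NumberFormat("en-US", {style: "percent"});

var fixedResult = fixed.format(1234.5);         // "1,234.50"
var ungroupedResult = ungrouped.format(1234.5); // "1234.5"
var percentResult = percent.format(0.25);       // "25%"

// Values outside the ranges listed in the table are rejected.
var rejected = false;
try {
    new Intl.NumberFormat("en-US", {minimumIntegerDigits: 0});
} catch (e) {
    rejected = e instanceof RangeError;
}
```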

Usage Examples

Format number n in decimal style with grouping separators according to the rules of the host environment’s current locale:

    var result = n.toLocaleString();

Format number n in decimal style with grouping separators according to the rules of language lang:

    var result = n.toLocaleString(lang);

Find the best available language among Marathi, Hindi, and English as used in India, and format number n in decimal style with grouping separators according to the rules of the selected language:

    var result = n.toLocaleString(["mr", "hi", "en-IN"]);

Format number n in currency style with grouping separators, with the localized currency symbol for Korean won, and with no fraction digits (the default for Korean won), according to the rules of language lang:

    var result = n.toLocaleString(lang, {style: "currency", currency: "KRW"});

Apply the same format to a large array of numbers n, where reusing a NumberFormat object is likely to yield better performance (the format method is bound to its NumberFormat object, so it can be passed directly to functions such as Array.prototype.map):

    var format = new Intl.NumberFormat(lang, {style: "currency", currency: "KRW"});
    var result = n.map(format.format);

Format number n in percent style with grouping separators, with at least 4 significant digits, according to Thai rules and with Thai digits:

    var result = n.toLocaleString("th-u-nu-thai", {style: "percent", minimumSignificantDigits: 4});

Check whether a newly constructed NumberFormat object supports Tamil digits; if not, use a post-processing function:

    var format = new Intl.NumberFormat("ta-u-nu-tamldec");
    var result = format.format(n);
    if (format.resolvedOptions().numberingSystem !== "tamldec") {
        result = convertToTamilDigits(result);
    }

Date and Time Formatting

DateTimeFormat objects and Date.prototype.toLocaleString with its date- and time-only variants format a time value into a string using a subset of the following date and time components: weekday, era, year, month, day, hour, minute, second, and time zone name. Different representations of these components are available: unconstrained or 2-digit numeric; narrow, short, or long text. Implementations are required to support at least the following subsets:

  • weekday, year, month, day, hour, minute, second
  • weekday, year, month, day
  • year, month, day
  • year, month
  • month, day
  • hour, minute, second
  • hour, minute

Implementations may support other subsets, and requests will be negotiated against all available subset-representation combinations to find the best match. Two algorithms are available for this negotiation: A fully specified “basic” algorithm and an implementation dependent “best fit” algorithm (the latter is the default). Different numbering systems can be used, as with NumberFormat. Hour representations can be forced from a locale-dependent default to 12-hour or 24-hour format. Implementations may support multiple calendars per locale; for time zones, they’re limited to UTC and the host environment’s current time zone.

DateTimeFormat and Date.prototype.toLocaleString with its date- and time-only variants use the following parameters:

Parameter         Unicode Extension Key   Values
locale            -                       BCP 47 language tag
localeMatcher     -                       "best fit", "lookup"
formatMatcher     -                       "best fit", "basic"
calendar          ca                      see UTS 35
numberingSystem   nu                      see UTS 35
timeZone          tz                      "UTC"
hour12            -                       true, false
weekday           -                       "narrow", "short", "long"
era               -                       "narrow", "short", "long"
year              -                       "2-digit", "numeric"
month             -                       "2-digit", "numeric", "narrow", "short", "long"
day               -                       "2-digit", "numeric"
hour              -                       "2-digit", "numeric"
minute            -                       "2-digit", "numeric"
second            -                       "2-digit", "numeric"
timeZoneName      -                       "short", "long"
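The following sketch shows the hour12 parameter and the validation of component values. The locale "en-US" and the sample time are assumptions; the exact output strings are implementation dependent, so none are listed here.

```javascript
var t = new Date(Date.UTC(2012, 11, 18, 15, 30)); // 15:30 UTC

// Force 24-hour time.
var time24 = new Intl.DateTimeFormat("en-US",
    {hour: "2-digit", minute: "2-digit", hour12: false, timeZone: "UTC"}).format(t);

// Force 12-hour time.
var time12 = new Intl.DateTimeFormat("en-US",
    {hour: "numeric", minute: "2-digit", hour12: true, timeZone: "UTC"}).format(t);

// Component values must come from the lists in the table: year accepts
// only "2-digit" and "numeric", so "long" is rejected with a RangeError.
var rejected = false;
try {
    new Intl.DateTimeFormat("en-US", {year: "long"});
} catch (e) {
    rejected = e instanceof RangeError;
}
```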

Usage Examples

Format the current date with year, month, and day components in numeric format for the host environment’s current locale and time zone:

    var result = (new Date()).toLocaleDateString();

Format time t with weekday and month components in long format and year and day components in numeric format for the host environment’s current time zone according to Thai conventions with the Thai Buddhist calendar and Thai digits (year and day accept only "2-digit" and "numeric"; any other value results in a RangeError):

    var result = (new Date(t)).toLocaleDateString("th-u-ca-buddhist-nu-thai", {weekday: "long", year: "numeric", month: "long", day: "numeric"});

Format the current time with hour, minute, and second components in 2-digit format and 24-hour time for the UTC time zone and the best available locale from list locales:

    var result = (new Date()).toLocaleTimeString(locales, {hour: "2-digit", minute: "2-digit", second: "2-digit", hour12: false, timeZone: "UTC"});

Check whether a newly constructed DateTimeFormat object supports the Islamic calendar; if not, use a function that does:

    var format = new Intl.DateTimeFormat("ar-EG-u-ca-islamicc");
    var result;
    if (format.resolvedOptions().calendar === "islamicc") {
        result = format.format(new Date());
    } else {
        result = formatWithIslamicCalendar(new Date());
    }

Initializing Subclass Objects

ECMAScript is not a class-based language, but developers have invented several ways of simulating class hierarchies. In this context it’s necessary to call the superclass constructor while actually constructing a subclass instance, and so use the superclass constructor as a function, not within a new expression. Unlike the constructors in the ECMAScript Language Specification, the constructors in the Internationalization API are designed to allow this usage:

    function MyCollator(locales, options) {
        Intl.Collator.call(this, locales, options);
        // initialize MyCollator properties
    }

    MyCollator.prototype = Object.create(Intl.Collator.prototype);
    MyCollator.prototype.constructor = MyCollator;
    // add methods to MyCollator.prototype

    var collator = new MyCollator("de-u-co-phonebk");
    a.sort(collator.compare);
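With the class syntax introduced in later editions of ECMAScript, the same subclassing idea might be sketched as follows. This is only an illustration, assuming an engine that supports class syntax and subclassing of Intl.Collator; the requestedLocales property and the describe method are hypothetical additions.

```javascript
class MyCollator extends Intl.Collator {
    constructor(locales, options) {
        // The superclass constructor performs locale and parameter negotiation.
        super(locales, options);
        // Initialize MyCollator properties (hypothetical example property).
        this.requestedLocales = locales;
    }

    // Hypothetical method added by the subclass.
    describe() {
        return "MyCollator for " + this.resolvedOptions().locale;
    }
}

var collator = new MyCollator("de-u-co-phonebk");
var sorted = ["b", "a", "c"].sort(collator.compare);
```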

Detecting the Internationalization API

Detecting whether the new Intl object and its constructors are available is easy:

    var hasIntl = this.hasOwnProperty("Intl") && typeof Intl === "object";
    var hasCollator = hasIntl && Intl.hasOwnProperty("Collator");

Detecting whether the old methods of String, Number, and Date follow the new specification is a bit trickier. Their return values depend on the language they select and on the specific implementation of that language, so they’re hard to predict. You could use an indirect test to detect Intl and assume that if it’s there the old methods will have been updated as well, but that’s, well, a bit indirect. Now, there’s one aspect where the new specification requires predictable results: Rejection of invalid values for the new arguments. For example, "i" is an invalid language tag and must be rejected with a RangeError exception.

    function hasNewLocaleCompare() {
        try {
            "a".localeCompare("b", "i");
        } catch (e) {
            return e.name === "RangeError";
        }
        return false;
    }

Can I Try?

You can: Google Chrome version 24, currently in beta, implements the API with Collator, NumberFormat, and DateTimeFormat objects as described above. Be aware though that it still fails quite a few tests of the ECMA-402 conformance test suite.

Changes from the Previous Version

Earlier versions of this article were published in October 2012, June 2012, February 2012, and November 2011. The following significant changes were made since the October version:

  • The specification has been approved by the Ecma General Assembly as standard ECMA-402.
  • Support for the Unicode locale extension key "kk" (normalization) was removed from Collator.
  • Google Chrome 24 has an unprefixed implementation.