The ECMAScript Internationalization API
February 26, 2012
- Why an Internationalization API?
- Functionality of the API
- Identification of Locales, Currencies, and Time Zones
- The Intl Object
- Locale Lists
- Locale and Parameter Negotiation
- Collation
- Number Formatting
- Date and Time Formatting
- Initializing Subclass Objects
- Changes from the Previous Version
A draft of an API specification to enable the internationalization of JavaScript applications is now available for review by Ecma TC 39, the standards body that develops ECMAScript, the standard underlying JavaScript. The ECMAScript Internationalization API Specification supports collation (string comparison), number formatting, and date and time formatting, and lets applications choose the language and tailor the functionality to their needs. The API was developed by a working group with members from Google, Microsoft, Mozilla, and Amazon; I joined as an invited expert. It’s developed as a separate standard so that the API can be added to JavaScript runtimes that implement the fifth edition of the ECMAScript Language Specification, without waiting for the sixth edition to be completed and implemented.
Like its big brother, the ECMAScript Language Specification, the Internationalization API Specification is written largely in the form of pseudocode, which makes it somewhat difficult to understand. This article provides background and a guide to the specification. Keep in mind though that the specification at this point is still a draft; it likely will change before becoming a standard. You can help shape it by sending comments to the ECMAScript discussion mailing list.
Why an Internationalization API?
The ECMAScript Language Specification offers little internationalization support. Strings are based on Unicode, which is a good start. There are a few locale sensitive functions, in particular String.prototype.localeCompare, Number.prototype.toLocaleString, and Date.prototype.toLocaleString with its date- and time-only variants. However, none of these functions let applications specify the language or control details of their behavior, so they’re pretty useless in practice. And that’s pretty much it.
Some JavaScript libraries, such as Dojo, Closure, Globalize, or YUI, have filled in some of the gaps by providing their own number and date formatting as well as mechanisms for loading localized resources. Collation, however, requires large tables and complicated algorithms, and so the typical solution has been to send string lists back to a server with a good internationalization library and have them sorted there. This introduces delays, and doesn’t work at all for code that’s not connected to a server, such as JavaScript-based user interface extensions.
At the same time, ECMAScript implementations, whether they’re part of browsers, servers, or other systems, run on top of operating systems that already include comprehensive internationalization libraries. The goal of the ECMAScript Internationalization API is to provide JavaScript applications with a standard interface to these internationalization libraries.
Functionality of the API
The functionality of the first version of the API is fairly limited and mirrors the functionality of the core language: Collation (string comparison), number formatting, and date and time formatting. However, it lets applications specify languages, assisting with language negotiation, as well as control details of the behavior. One major omission even in the provided functionality is comprehensive time zone support: Not all internationalization libraries offer this yet, so for now only UTC and the “host environment’s current time zone” (in a browser, that’s usually the time zone the user has set for the operating system) are supported. Of other common internationalization functionality, resource loading is missing because ECMAScript doesn’t have an I/O system to build on, message construction is missing because the working group couldn’t converge on a solution fast enough, and normalization and other functionality are missing because their priority is lower than that of the chosen three.
The existing locale sensitive methods String.prototype.localeCompare, Number.prototype.toLocaleString, and Date.prototype.toLocaleString with its date- and time-only variants were respecified to take locale and other parameters and interpret them in the same way as the new API introduced by the Internationalization API Specification.
The API can be used in two ways:
- Applications can use the existing methods on String, Number, and Date with the new parameters. This is simple, but does not let applications find out whether their requests can be satisfied exactly, and may result in slightly lower performance.
- Applications can create objects for the language sensitive services, requesting locales and other parameters, and then use a compare or format method of the constructed object. This pattern allows for efficient reuse of the completely configured objects, and also allows applications to query the objects for details of their configuration.
Identification of Locales, Currencies, and Time Zones
The API relies on widely used standards to identify locales and currencies: IETF Best Current Practice (BCP) 47, and ISO 4217.
BCP 47 consists of two RFCs, currently 5646 and 4647, and the IANA Language Subtag Registry. The language tags it defines are used in HTML, CSS, XML, HTTP, and other standard formats and protocols. In most cases, language tags are straightforward: A language code, possibly a script code, possibly a country code, all separated by hyphens: "de" for German, "zh-Hans" for simplified Chinese, "zh-Hans-SG" for simplified Chinese as used in Singapore. However, BCP 47 allows for extensions, and so far two have been defined. The newer one of these, the "t" or Transformed Content extension defined in RFC 6497, can be safely ignored by the Internationalization API. On the other hand, the "u" or Unicode extension, defined in RFC 6067 and Unicode Technical Standard 35, matters a lot. This extension allows the specification of additional parameters for collation, number formatting, and date and time formatting as part of a language tag, and so it has to be taken into consideration in the design of language and parameter negotiation in the API, as discussed below. The possibility of parameters that are not related to languages is the reason why the more generic term “locale” is used.
ISO 4217 defines both numeric and alphabetic codes for currencies, of which the Internationalization API uses only the alphabetic 3-letter codes. The code list also provides information on the number of digits typically used to represent the minor units of a currency, e.g. 2 for the euro (which has 100 cents), or 0 for the yen (whose minor unit sen isn’t used anymore).
For time zones, there isn’t much to identify: This version of the API only supports UTC and the host environment’s current time zone. The first is identified by "UTC". There is no identifier for the second; it’s used when the time zone is left undefined.
The Intl Object
The Intl object provides a namespace for the constructors of the Internationalization API – we expect that over time the API will add more constructors, and want to minimize the risk of name collisions. Its name started out as Internationalization or its uglified form i18n; for a while we used Globalization; finally we switched to Intl because it’s shorter. Other candidates were Text, Presentation, Format, World, or 国际化 (nothing beats Chinese for brevity!).
Locale Lists
LocaleList objects contain lists of language tags. The language tags are stored as indexed elements, and a length property provides the number of elements, so that locale lists can be passed to functions that accept generic arrays, such as the functions in the Array prototype object. The constructor verifies that all language tags provided to it are well-formed (i.e., match the grammar of BCP 47 language tags), converts them into their canonical form so that they’re easy to compare and process, and removes duplicates. To ensure consistency, locale lists are immutable.
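The constructor behavior described above (check well-formedness, canonicalize, remove duplicates, freeze) can be sketched in plain ECMAScript 5. This is only an illustration, not the specified algorithm: makeLocaleList and canonicalizeSimpleTag are hypothetical names that are not part of the API, and the pattern accepts only tags of the form language-script-region rather than the full BCP 47 grammar.

```javascript
// Simplified well-formedness check: language, optional script, optional region.
// Real BCP 47 tags also allow variants, extensions, and private use subtags.
var SIMPLE_TAG = /^([a-z]{2,3})(?:-([a-z]{4}))?(?:-([a-z]{2}|[0-9]{3}))?$/i;

// Hypothetical helper: returns the tag in canonical case or throws a RangeError.
function canonicalizeSimpleTag(tag) {
    var m = SIMPLE_TAG.exec(tag);
    if (m === null) {
        throw new RangeError("not a well-formed (simple) language tag: " + tag);
    }
    var parts = [m[1].toLowerCase()];
    if (m[2]) {
        // script codes are written in title case, e.g. "Hans"
        parts.push(m[2].charAt(0).toUpperCase() + m[2].slice(1).toLowerCase());
    }
    if (m[3]) {
        // region codes are written in upper case, e.g. "SG"
        parts.push(m[3].toUpperCase());
    }
    return parts.join("-");
}

// Hypothetical stand-in for the LocaleList constructor: a frozen array of
// canonicalized tags with duplicates removed, in the original priority order.
function makeLocaleList(tags) {
    var seen = {}, list = [];
    tags.forEach(function (tag) {
        var canonical = canonicalizeSimpleTag(String(tag));
        if (!Object.prototype.hasOwnProperty.call(seen, canonical)) {
            seen[canonical] = true;
            list.push(canonical);
        }
    });
    return Object.freeze(list);
}

var list = makeLocaleList(["ZH-hans-sg", "de", "zh-Hans-SG"]);
// list holds "zh-Hans-SG" and "de", and is frozen
```

Because the result is array-like and frozen, it can be handed to several consumers without any of them being able to change what the others see, which is the consistency argument made for immutable locale lists.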
Applications do not have to use LocaleList objects; functions that accept them also accept generic arrays with language tags. However, using LocaleList objects can avoid the overhead of verifying and canonicalizing language tags repeatedly.
Applications can obtain the language tag for the host environment’s current locale by constructing a LocaleList object with no argument – the constructor then fills in the host environment’s current locale at index 0.
The working group discussed extensively whether there should be an API to set a default locale list that would then be used throughout the Internationalization API. Two issues prevent this: First, a settable default locale list would create a global communication channel between different scripts running within the same environment, which is considered a security risk. Second, an application may include different components, such as embedded apps, that need different default locales. ECMAScript has no knowledge of these components and no way to manage appropriate contexts for them. We decided therefore that a default locale is better left to higher-level systems. For example, the YUI library already includes an Intl module which manages a list of requested locales that is scoped to the containing YUI object and used for loading resource bundles. This module could easily be modified to keep a locale list object so that it can be used as a default within the scope of the containing YUI object.
Usage Examples
Create a locale list with the single language tag for the language of a monolingual application, in this example Indonesian for Indonesia:
var locales = new Intl.LocaleList(["id-ID"]);
Create a locale list from the value of an HTTP Accept-Language header, one of the possible sources of information about the user’s language preference:
function localeListFromAcceptLanguage(value) {
    // a language range, optionally followed by a qvalue between 0 and 1
    var pattern = /^([a-zA-Z0-9\-]+)\s*(;\s*q\s*=\s*(0(\.[0-9]{0,3})?|1(\.0{0,3})?))?$/;
    function filter(section) {
        // throw out empty or malformed sections, special range "*", and unacceptable language ranges (q=0)
        return pattern.test(section) && section.match(/;\s*q\s*=\s*(0(\.0{0,3})?)$/) === null;
    }
    function toRecord(section, index) {
        // break out language range and qvalue for sorting. Include index for stable sorting.
        var match = section.match(pattern);
        return {l: match[1], q: match[3] ? +match[3] : 1, i: index};
    }
    function compareRecord(a, b) {
        // higher qvalue first; include index in comparison to ensure stable sorting
        return (a.q < b.q) ? 1 : (a.q > b.q) ? -1 : (a.i - b.i);
    }
    function toLang(r) {
        return r.l;
    }
    var sections = value.trim().split(/\s*,\s*/);
    return new Intl.LocaleList(sections.filter(filter).map(toRecord).sort(compareRecord).map(toLang));
}
Locale and Parameter Negotiation
The constructors for the three locale sensitive services, Collator, NumberFormat, and DateTimeFormat, as well as the respecified locale sensitive functions in String, Number, and Date each take a locale list and an options argument. Together, these arguments form a request, which the constructors compare against the capabilities of their implementations to determine the actual locale and parameter settings to be used (a somewhat one-sided “negotiation”). Applications can find out about the result of the negotiation for constructed objects through the resolvedOptions accessor property. The request passed to the API may be fixed by the application (an application that only supports one language will request exactly that language), or may be the result of a higher-level negotiation between user preferences and application capabilities. The API does not deal with such higher-level negotiations.
For language negotiation, BCP 47 provides a simple “lookup” algorithm, which compares a request consisting of a prioritized list of language tags with a set of available languages. It takes the language tags of the request in sequence and checks for each one whether it can be matched with one of the available languages either directly or with a fallback, where the fallback simply strips off subtags from the end, e.g., from "zh-Hans-SG" to "zh-Hans". The first match wins.
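The lookup algorithm just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the text of the specification: the function name lookup, its parameters, and the explicit defaultLocale argument are hypothetical, and all tags are assumed to be canonicalized already so that plain string comparison is sufficient.

```javascript
// requested: prioritized array of language tags (the request)
// available: array of language tags the implementation supports
// defaultLocale: returned when nothing matches (an assumption of this sketch)
function lookup(requested, available, defaultLocale) {
    for (var i = 0; i < requested.length; i++) {
        var candidate = requested[i];
        while (true) {
            if (available.indexOf(candidate) !== -1) {
                return candidate; // the first match wins
            }
            // fallback: strip the last subtag
            var pos = candidate.lastIndexOf("-");
            if (pos === -1) {
                break; // nothing left to strip, try the next requested tag
            }
            candidate = candidate.substring(0, pos);
            // do not leave a single-character subtag (an extension
            // singleton such as "u") dangling at the end
            if (pos >= 2 && candidate.charAt(pos - 2) === "-") {
                candidate = candidate.substring(0, pos - 2);
            }
        }
    }
    return defaultLocale;
}

var exact = lookup(["zh-Hans-SG"], ["en", "zh-Hans-SG", "zh-Hans"], "en");
var fallback = lookup(["zh-Hans-SG", "de"], ["en", "de", "zh-Hans"], "en");
var none = lookup(["ja"], ["en", "de"], "en");
```

In the second call the first requested tag matches through fallback, so "de" is never considered even though it is available exactly; this is the sense in which the request is a prioritized list.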
The lookup algorithm doesn’t always provide the best possible results. For example, if "es-GT" (Spanish for Guatemala) is requested, but not available, it falls back to "es", which is typically implemented as Spanish for Spain. A better choice might be the Spanish variant used in Guatemala’s neighbor Mexico, "es-MX". The API specification therefore allows implementations to provide a better “best fit” algorithm, and makes this algorithm the default.
As mentioned earlier, the Unicode extension of BCP 47 allows a number of parameters to be set as part of a language tag. For example, the use of the euro with a simplified Chinese currency format can be specified as "zh-Hans-u-cu-eur", the use of the sort order for German phone books as "de-u-co-phonebk". Multiple parameters can be combined in one language tag, e.g., "de-u-co-phonebk-cu-usd" for German with phone book sorting and the U.S. dollar as currency.
The Unicode extension creates two issues:
- The simple fallback used by the lookup algorithm doesn’t work anymore because the parameter settings are largely independent of each other, not specializations of each other. The parameters have to be interpreted separately from the rest of the language tag.
- Some of the parameters don’t really have anything to do with a language or locale; they’re orthogonal and applications should be able to fully control them (a language tag typically is a user setting). Currencies in particular depend on business requirements and should never be derived from a locale.
The algorithms involved in locale and parameter negotiation solve these issues in two ways:
- Unicode extension subtag sequences are separated from the rest of a language tag. Either the lookup algorithm or the best fit algorithm is then applied to the remaining language tag to determine the language to be used, and the parameters set in the Unicode extension subtag sequence are negotiated separately.
- The API distinguishes between three groups of parameters: those that are related to the locale and are always derived from the language tag, those that should be fully under application control and are solely obtained from the options object, and those that can be derived from the language tag, but also overridden by the application. Tables in the sections on Collator, NumberFormat, and DateTimeFormat below show for each parameter how it can be set.
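The first step above, separating the Unicode extension subtag sequence from the rest of the language tag, might look roughly like the following. This is a simplified sketch: splitUnicodeExtension is a hypothetical name, the tag is assumed to be lower-cased in its extension part, and attributes before the first key as well as private use sequences are not handled.

```javascript
// Splits a language tag into the tag without its Unicode ("u") extension
// and a map of the extension's keys to their values.
function splitUnicodeExtension(tag) {
    // "-u" followed by one or more subtags of 2 to 8 characters; a following
    // single-character subtag (another extension singleton) ends the match
    var match = /-u(?:-[a-z0-9]{2,8})+/.exec(tag);
    if (match === null) {
        return {base: tag, keywords: {}};
    }
    var base = tag.slice(0, match.index) + tag.slice(match.index + match[0].length);
    var subtags = match[0].split("-").slice(2); // drop "" and "u"
    var keywords = {}, key = null;
    subtags.forEach(function (subtag) {
        if (subtag.length === 2) {
            // two-character subtags are keys
            key = subtag;
            keywords[key] = "";
        } else if (key !== null) {
            // longer subtags belong to the value of the preceding key
            keywords[key] = keywords[key] === "" ? subtag : keywords[key] + "-" + subtag;
        }
    });
    return {base: base, keywords: keywords};
}

var parsed = splitUnicodeExtension("de-u-co-phonebk-cu-usd");
// parsed.base can now be negotiated with lookup or best fit on its own.
// Following the rules in the text, a collator would honor the "co" value,
// while the "cu" value would be ignored: the currency is taken from the
// options object only.
```

Keeping the keys in a separate map is what allows each one to be negotiated, overridden, or ignored independently of the language part of the tag.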
The need to decide for each key in the Unicode extension how it should be treated in the API unfortunately means that the specification cannot allow implementations to support newly added keys.
The options object can contain not only properties corresponding to some of the keys of the Unicode extension, but also properties for other parameters that let the application control the behavior of the constructed object. Most parameters have default values (either provided by the specification or locale and implementation dependent), so that for the most common use cases the options object can be omitted.
Applications can get information about the results of locale and parameter negotiation for constructed objects through the resolvedOptions accessor property, which has properties for all parameters except the matcher parameters. The locale property contains a language tag with the locale that was selected among the implementation’s available locales, plus those Unicode extension parameters that were requested and are supported by the implementation. Applications can also use the supportedLocalesOf functions to determine which subset of a list of locales is supported by an implementation, possibly through fallbacks. There’s no function to get a list of all locales supported by an implementation as this list could be huge.
Collation
Collator objects and String.prototype.localeCompare support two usage scenarios: Sorting the strings in a list, and searching for matching strings in a set of strings. Sorting generally needs to be sensitive to minor differences in strings, such as diacritical marks or casing, so that it is clear whether pêche sorts before or after péché. In searching, on the other hand, such minor differences are often ignored. (Some languages, however, treat certain characters with diacritical marks as separate characters, thus considering the marks major differences.) For some languages, collation may define several sort orders, such as the dictionary and phone book sort orders for German, which differ in their handling of umlauted characters. In some applications, punctuation should be ignored in comparison.
The Unicode extension defines a number of additional parameters. Some of these parameters require special handling:
- The "co" key can be used in language tags to select both usage and specific collations. In the Internationalization API, these are separated: Usage is specified through the options object, so the values "standard" and "search" are ignored if used in language tags.
- Implementations are not required to support the parameters backwards, caseLevel, numeric, hiraganaQuaternary, normalization, and caseFirst. If they don’t, they still have to check the values of their options properties so that erroneous values result in the same exceptions on all implementations, but their resolvedOptions properties do not report back the settings of unsupported parameters.
- The variableTop parameter defined in the Unicode extension is not supported by Collator.
Collator and String.prototype.localeCompare use the following parameters:
Parameter | Language Tag | Options | Values |
---|---|---|---|
locale | ✔ | — | BCP 47 language tag |
localeMatcher | — | ✔ | "best fit", "lookup" |
usage (collation) | co — | ✔ | "sort", "search" |
sensitivity (strength) | ks — | ✔ | "base", "accent", "case", "variant" |
ignorePunctuation (alternate handling) | ka — | ✔ | true, false |
collation | co ✔ | — | see UTS 35, except "standard", "search" |
backwards | kb ✔ | ✔ | true, false |
caseLevel | kc ✔ | ✔ | true, false |
numeric | kn ✔ | ✔ | true, false |
hiraganaQuaternary | kh ✔ | ✔ | true, false |
normalization | kk ✔ | ✔ | true, false |
caseFirst | kf ✔ | ✔ | "upper", "lower", "false" |
Usage Examples
Sort array a according to the rules of the host environment’s current locale:
var collator = new Intl.Collator();
a.sort(collator.compare);
Sort array a according to the rules for German phone books:
var collator = new Intl.Collator(["de-u-co-phonebk"]);
a.sort(collator.compare);
The same could be done with String.prototype.localeCompare, but that doesn’t actually simplify the code because localeCompare cannot be passed directly to sort, and is less efficient because locale negotiation has to be repeated for each string comparison:
a.sort(function (x, y) {
    return x.localeCompare(y, ["de-u-co-phonebk"]);
});
Extract those elements from array a that are similar to string s according to the rules of language lang:
var collator = new Intl.Collator([lang], {usage: "search"});
var matches = a.filter(function (v) {
    return collator.compare(v, s) === 0;
});
Check whether a newly constructed Collator object supports numeric sorting; if not, use a function that enhances the collator’s behavior with this feature:
var collator = new Intl.Collator(locales, {numeric: true});
var f;
if (collator.resolvedOptions.numeric) {
    f = collator.compare;
} else {
    f = makeNumericCompare(collator);
}
a.sort(f);
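The sensitivity parameter from the table above is not covered by the preceding examples, so here is one more sketch: find all entries of a word list that match a search term when case and diacritical marks are ignored. The word list and the choice of English are made up for illustration, and the exact result depends on the collation data of the implementation.

```javascript
// Hypothetical data for the example
var words = ["Résumé", "resume", "RESUME", "résumé", "rosé"];

// "base" sensitivity treats strings as equal if they differ only in
// accents or case; other differences still count
var collator = new Intl.Collator(["en"], {usage: "search", sensitivity: "base"});

var matches = words.filter(function (w) {
    return collator.compare(w, "resume") === 0;
});
```

With typical English collation data one would expect the four spellings of "resume" to match and "rosé" to be rejected. For a language that treats an accented letter as a separate letter, the same request could give a different answer, which is why sensitivity is interpreted per locale rather than as a plain accent-stripping step.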
Number Formatting
NumberFormat objects and Number.prototype.toLocaleString support three format styles: plain decimal formatting, currency formatting, and percent formatting. For currency formatting the currency must be specified by the application because currency use depends on business requirements that NumberFormat cannot know about. Different numbering systems can be used, depending on implementation and locale: Western (“ASCII”) digits, (real) Arabic digits, Thai digits, Roman numerals, and more. The number of digits used to represent a number can be constrained, subject to the same limitations as for Number.prototype.toFixed and Number.prototype.toPrecision; defaults depend on the format style and, in the case of currency formatting, the currency being used. The use of grouping separators can be disabled.
NumberFormat and Number.prototype.toLocaleString use the following parameters:
Parameter | Language Tag | Options | Values |
---|---|---|---|
locale | ✔ | — | BCP 47 language tag |
localeMatcher | — | ✔ | "best fit", "lookup" |
numberingSystem | nu ✔ | — | see UTS 35 |
style | — | ✔ | "decimal", "currency", "percent" |
currency | cu — | ✔ | ISO 4217 alphabetic code |
currencyDisplay | — | ✔ | "code", "symbol", "name" |
minimumIntegerDigits | — | ✔ | 1..21 |
minimumFractionDigits | — | ✔ | 0..20 |
maximumFractionDigits | — | ✔ | 0..20 |
minimumSignificantDigits | — | ✔ | 1..21 |
maximumSignificantDigits | — | ✔ | 1..21 |
useGrouping | — | ✔ | true, false |
Usage Examples
Format number n in decimal style with grouping separators according to the rules of the host environment’s current locale:
var result = n.toLocaleString();
Format number n in decimal style with grouping separators according to the rules of language lang:
var result = n.toLocaleString([lang]);
Format number n in currency style with grouping separators, with the localized currency symbol for Korean won, and with no fraction digits (the default for Korean won), according to the rules of language lang:
var result = n.toLocaleString([lang], {style: "currency", currency: "KRW"});
Apply the same format to a large array of numbers n, where reusing a NumberFormat object is likely to yield better performance:
var format = new Intl.NumberFormat([lang], {style: "currency", currency: "KRW"});
var i, len = n.length, result = new Array(len);
for (i = 0; i < len; i++) {
    result[i] = format.format(n[i]);
}
Format number n in percent style with grouping separators, with at least 4 significant digits, according to Thai rules and with Thai digits:
var result = n.toLocaleString(["th-u-nu-thai"], {style: "percent", minimumSignificantDigits: 4});
Check whether a newly constructed NumberFormat object supports Tamil digits; if not, use a post-processing function:
var format = new Intl.NumberFormat(["ta-u-nu-tamldec"]);
var result = format.format(n);
if (format.resolvedOptions.numberingSystem !== "tamldec") {
    result = convertToTamilDigits(result);
}
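The currency-dependent digit defaults mentioned earlier (2 for the euro, 0 for the yen) can be observed through the resolved options. Since the specification is still a draft, the sketch below does not rely on the exact shape of resolvedOptions: the hypothetical helper getResolvedOptions accepts either the accessor property described in this article or a method, the latter being an assumption about implementations that differ from this draft. The helper names and the use of English are illustrative only.

```javascript
// Hypothetical helper: read the resolved options whether resolvedOptions
// is exposed as an accessor property or as a method.
function getResolvedOptions(service) {
    var r = service.resolvedOptions;
    return typeof r === "function" ? r.call(service) : r;
}

// Number of fraction digits a currency format uses by default
function defaultFractionDigits(currency) {
    var format = new Intl.NumberFormat(["en"], {style: "currency", currency: currency});
    return getResolvedOptions(format).maximumFractionDigits;
}

var eurDigits = defaultFractionDigits("EUR");
var jpyDigits = defaultFractionDigits("JPY");
```

An application that needs different precision, for example for unit prices, can still override the defaults with minimumFractionDigits and maximumFractionDigits in the options object.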
Date and Time Formatting
DateTimeFormat objects and Date.prototype.toLocaleString with its date- and time-only variants format a time value into a string using a subset of the following date and time components: weekday, era, year, month, day, hour, minute, second, and time zone name. Different representations of these components are available: unconstrained or 2-digit numeric; narrow, short, or long text. Implementations are required to support at least the following subsets:
- weekday, year, month, day, hour, minute, second
- weekday, year, month, day
- year, month, day
- year, month
- month, day
- hour, minute, second
- hour, minute
Implementations may support other subsets, and requests will be negotiated against all available subset-representation combinations to find the best match. Two algorithms are available for this negotiation: A fully specified “basic” algorithm and an implementation dependent “best fit” algorithm (the latter is the default). Different numbering systems can be used, as with NumberFormat. Hour representations can be forced from a locale-dependent default to 12-hour or 24-hour format. Implementations may support multiple calendars per locale; for time zones, they’re limited to UTC and the host environment’s current time zone.
DateTimeFormat and Date.prototype.toLocaleString with its date- and time-only variants use the following parameters:
Parameter | Language Tag | Options | Values |
---|---|---|---|
locale | ✔ | — | BCP 47 language tag |
localeMatcher | — | ✔ | "best fit", "lookup" |
formatMatcher | — | ✔ | "best fit", "basic" |
calendar | ca ✔ | — | see UTS 35 |
numberingSystem | nu ✔ | — | see UTS 35 |
timeZone | tz — | ✔ | "UTC" |
hour12 | — | ✔ | true, false |
weekday | — | ✔ | "narrow", "short", "long" |
era | — | ✔ | "narrow", "short", "long" |
year | — | ✔ | "2-digit", "numeric" |
month | — | ✔ | "2-digit", "numeric", "narrow", "short", "long" |
day | — | ✔ | "2-digit", "numeric" |
hour | — | ✔ | "2-digit", "numeric" |
minute | — | ✔ | "2-digit", "numeric" |
second | — | ✔ | "2-digit", "numeric" |
timeZoneName | — | ✔ | "short", "long" |
Usage Examples
Format the current date with year, month, and day components in numeric format for the host environment’s current locale and time zone:
var result = (new Date()).toLocaleDateString();
Format time t with weekday, year, month, and day components, using long format for weekday and month and numeric format for year and day (which have no long representation), for the host environment’s current time zone according to Thai conventions with the Thai Buddhist calendar and Thai digits:
var result = (new Date(t)).toLocaleDateString(["th-u-ca-buddhist-nu-thai"], {weekday: "long", year: "numeric", month: "long", day: "numeric"});
Format the current time with hour, minute, and second components in 2-digit format and 24-hour time for the UTC time zone and the best available locale from list locales:
var result = (new Date()).toLocaleTimeString(locales, {hour: "2-digit", minute: "2-digit", second: "2-digit", hour12: false, timeZone: "UTC"});
Check whether a newly constructed DateTimeFormat object supports the Islamic calendar; if not, use a function that does:
var format = new Intl.DateTimeFormat(["ar-EG-u-ca-islamicc"]);
var result;
if (format.resolvedOptions.calendar === "islamicc") {
    result = format.format(new Date());
} else {
    result = formatWithIslamicCalendar(new Date());
}
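As with NumberFormat, a DateTimeFormat object can be configured once and reused. The following sketch formats a fixed point in time with a mix of long and numeric components in 24-hour time for the UTC time zone; the locale, the chosen components, and the sample time are made up for illustration, and the exact output string is implementation dependent.

```javascript
var format = new Intl.DateTimeFormat(["en-US"], {
    year: "numeric", month: "long", day: "numeric",
    hour: "2-digit", minute: "2-digit", hour12: false,
    timeZone: "UTC"
});

// February 26, 2012, 15:30 UTC as a time value
var t = Date.UTC(2012, 1, 26, 15, 30);

// Because the time zone is fixed to UTC, the result does not depend on
// the time zone of the host environment
var result = format.format(new Date(t));
```

Requesting "UTC" explicitly is the only way in this version of the API to get output that is independent of the host environment’s time zone setting, which makes it a reasonable choice for logs or other shared records.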
Initializing Subclass Objects
ECMAScript is not a class-based language, but developers have invented several ways of simulating class hierarchies. In this context it’s necessary to call the superclass constructor while actually constructing a subclass instance, and so use the superclass constructor as a function, not within a new expression. Unlike the constructors in the ECMAScript Language Specification, the constructors in the Internationalization API are designed to allow this usage:
function MyCollator(localeList, options) {
    Intl.Collator.call(this, localeList, options);
    // initialize MyCollator properties
}
MyCollator.prototype = Object.create(Intl.Collator.prototype);
MyCollator.prototype.constructor = MyCollator;
// add methods to MyCollator.prototype
var collator = new MyCollator(["de-u-co-phonebk"]);
a.sort(collator.compare);
Changes from the Previous Version
This article is an updated version of The ECMAScript Globalization API, which I published in November 2011, around the time the working group submitted the first draft of the specification to TC39. The following significant changes were made in the meantime:
- We changed the name from Globalization API to Internationalization API, and the namespace from Globalization to Intl.
- The specification now respecifies the existing locale sensitive methods, String.prototype.localeCompare, Number.prototype.toLocaleString, and Date.prototype.toLocaleString with its date- and time-only variants.
- The concept of “best fit” algorithms was introduced so that implementations can improve on the predefined algorithms for locale and date-time format matching.
- Collator.prototype.compare was changed so that its value can be passed directly to Array.prototype.sort.
- The constructors are now specified so that they can be called by subclass constructors.
- We dropped the idea of adding ValueError objects to handle errors in argument values, and expanded the use of RangeError instead.
- The Intl object is now extensible – this may prevent it from becoming a module object in the future when ECMAScript supports modules, but we don’t know enough about that future yet to justify this limitation.