The ECMAScript Globalization API
November 10, 2011
- Why a Globalization API?
- Functionality of the API
- Identification of Locales, Currencies, and Time Zones
- The Globalization Object
- Locale Lists
- Locale and Parameter Negotiation
- Collation
- Number Formatting
- Date and Time Formatting
- Value Errors
A draft of an API specification to enable the globalization of JavaScript applications is now available for review by Ecma TC 39, the standards body that develops ECMAScript, the standard underlying JavaScript. The ECMAScript Globalization API supports collation (string comparison), number formatting, and date and time formatting, and lets applications choose the language and tailor the functionality to their needs. The API was developed by a working group with members from Google, Microsoft, Mozilla, and Amazon; I joined as an invited expert. It’s developed as a separate standard so that the API can be added to JavaScript runtimes that implement the fifth edition of the ECMAScript Language Specification, without waiting for the sixth edition to be completed and implemented.
Like its big brother, the ECMAScript Language Specification, the Globalization API Specification is written largely in the form of pseudocode, which makes it somewhat difficult to understand. This article provides background and a guide to the specification. Keep in mind though that the specification at this point is just a draft; it likely will change significantly before becoming a standard. You can help shape it by sending comments to the ECMAScript discussion mailing list.
Why a Globalization API?
The ECMAScript Language Specification offers little internationalization support. Strings are based on Unicode, which is a good start. There are a few locale sensitive functions, in particular String.prototype.localeCompare, Number.prototype.toLocaleString, and Date.prototype.toLocaleString with its date- and time-only variants. However, none of these functions let applications specify the language or control details of their behavior, so they’re pretty useless in practice. And that’s pretty much it.
Some JavaScript libraries, such as Dojo, Closure, or YUI, have filled in some of the gaps by providing their own number and date formatting as well as mechanisms for loading localized resources. Collation, however, requires large tables and complicated algorithms, and so the typical solution has been to send string lists back to a server with a good internationalization library and have them sorted there. This introduces delays, and doesn’t work at all for code that’s not connected to a server, such as JavaScript-based user interface extensions.
At the same time, ECMAScript implementations, whether they’re part of browsers, servers, or other systems, run on top of operating systems that already include comprehensive internationalization libraries. The goal of the ECMAScript Globalization API is to provide JavaScript applications with a standard interface to these internationalization libraries.
Functionality of the API
The functionality of the first version of the API is fairly limited and mirrors the functionality of the core language: collation (string comparison), number formatting, and date and time formatting. However, it lets applications specify languages, assists with language negotiation, and lets applications control details of the behavior. One major omission even in the provided functionality is comprehensive time zone support: Not all internationalization libraries offer this yet, so for now only UTC and the “host environment’s current time zone” (in a browser, that’s usually the time zone the user has set for the operating system) are supported.
The general usage pattern for the Globalization API is to create objects for the language sensitive services, requesting locales and other parameters, and then use a compare or format method of the constructed object. This pattern allows for efficient reuse of the completely configured objects, and also allows applications to query the objects for details of their configuration.
Retrofitting the existing locale sensitive methods on String, Number, and Date to take locale and other parameters would allow for a simpler usage pattern, but has to be done in the ECMAScript Language Specification. A proposal exists.
Identification of Locales, Currencies, and Time Zones
The API relies on widely used standards to identify locales and currencies: IETF Best Current Practice (BCP) 47, and ISO 4217.
BCP 47 consists of two RFCs, currently 5646 and 4647, and the IANA Language Subtag Registry. The language tags it defines are used in HTML, CSS, XML, HTTP, and other standard formats and protocols. In most cases, language tags are straightforward: a language code, possibly a script code, possibly a country code, all separated by hyphens: "de" for German, "zh-Hans" for simplified Chinese, "zh-Hans-SG" for simplified Chinese as used in Singapore. However, BCP 47 allows for extensions, and so far one has been defined: The "u" or Unicode extension, defined in RFC 6067 and Unicode Technical Standard 35. This extension allows the specification of additional parameters for collation, number formatting, and date and time formatting as part of a language tag, and so it has to be taken into consideration in the design of language and parameter negotiation in the API, as discussed below. The possibility of parameters that are not related to languages is the reason why the more generic term “locale” is used.
ISO 4217 defines both numeric and alphabetic codes for currencies, of which the Globalization API uses only the alphabetic 3-letter codes. The code list also provides information on the number of digits typically used to represent the minor units of a currency, e.g. 2 for the euro (which has 100 cents), or 0 for the yen (whose minor unit sen isn’t used anymore).
For time zones, there isn’t much to identify: This version of the API only supports UTC and the host environment’s current time zone. The first is identified by "UTC". There is no identifier for the second; it’s used when the time zone is left undefined.
The Globalization Object
The Globalization
object provides a namespace for the constructors
of the Globalization API – we expect that over time the API will add more
constructors, and want to minimize the risk of name collisions. Its name
started out as Internationalization
or its uglified form i18n
;
we switched to Globalization
because it’s shorter. Other candidates
were Text
, Presentation
, Format
, World
,
or 国际化
(nothing beats Chinese
for brevity!).
The object is not extensible so that it can become a module once ECMAScript starts supporting those (as currently proposed, modules cannot be extended). Libraries that enhance globalization support therefore have to export the enhanced API themselves rather than patching the Globalization object.
Locale Lists
LocaleList objects contain lists of language tags. The language tags are stored as indexed elements, and a length property provides the number of elements, so that locale lists can be passed to functions that accept generic arrays, such as the functions in the Array prototype object. The constructor verifies that all language tags provided to it are well-formed (i.e., match the grammar of BCP 47 language tags), converts them into their canonical form so that they’re easy to compare and process, and removes duplicates. To ensure consistency, locale lists are immutable.
Applications do not have to use LocaleList objects; functions that accept them also accept generic arrays with language tags. However, using LocaleList objects can avoid the overhead of verifying and canonicalizing language tags repeatedly. Applications can obtain the language tag for the host environment’s current locale by constructing a LocaleList object with no argument – the constructor then fills in the host environment’s current locale at index 0.
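The checking, canonicalization, and duplicate removal described above can be approximated in a runtime that doesn’t have LocaleList. The following is only a sketch under assumptions: the helper name makeLocaleList is hypothetical, and it relies on the Intl.getCanonicalLocales function of the later ECMAScript Internationalization API being available. It does not fill in the host environment’s current locale when called without tags.

```javascript
// Hypothetical helper approximating a LocaleList; not part of the draft API.
// Assumes Intl.getCanonicalLocales is available in the runtime.
function makeLocaleList(tags) {
  // Throws a RangeError if a tag is not well-formed; otherwise returns the
  // tags in canonical form with duplicates removed.
  var canonical = Intl.getCanonicalLocales(tags);
  // Freeze the array so that the list cannot be modified afterwards.
  return Object.freeze(canonical);
}

var list = makeLocaleList(["DE-de", "de-DE", "zh-hans-sg"]);
// list[0] is "de-DE", list[1] is "zh-Hans-SG", list.length is 2
```

Because the result is a frozen array, it can still be passed to the functions of the Array prototype object that don’t modify their receiver, such as forEach or indexOf.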
Locale and Parameter Negotiation
The constructors for the three locale sensitive services, Collator, NumberFormat, and DateTimeFormat, each take a locale list and an options argument. Together, these arguments form a request, which the constructors compare against the capabilities of their implementations to determine the actual locale and parameter settings to be used (a somewhat one-sided “negotiation”). Applications can find out about the result of the negotiation through the resolvedOptions accessor property of the constructed object. The request passed to the API may be fixed by the application (an application that only supports one language will request exactly that language), or may be the result of a higher-level negotiation between user preferences and application capabilities. The API does not deal with such higher-level negotiations.
For language negotiation, BCP 47 provides a simple “Lookup” algorithm, which compares a request consisting of a prioritized list of language tags with a set of available languages. It takes the language tags of the request in sequence and checks for each one whether it can be matched with one of the available languages either directly or with a fallback, where the fallback simply strips off subtags from the end, e.g., from "zh-Hans-SG" to "zh-Hans". The first match wins.
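To make the fallback concrete, here is a minimal sketch of the Lookup algorithm in plain JavaScript. It is an illustration rather than the text of the specification: the function name lookup is hypothetical, tags are compared case-insensitively, and there is no default value when nothing matches.

```javascript
// Sketch of the BCP 47 (RFC 4647) Lookup algorithm.
// requested: prioritized array of language tags
// available: array of language tags supported by the implementation
// Returns the first matching available tag, or undefined if none matches.
function lookup(requested, available) {
  var lowered = available.map(function (tag) {
    return tag.toLowerCase();
  });
  for (var i = 0; i < requested.length; i++) {
    var candidate = requested[i].toLowerCase();
    while (candidate.length > 0) {
      var index = lowered.indexOf(candidate);
      if (index !== -1) {
        return available[index];
      }
      // Fallback: strip the last subtag.
      var pos = candidate.lastIndexOf("-");
      if (pos === -1) {
        break;
      }
      candidate = candidate.substring(0, pos);
      // If a single-character subtag is now at the end, strip it as well.
      if (pos >= 2 && candidate.charAt(pos - 2) === "-") {
        candidate = candidate.substring(0, pos - 2);
      }
    }
  }
  return undefined;
}

var best = lookup(["zh-Hans-SG", "de"], ["de", "zh-Hans", "en"]);
// best is "zh-Hans": the first request matches after one fallback step
```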
As mentioned earlier, the Unicode extension of BCP 47 allows a number of parameters to be set as part of a language tag. For example, the use of the euro with a simplified Chinese currency format can be specified as "zh-Hans-u-cu-eur", the use of the sort order for German phone books as "de-u-co-phonebk". Multiple parameters can be combined in one language tag, e.g., "de-u-co-phonebk-cu-usd" for German with phone book sorting and the U.S. dollar as currency.
The Unicode extension creates two issues:
- The simple fallback used by the Lookup algorithm doesn’t work anymore because the parameter settings are largely independent of each other, not specializations of each other. The parameters have to be interpreted separately from the rest of the language tag.
- Some of the parameters don’t really have anything to do with a language or locale; they’re orthogonal and applications should be able to fully control them (a language tag typically is a user setting). Currencies in particular depend on business requirements and should never be derived from a locale.
The algorithms involved in locale and parameter negotiation solve these issues in two ways:
- Unicode extension subtag sequences are separated from the rest of a language tag. The Lookup algorithm is then applied to the remaining language tag to determine the language to be used, and the parameters set in the Unicode extension subtag sequence are negotiated separately.
- The API distinguishes between three groups of parameters: those that are related to the locale and are always derived from the language tag, those that should be fully under application control and are solely obtained from the options object, and those that can be derived from the language tag, but also overridden by the application. Tables in the sections on Collator, NumberFormat, and DateTimeFormat below show for each parameter how it can be set.
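The two steps above can be sketched in plain JavaScript. This is an illustration only, not the algorithm text of the draft: the function names splitUnicodeExtension and resolveCollatorParams are hypothetical, the default values are assumptions made for the example, and attributes of the Unicode extension are ignored. The second function picks one collator parameter from each of the three groups.

```javascript
// Separates the Unicode extension subtag sequence from the rest of a tag.
// Returns {tag: remaining language tag, params: {key: value, ...}}.
function splitUnicodeExtension(tag) {
  var subtags = tag.split("-");
  var rest = [];
  var params = {};
  var i = 0;
  while (i < subtags.length) {
    var subtag = subtags[i].toLowerCase();
    if (subtag === "x") {
      // Private use sequence: keep everything from here on untouched.
      rest = rest.concat(subtags.slice(i));
      break;
    }
    if (subtag === "u" && i > 0) {
      // The extension runs until the next single-character subtag.
      var key = null;
      i++;
      while (i < subtags.length && subtags[i].length > 1) {
        var part = subtags[i].toLowerCase();
        if (part.length === 2) {
          key = part;            // two-character subtags are keys
          params[key] = "";
        } else if (key !== null) {
          params[key] = params[key] === "" ? part : params[key] + "-" + part;
        }
        i++;
      }
    } else {
      rest.push(subtags[i]);
      i++;
    }
  }
  return {tag: rest.join("-"), params: params};
}

// Resolves three collator parameters; defaults are assumptions.
function resolveCollatorParams(locale, options) {
  options = options || {};
  var split = splitUnicodeExtension(locale);
  var co = split.params.co;
  var kn = split.params.kn;
  var numericFromTag = kn === undefined ? false : kn !== "false";
  return {
    locale: split.tag,
    // Options only: never taken from the language tag.
    usage: options.usage !== undefined ? options.usage : "sort",
    // Language tag only; "standard" and "search" are ignored there.
    collation: co && co !== "standard" && co !== "search" ? co : "default",
    // Language tag, but the application can override it through options.
    numeric: options.numeric !== undefined ? options.numeric : numericFromTag
  };
}

var resolved = resolveCollatorParams("de-u-co-phonebk-kn-true",
                                     {usage: "search", numeric: false});
// resolved.locale is "de", resolved.usage is "search",
// resolved.collation is "phonebk", resolved.numeric is false
```

The remaining tag ("de" in the example) is what would then be passed to the Lookup algorithm.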
The need to decide for each key in the Unicode extension how it should be treated in the API unfortunately means that the specification cannot allow implementations to support newly added keys.
The options object can contain not only properties corresponding to some of the keys of the Unicode extension, but also properties for other parameters that let the application control the behavior of the constructed object. Most parameters have default values (either provided by the specification or locale and implementation dependent), so that for the most common use cases the options object can be omitted.
Applications can get information about the results of locale and parameter negotiation through the resolvedOptions accessor property, which has properties for all parameters. The locale property contains a language tag with the locale that was selected among the implementation’s available locales, plus those Unicode extension parameters that were requested and are supported by the implementation. Applications can also use the supportedLocalesOf functions to determine which subset of a list of locales is supported by an implementation, possibly through fallbacks. There’s no function to get a list of all locales supported by an implementation as this list could be huge.
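As a rough illustration of querying the negotiation result, the following sketch uses the Intl object of the later ECMAScript Internationalization API as a stand-in for the draft’s Globalization object. That substitution is an assumption made so that the code can run today; note that in Intl, resolvedOptions is a method rather than an accessor property. The actual values depend on the implementation’s available locales.

```javascript
// Stand-in: Intl.NumberFormat instead of Globalization.NumberFormat.
var nf = new Intl.NumberFormat(["en-US"], {style: "percent"});

// Query what the negotiation produced.
var resolved = nf.resolvedOptions();
var selectedLocale = resolved.locale;           // a BCP 47 language tag
var numberingSystem = resolved.numberingSystem; // e.g. the default for the locale
var style = resolved.style;                     // "percent", as requested

// Ask which of the requested locales are supported, possibly via fallback.
var supported = Intl.NumberFormat.supportedLocalesOf(["en-US", "zz-ZZ"]);
```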
Collation
Collator objects support two usage scenarios: Sorting the strings in a list, and searching for matching strings in a set of strings. Sorting generally needs to be sensitive to minor differences in strings, such as diacritical marks or casing, so that it is clear whether pêche sorts before or after péché. In searching, on the other hand, such minor differences are often ignored. (Some languages, however, treat certain characters with diacritical marks as separate characters, thus considering the marks major differences.) For some languages, collation may define several sort orders, such as the dictionary and phone book sort orders for German, which differ in their handling of umlauted characters. In some applications, punctuation should be ignored in comparison.
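The difference between the two scenarios can be tried out with the following sketch. It assumes the Intl.Collator constructor of the later ECMAScript Internationalization API as a stand-in for the draft’s Collator, with the usage and sensitivity options behaving as described here; the word list is made up for the example.

```javascript
// Stand-in: Intl.Collator instead of Globalization.Collator.
// Sorting: all differences matter, including accents and case.
var sortCollator = new Intl.Collator("en", {usage: "sort", sensitivity: "variant"});
// Searching: only base letters matter; accents and case are ignored.
var searchCollator = new Intl.Collator("en", {usage: "search", sensitivity: "base"});

var words = ["péché", "Peche", "pêche", "poche"];

// The sort collator distinguishes the two accented words.
var distinct = sortCollator.compare("pêche", "péché") !== 0; // true

// The search collator treats everything spelled p-e-c-h-e as equal.
var matches = words.filter(function (w) {
  return searchCollator.compare(w, "peche") === 0;
});
// matches contains "péché", "Peche", and "pêche", but not "poche"
```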
The Unicode extension defines a number of additional parameters. Some of these parameters require special handling:
- The "co" key can be used in language tags to select both usage and specific collations. In the Globalization API, these are separated: Usage is specified through the options object, so the values "standard" and "search" are ignored if used in language tags.
- Implementations are not required to support the parameters backwards, caseLevel, numeric, hiraganaQuaternary, normalization, and caseFirst. If they don’t, they still have to check the values of their options properties so that erroneous values result in the same exceptions on all implementations, but their resolvedOptions properties do not report back the settings of unsupported parameters.
- The variableTop parameter defined in the Unicode extension is not supported by Collator.
Collator uses the following parameters:
Parameter | Language Tag | Options | Values |
---|---|---|---|
locale | ✔ | — | BCP 47 language tag |
usage (collation) | co — | ✔ | "sort", "search" |
sensitivity (strength) | ks — | ✔ | "base", "accent", "case", "variant" |
ignorePunctuation (alternate handling) | ka — | ✔ | true, false |
collation | co ✔ | — | see UTS 35, except "standard", "search" |
backwards | kb ✔ | ✔ | true, false |
caseLevel | kc ✔ | ✔ | true, false |
numeric | kn ✔ | ✔ | true, false |
hiraganaQuaternary | kh ✔ | ✔ | true, false |
normalization | kk ✔ | ✔ | true, false |
caseFirst | kf ✔ | ✔ | "upper", "lower", "false" |
Usage Examples
All examples assume the shortened constructor name:
var Coll = Globalization.Collator;
Sort array a according to the rules of the host environment’s current locale:
var collator = new Coll();
a.sort(function (x, y) {
  return collator.compare(x, y);
});
Sort array a according to the rules for German phone books:
var collator = new Coll(["de-u-co-phonebk"]);
a.sort(function (x, y) {
  return collator.compare(x, y);
});
Extract those elements from array a that are similar to string s according to the rules of language lang:
var collator = new Coll([lang], {usage: "search"});
var matches = a.filter(function (v) {
  return collator.compare(v, s) === 0;
});
Number Formatting
NumberFormat objects support three format styles: plain decimal formatting, currency formatting, and percent formatting. For currency formatting the currency must be specified by the application because currency use depends on business requirements that NumberFormat cannot know about.
Different numbering systems can be used, depending on implementation and locale: Western (“ASCII”) digits, (real) Arabic digits, Thai digits, Roman numerals, and more. The number of digits used in representing a number can be constrained, subject to the same limitations as for Number.prototype.toFixed and Number.prototype.toPrecision; defaults depend on the format style and, in the case of currency formatting, the currency being used. The use of grouping separators can be disabled.
NumberFormat uses the following parameters:
Parameter | Language Tag | Options | Values |
---|---|---|---|
locale | ✔ | — | BCP 47 language tag |
numberingSystem | nu ✔ | — | see UTS 35 |
style | — | ✔ | "decimal", "currency", "percent" |
currency | cu — | ✔ | ISO 4217 alphabetic code |
currencyDisplay | — | ✔ | true, false |
minimumIntegerDigits | — | ✔ | 0..21 |
minimumFractionDigits | — | ✔ | 0..20 |
maximumFractionDigits | — | ✔ | 0..20 |
minimumSignificantDigits | — | ✔ | 1..21 |
maximumSignificantDigits | — | ✔ | 1..21 |
useGrouping | — | ✔ | true, false |
Usage Examples
All examples assume the shortened constructor name:
var NF = Globalization.NumberFormat;
Format number n in decimal style with grouping separators according to the rules of the host environment’s current locale:
var format = new NF();
var result = format.format(n);
Format number n in decimal style with grouping separators according to the rules of language lang:
var format = new NF([lang]);
var result = format.format(n);
Format number n in currency style with grouping separators, with the localized currency symbol for Korean won, but with no fraction digits (the default for Korean won), according to the rules of language lang:
var format = new NF([lang], {style: "currency", currency: "KRW"});
var result = format.format(n);
Format number n in percent style with grouping separators, with at least 4 significant digits, according to Thai rules and with Thai digits:
var format = new NF(["th-u-nu-thai"], {style: "percent", minimumSignificantDigits: 4});
var result = format.format(n);
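The digit constraints, the currency-dependent defaults, and the grouping switch can be checked with concrete numbers. The sketch below assumes the Intl.NumberFormat constructor of the later ECMAScript Internationalization API as a stand-in for the draft’s NumberFormat, and uses U.S. English so that the separators are predictable; the sample values are made up.

```javascript
// Stand-in: Intl.NumberFormat instead of Globalization.NumberFormat.
// Grouping separators disabled.
var plain = new Intl.NumberFormat("en-US", {useGrouping: false});
var ungrouped = plain.format(1234567.891); // "1234567.891"

// Exactly two fraction digits.
var fixed = new Intl.NumberFormat("en-US", {
  minimumFractionDigits: 2,
  maximumFractionDigits: 2
});
var padded = fixed.format(1234.5); // "1,234.50"

// Default fraction digits follow the currency's minor unit.
var usd = new Intl.NumberFormat("en-US", {style: "currency", currency: "USD"});
var jpy = new Intl.NumberFormat("en-US", {style: "currency", currency: "JPY"});
var usdDigits = usd.resolvedOptions().maximumFractionDigits; // 2
var jpyDigits = jpy.resolvedOptions().maximumFractionDigits; // 0
```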
Date and Time Formatting
DateTimeFormat objects format a time value into a string using a subset of the following date and time components: weekday, era, year, month, day, hour, minute, second, and time zone name. Different representations of these components are available: unconstrained or 2-digit numeric; narrow, short, or long text. Implementations are required to support at least the following subsets:
- weekday, year, month, day, hour, minute, second
- weekday, year, month, day
- year, month, day
- year, month
- month, day
- hour, minute, second
- hour, minute
Implementations may support other subsets, and requests will be negotiated against all available subset-representation combinations to find the best match. Different numbering systems can be used, as with NumberFormat. Hour representations can be forced from a locale-dependent default to 12-hour or 24-hour format. Implementations may support multiple calendars per locale; for time zones, they’re limited to UTC and the host environment’s current time zone.
DateTimeFormat uses the following parameters:
Parameter | Language Tag | Options | Values |
---|---|---|---|
locale | ✔ | — | BCP 47 language tag |
calendar | ca ✔ | — | see UTS 35 |
numberingSystem | nu ✔ | — | see UTS 35 |
timeZone | tz — | ✔ | "UTC" |
hour12 | — | ✔ | true, false |
weekday | — | ✔ | "narrow", "short", "long" |
era | — | ✔ | "narrow", "short", "long" |
year | — | ✔ | "2-digit", "numeric" |
month | — | ✔ | "2-digit", "numeric", "narrow", "short", "long" |
day | — | ✔ | "2-digit", "numeric" |
hour | — | ✔ | "2-digit", "numeric" |
minute | — | ✔ | "2-digit", "numeric" |
second | — | ✔ | "2-digit", "numeric" |
timeZoneName | — | ✔ | "short", "long" |
Usage Examples
All examples assume the shortened constructor name:
var DTF = Globalization.DateTimeFormat;
Format the current date with year, month, and day components in numeric format for the host environment’s current locale and time zone:
var format = new DTF();
var result = format.format();
Format time t with weekday and month in long format and year and day in numeric format, for the host environment’s current time zone according to Thai conventions with the Thai Buddhist calendar and Thai digits:
var format = new DTF(["th-u-ca-buddhist-nu-thai"], {weekday: "long", year: "numeric", month: "long", day: "numeric"});
var result = format.format(t);
Format the current time with hour, minute, and second components in 2-digit format and 24-hour time for the UTC time zone and the host environment’s current locale:
var format = new DTF(undefined, {hour: "2-digit", minute: "2-digit", second: "2-digit", hour12: false, timeZone: "UTC"});
var result = format.format();
Value Errors
The Globalization API accepts strings as specifications of locales and options, and needs the ability to report invalid values. Edition 5.1 of the ECMAScript Language Specification doesn’t offer error objects for this situation – SyntaxError objects are intended for errors in programming language source text, RangeError objects for numeric values, and TypeError objects for errors in the type of values or missing properties. The Globalization API specification therefore proposes a new error constructor ValueError, with the expectation that it will eventually migrate into the language specification.
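For illustration, an error constructor of this kind could look like the following in ECMAScript 5 code. This is a hypothetical sketch, not the definition proposed in the draft; the checkCurrency function and its pattern check are likewise made up to show how such an error might be thrown and caught.

```javascript
// Hypothetical ValueError constructor, written in ECMAScript 5 style.
function ValueError(message) {
  if (!(this instanceof ValueError)) {
    return new ValueError(message);
  }
  this.message = message === undefined ? "" : String(message);
}
ValueError.prototype = Object.create(Error.prototype, {
  constructor: {value: ValueError, writable: true, configurable: true},
  name: {value: "ValueError", writable: true, configurable: true}
});

// Made-up example: reject strings that can't be alphabetic currency codes.
function checkCurrency(code) {
  if (typeof code !== "string" || !/^[A-Za-z]{3}$/.test(code)) {
    throw new ValueError("Invalid currency code: " + code);
  }
  return code.toUpperCase();
}

var caught = null;
try {
  checkCurrency("EURO");
} catch (e) {
  caught = e; // a ValueError, which is also an Error
}
```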