Norbert’s Corner

Supplementary Characters for ECMAScript

March 15, 2012

ECMAScript, the standard underlying JavaScript, has so far been ambivalent about supporting supplementary characters, those Unicode characters that are outside the 16-bit code space originally envisioned for Unicode and require more than 16 bits for their encoding. The ECMAScript standard specifies UTF-16, the Unicode encoding form designed to extend the original 16-bit encoding to support supplementary characters, as the encoding of source text and strings, and implementations generally allow supplementary characters to be used. Browsers have surrounded their ECMAScript implementations with text input, text rendering, DOM APIs, and XMLHttpRequest that all offer full Unicode support, and generally use full UTF-16 to exchange text with their ECMAScript subsystems. Developers have used this to build applications that support supplementary characters. However, some text processing functionality in ECMAScript itself is defined to operate on individual code units and therefore cannot correctly interpret supplementary characters.

This article proposes a set of ECMAScript specification changes that enable correct processing of supplementary characters while maintaining compatibility with existing applications. The basic idea is to keep the existing text representation, but change operations that interpret text to be based on Unicode code points. Supplementary characters are then interpreted as atomic entities with their correct Unicode semantics, just like characters in the Basic Multilingual Plane. This is similar to the design of supplementary character support in Java. The changes are presented in order of priority from an application development point of view.

All changes described are relative to the ECMAScript Language Specification, 5.1 edition.

Terms and Definitions

The ECMAScript Language Specification currently uses the term “character” throughout, but redefines it to actually mean “code unit”. In order to clearly describe a system that is based on 16-bit code units, but supports all Unicode characters, we need to be a bit more precise in our terminology. This proposal relies on the following three definitions, which would replace the definition of “String value” in section 4.3.16, and would largely remove the need to use the overloaded term “character”.

4.3.16 String value: primitive value that is a finite ordered sequence of zero or more code units

NOTE: A String value is a member of the String type. Where ECMAScript operations interpret code units, they are interpreted as UTF-16 code units. However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so it may be ill-formed when interpreted as a UTF-16 code unit sequence.

4.3.17 Code unit: unsigned 16-bit integer. All values from 0 to FFFF₁₆ are allowed. Code unit values are described by the string “0x” followed by four hexadecimal digits.

4.3.18 Code point: unsigned integer value in the range from 0 to 10FFFF₁₆. Code point values are described by the string “U+” followed by four to six hexadecimal digits.

NOTE: This definition follows that of the Unicode standard, so this specification may use “code point” and “Unicode code point” interchangeably.

NOTE: This specification may use the word “character” to refer to specific assigned code points in the Basic Multilingual Plane, where the distinction between code units and code points is irrelevant.

Text Interpretation

In order to support supplementary characters, some ECMAScript operations will have to interpret UTF-16 code unit sequences as code point sequences. This new section describes how. For compatibility with existing applications, it has to allow surrogate code points (code points between U+D800 and U+DFFF, which can never represent characters).

5.3 Text Interpretation

Text is represented in ECMAScript as a sequence of UTF-16 code units, but sometimes interpreted as a sequence of code points. Interpretation is as follows: A code unit in the range 0xD800 to 0xDBFF that is immediately followed by a code unit in the range 0xDC00 to 0xDFFF forms a surrogate pair and is interpreted together with that following code unit as the single code point that the pair represents in UTF-16. Any other code unit, including an unpaired surrogate, is interpreted as the code point with the same value.
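The following non-normative sketch illustrates this interpretation; the helper name toCodePoints is purely illustrative and not part of the proposal:

function toCodePoints(str) {
    var codePoints = [];
    for (var i = 0; i < str.length; i++) {
        var first = str.charCodeAt(i);
        if (first >= 0xD800 && first <= 0xDBFF && i + 1 < str.length) {
            var second = str.charCodeAt(i + 1);
            if (second >= 0xDC00 && second <= 0xDFFF) {
                // A surrogate pair contributes a single supplementary code point.
                codePoints.push(((first - 0xD800) << 10) + (second - 0xDC00) + 0x10000);
                i++;
                continue;
            }
        }
        // BMP code units and unpaired surrogates map to code points of the same value.
        codePoints.push(first);
    }
    return codePoints;
}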

Regular Expressions

Regular expressions are the most important part of ECMAScript that needs to support supplementary characters. This section updates subclause 15.10 of the ECMAScript Language Specification, to interpret both patterns and the text to be matched as code points. It relies on the definitions and text interpretation provided above.

Benefits of this proposal include:

Some libraries, such as this library to convert some regular expressions with supplementary characters and this function to detect the directionality of text, have used unpaired surrogate code units to work around the current inability of ECMAScript regular expressions to support supplementary characters. For compatibility with such libraries and the applications using them, the proposal includes a transformation of such workaround regular expressions into equivalent regular expressions using well-formed supplementary characters. In the following, \uwwww and \uxxxx stand for high surrogates, and \uyyyy and \uzzzz for low surrogates:

This transformation is rather ugly, but I’m afraid it’s the price ECMAScript has to pay for being 12 years late in supporting supplementary characters. It may be possible to omit this step for regular expression literals or RegExp constructor calls that occur within modules.

Interpreting patterns and input text as sequences of code points may have a compatibility impact, which we’ll have to monitor and evaluate. For example, some applications might have processed gunk with regular expressions where neither the “characters” in the patterns nor the input to be matched are text. Others might be surprised that s.match(/^.$/)[0].length can now be 2. If this turns out to be a problem for existing applications, we may have to offer regular expressions in separate code unit and code point modes, with the latter enabled via a "u" flag.
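As a hedged illustration of this change in behaviour, assuming the proposed code point based matching (or, if it comes to that, the “u” flag):

var s = "\uD842\uDFB7";  // the single supplementary character U+20BB7
s.match(/^.$/);          // code unit semantics (today): null, because "." consumes only the high surrogate
                         // code point semantics (proposed): a match whose [0] is the whole character, length 2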

Details

Make the following substitutions throughout sections 15.10 and A7:

15.10 RegExp (Regular Expression) Objects

Before the period at the end of the first sentence, insert: “, after preprocessing the pattern as described in 15.10.4.1. When matching against the SourceCodePoint non-terminal, it interprets the pattern String as a sequence of code points, as described in 5.3”.

15.10.1 Patterns

At the end of 15.10.1, insert:

SourceCodePoint ::

any code point

15.10.2.1 Notation

Replace the first two bullet items in the first list with:

Replace the first bullet item in the second list with:

In the second bullet item in the same list, replace “the index of the last input character” with “the index of the last code unit of the last input code point”.

In the last bullet item in the same list, replace “a character or” with “a code point or”, and “a character ch means that the escape sequence is interpreted as the character ch” with “a code point cp means that the escape sequence is interpreted as the code point cp”.

15.10.2.6 Assertion

Replace “character Input[e–1]” with “the result of calling CodePointEndingAt with argument e-1”.

Replace “character Input[e]” with “code point Input{e}”.

Replace both occurrences of “IsWordChar(e–1)” with “IsWordChar(e, true)”, and both occurrences of “IsWordChar(e)” with “IsWordChar(e, false)”.

In the description of IsWordChar, after “an integer parameter e” insert “and a boolean parameter before”.

Replace the first two steps of the IsWordChar algorithm with:

Add the following abstract operation:

The abstract operation CodePointEndingAt takes an integer parameter e, which must be non-negative and less than InputLength, and performs the following:
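A rough, non-normative sketch of the intended behaviour, assuming CodePointEndingAt(e) returns the code point whose last code unit is Input[e] (the helper below is illustrative only):

function codePointEndingAt(input, e) {
    var unit = input.charCodeAt(e);
    // If Input[e] is a low surrogate preceded by a high surrogate, the code
    // point ending at e is the supplementary code point of the pair.
    if (unit >= 0xDC00 && unit <= 0xDFFF && e > 0) {
        var previous = input.charCodeAt(e - 1);
        if (previous >= 0xD800 && previous <= 0xDBFF) {
            return ((previous - 0xD800) << 10) + (unit - 0xDC00) + 0x10000;
        }
    }
    // Otherwise the code point is simply the value of the code unit at e.
    return unit;
}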

15.10.2.8 Atom

Replace the first two steps of the algorithm for Atom :: PatternCharacter with:

Replace the first step of the algorithm for Atom :: . with:

In step 3.1.4 of the algorithm for Atom :: ( Disjunction ), replace “characters are the characters” with “code units are the code units”.

Replace steps 1.3 and 1.4 of the algorithm for CodePointSetMatcher with:

Replace step 1.8 of the same algorithm with:

Replace the abstract operation Canonicalize with:

The abstract operation Canonicalize takes a code point parameter cp and performs the following steps:

In the note at the end of the section, replace “\u00DF” with “U+00DF”, “\u0131” with “"ı" (U+0131)”, “\u017F” with “"ſ" (U+017F)”, both occurrences of “ASCII character” with “code point in the Basic Latin block”, “ASCII letters” with “letters in the Basic Latin block”, and all remaining occurrences of “character” with “code point”.

15.10.2.9 AtomEscape

In the algorithm for AtomEscape :: DecimalEscape, replace step 2 with:

In step 5.8 of the same algorithm, replace “Canonicalize(s[i]) is not the same character as Canonicalize(Input [e+i])” with “Canonicalize(s{i}) is not the same code point as Canonicalize(Input{e+i})”.

Replace the first two steps of the algorithm for AtomEscape :: CodePointEscape with:

15.10.2.10 CodePointEscape

In the first paragraph, replace “character” with “code point”. In the following table, replace “Code Unit” with “Code Point”, and replace all occurrences of “\u” with “U+”.

Replace the algorithm for CodePointEscape :: c ControlLetter with:

In the remaining three paragraphs, replace “character” with “code point”.

15.10.2.11 DecimalEscape

In step 2 of the algorithm, replace “a <NUL> character (Unicode value 0000)” with “the code point U+0000 (NULL)”.

15.10.2.12 CharacterClassEscape

Replace all occurrences of “characters” with “code points”.

15.10.2.15 NonemptyClassRanges

In step 4 of the algorithm for NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges, replace “CharacterRange” with “CodePointRange”.

Replace the description of the abstract operation CharacterRange with:

The abstract operation CodePointRange takes two CodePointSet parameters A and B and performs the following:

In note 1, 2, and 3, replace all occurrences of “character” with “code point”.

In note 2, replace “ASCII letters” with “letters in the Basic Latin block”.

15.10.2.17 ClassAtom

Replace “character” with “code point”.

15.10.2.18 ClassAtomNoDash

Replace “character” with “code point”.

15.10.2.19 ClassEscape

Replace steps 2-4 of the algorithm for ClassEscape :: DecimalEscape with:

Replace “the one character <BS> (Unicode value 0008)” with “the one code point U+0008 (BACKSPACE)”.

Replace all remaining occurrences of “character” with “code point”.

15.10.4.1 new RegExp(pattern, flags)

After the first paragraph of section 15.10.4.1, insert the description of two preprocessing steps. The first step removes the need for subsequent processing to worry about Unicode escape sequences while looking for supplementary code points or surrogate code units. The second step provides compatibility with the workarounds that some libraries have used to support supplementary characters in regular expressions.

P is preprocessed as follows:

In the following paragraph, replace “the characters of P do” with “P interpreted as a code point sequence does”, and “the characters of P as” with “P interpreted as a code point sequence as”.

Other Text Processing Functions

As a general principle, all functions in ECMAScript that interpret strings have to recognize supplementary characters as atomic entities and interpret them according to their Unicode semantics. As it turns out, beyond the functions using regular expressions there aren’t many in ECMAScript 5.1 that violate this principle: the String case conversion functions (toLowerCase, toLocaleLowerCase, toUpperCase, toLocaleUpperCase) and the relational comparison for strings (11.8.5).

For the relational comparison for strings, stability and compatibility may matter more than semantic correctness – to sort strings for display to the user, you want to use the Collator objects of the upcoming ECMAScript Internationalization API anyway. I’m therefore not proposing to change this operation.

Changing the case conversion functions is theoretically also an incompatible change. However, the only affected code points are those in the Deseret block of Unicode, and it’s very unlikely that applications would depend on Deseret characters not being mapped to lower case while calling toLowerCase. Note that Safari is already following the Unicode standard rather than the ECMAScript standard in implementing these functions.
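For example (assuming the default Unicode case mappings for the Deseret block, where the capital letters U+10400–U+10427 map to the small letters U+10428–U+1044F):

// DESERET CAPITAL LETTER LONG I (U+10400), written as a surrogate pair
"\uD801\uDC00".toLowerCase();
// current code unit semantics: "\uD801\uDC00" (unchanged)
// proposed code point semantics: "\uD801\uDC28", i.e. DESERET SMALL LETTER LONG I (U+10428)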

Details

15.5.4.16 String.prototype.toLowerCase ( )

Before the first sentence, insert: “This method interprets a string as a sequence of code points, as described in 5.3.”

Replace step 3 of the algorithm with:

Remove the paragraph following the algorithm.

Remove “in Unicode 2.1.8 and later”.

In the first sentence of Note 1, replace both occurrences of “characters” with “code points”.

Upgrade to Unicode with Supplementary Characters

The previous sections were narrowly focused on functionality needed to process end user text. This section is the omnibus proposal for upgrading ECMAScript to a Unicode version released in this century, adding supplementary character support to identifiers, and removing the redefinition of characters as code units.

The baseline Unicode version is changed to 5.1, which was released in April 2008 and is the version supported by Windows 7 and the .NET Framework 4, probably the oldest platforms on which somebody might implement ECMAScript 6 (implementations on older platforms would have to bring their own tables for identifier characters and case conversion). Unicode 5.1 includes supplementary characters, so all restrictions to the Basic Multilingual Plane are removed, along with the one and only reference to UCS-2 in the specification.

Identifiers are specified differently in 5.1 than in previous Unicode versions; the proposed definition of ECMAScript identifiers aligns more closely with the Unicode specification, and allows supplementary characters in identifiers. Unicode escapes in identifiers are handled as a separate preprocessing step to avoid additional complexity in the IdentifierName grammar.

The ECMAScript Language Specification, 5.1 edition, redefines characters as code units, creating endless confusion. This proposal removes most uses of “character” and uses the appropriate choice of “code unit” or “code point” instead.

In order to maintain compatibility with existing ECMAScript, the proposal does not change the representation of source text or String values. There’s more on this topic in the What About UTF-32? section below. Also for compatibility, ill-formed UTF-16 code sequences and surrogate code points are allowed.

Details

Global Substitutions

Replace any occurrence of the word “character” with “code unit”, and any occurrence of the word “characters” with “code units”, throughout the document, except in the following situations:

Make the following substitutions throughout the document except section 15.10:

Replace any occurrence of “\u” within a table under the column heading “code unit value” with “0x”. “\u” is a notation for code unit values in a few places in source text; this shouldn’t be confused with code unit values themselves.

2 Conformance

Replace the second paragraph of clause 2, Conformance, with:

A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 5.1.0 or later and ISO/IEC 10646 with UTF-16 as the adopted encoding form. If the adopted ISO/IEC 10646 subset is not otherwise specified, it is presumed to be the Unicode set, collection 10646.

3 Normative References

Replace the second entry of clause 3, Normative references, with references to Unicode 5.1 and the corresponding ISO 10646 version.

ISO/IEC 10646:2003: Information Technology – Universal Multiple-Octet Coded Character Set (UCS) plus Amendment 1:2005, Amendment 2:2006, Amendment 3:2008, and Amendment 4:2008, plus additional amendments and corrigenda, or successor

The Unicode Standard, Version 5.0, as amended by Unicode 5.1.0, or successor

Unicode Standard Annex #15, Unicode Normalization Forms, version Unicode 5.1.0, or successor

Unicode Standard Annex #31, Unicode Identifiers and Pattern Syntax, version Unicode 5.1.0, or successor.

5.1.2 The Lexical and RegExp Grammars

In the first paragraph, replace “characters (Unicode code units)” with “code units”.

5.1.6 Grammar Notation

Replace the last sentence of the first paragraph with:

All terminal symbol characters specified in this way are to be understood as the UTF-16 code units for the appropriate Unicode characters from the Basic Latin block, as opposed to any similar-looking characters from other Unicode blocks.

In the last paragraph, replace “any Unicode code unit” with “any code unit”.

6 Source Text

Replace clause 6, Source Text, with:

ECMAScript source text is assumed to be a (not necessarily well-formed) sequence of UTF-16 code units for the purposes of this specification. Source text in character encodings other than UTF-16 must be processed as if it were first converted to UTF-16. The text is expected to have been normalised to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15. Conforming ECMAScript implementations are not required to perform any normalisation of text, or behave as though they were performing normalisation of text, themselves. The UTF-16 code unit sequence is interpreted as Unicode code points, as described in 5.3, and according to the character properties defined by the Unicode Standard, version 5.1.0 or later.

Syntax

SourceCodeUnit ::

any code unit

In string literals, regular expression literals, and identifiers, any code unit may also be expressed as a Unicode escape sequence consisting of six characters, namely \u plus four hexadecimal digits. Within a string literal or regular expression literal, the Unicode escape sequence contributes one code unit to the value of the literal. Within an identifier, the escape sequence contributes one code unit to the identifier.

NOTE: ECMAScript differs from the Java programming language in the behaviour of Unicode escape sequences. In a Java program, if the Unicode escape sequence \u000A, for example, occurs within a single-line comment, it is interpreted as a line terminator (Unicode character U+000A is line feed) and therefore the next character is not part of the comment. Similarly, if the Unicode escape sequence \u000A occurs within a string literal in a Java program, it is likewise interpreted as a line terminator, which is not allowed within a string literal—one must write \n instead of \u000A to cause a line feed to be part of the string value of a string literal. In an ECMAScript program, a Unicode escape sequence occurring within a comment is never interpreted and therefore cannot contribute to termination of the comment. Similarly, a Unicode escape sequence occurring within a string literal in an ECMAScript program always contributes a character to the String value of the literal and is never interpreted as a line terminator or as a quote mark that might terminate the string literal.

While parsing ECMAScript source code, code unit sequences are commonly interpreted as code point sequences, e.g., to determine whether a code unit sequence can form an identifier. The interpretation is as described in section 5.3.

7.2 White Space

Replace “Unicode 3.0” with “Unicode 5.1.0”.

7.6 Identifier Names and Identifiers

Replace the first three paragraphs of 7.6 with:

Identifier Names are tokens that are interpreted according to the Default Identifier Syntax given in Unicode Standard Annex #31, Identifier and Pattern Syntax, with some small modifications. The Unicode identifier grammar is based on character properties specified by the Unicode Standard. The code points in the specified categories in version 5.1.0 of the Unicode standard must be treated as in those categories by all conforming ECMAScript implementations.

This standard specifies specific character additions: The dollar sign (U+0024) and the underscore (U+005F) are permitted anywhere in an IdentifierName, and the characters zero width non-joiner (U+200C) and zero width joiner (U+200D) can be used after the first code point.

While scanning identifiers, any Unicode escape sequences consisting of the character “\” followed by UnicodeEscapeSequence in the source text are replaced by the CV of the UnicodeEscapeSequence (see 7.8.4). The resulting code unit sequence is then interpreted as a sequence of code points, as described in 5.3.

An Identifier is an IdentifierName that is not a ReservedWord (see 7.6.1).

Replace the fifth paragraph of 7.6 with:

ECMAScript implementations may recognise identifier code points defined in later editions of the Unicode Standard. If portability is a concern, programmers should only employ identifier code points defined in Unicode 5.1.0.
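As a non-normative illustration of the effect of these changes (assuming, per Unicode 5.1, that CJK ideographs such as U+20BB7 have the ID_Start property), identifiers may now begin with and contain supplementary characters:

// Legal under the proposed identifier syntax; rejected by ECMAScript 5.1,
// whose identifier grammar cannot classify surrogate pairs as letters.
var 𠮷野家 = "Yoshinoya";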

Replace the productions starting with the one for IdentifierStart with:

IdentifierStart ::

UnicodeIDStart

$

_

 

IdentifierPart ::

UnicodeIDContinue

$

_

<ZWNJ>

<ZWJ>

 

UnicodeIDStart ::

any code point with the Unicode property “ID_Start”

 

UnicodeIDContinue ::

any code point with the Unicode property “ID_Continue”

7.8.3 Numeric Literals

Replace the sentence starting with “The source character immediately following...” with “The source code unit sequence immediately following a NumericLiteral must not start with an IdentifierStart or DecimalDigit.”

7.8.4 String Literals

Replace the first paragraph with:

A string literal is zero or more code units enclosed in the code units representing single (0x0027) or double (0x0022) quotes. Each code unit may be represented by an escape sequence. All code units may appear literally in a string literal except for the code units for the closing quote character, backslash (0x005C), carriage return (0x000D), line feed (0x000A), line separator (0x2028), and paragraph separator (0x2029). Any code unit may appear in the form of an escape sequence.

Replace “is a <NUL> character (Unicode value 0000)” with “is the code unit 0x0000, which represents the character NULL”.

Replace “is the character whose code unit value” with “is the code unit whose value”.

8.4 The String Type

In the first paragraph, remove “(see Clause 6)”. There’s nothing relevant to see there.

Replace the second paragraph with:

Where ECMAScript operations interpret String contents, each element is interpreted as a single UTF-16 code unit. However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so they may be ill-formed when interpreted as UTF-16 code unit sequences. Operations that do not interpret String contents treat them as sequences of undifferentiated 16-bit unsigned integers. No operations ensure that Strings are in normalized form. Only operations that are explicitly specified to be language or locale sensitive produce language-sensitive results.

11.8.5 The Abstract Relational Comparison Algorithm

Replace both occurrences of “integer that is the code unit value for the character” with “value of the code unit”. This leaves the algorithm using code unit semantics for compatibility with older versions of ECMAScript. For this kind of algorithm, stability is probably more important than semantic correctness.

15.1.3 URI Handling Function Properties

Replace “in Surrogates, section 3.7, of the Unicode Standard” with “in the description of UTF-16 in section 3.9, Unicode Encoding Forms, of the Unicode Standard”.

15.5.4.5 String.prototype.charCodeAt

Replace both occurrences of “the code unit value of the character” with “the value of the code unit”.

15.12.1.1 The JSON Lexical Grammar

Leave “the characters ‘JSON’” as is.

15.12.3 stringify ( value [ , replacer [ , space ] ] )

Replace step 2.c of the algorithm for the abstract operation Quote with:

Leave the word “character” in step 8.b.iii.1 of the abstract operation JO as is, but fix the preceding word.

In note 3, after “Control characters” insert “in the range U+0000 to U+001F”.

Code Point Based String Accessors

The following proposed functions would make it easier for developers to implement code point based functionality. They mirror the functionality of String.fromCharCode and String.prototype.charCodeAt. It’s unfortunately not possible to extend the existing functions to handle code points: fromCharCode currently coerces larger values into 16 bits by setting all unneeded bits to 0, and callers passing in larger values might be confused if their interpretation changed. charCodeAt is specified to return a value below 0x10000, and callers may not be able to handle larger values.
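For reference, this is the existing behaviour that rules out extending the current functions (both lines reflect ECMAScript 5.1 as specified):

String.fromCharCode(0x20BB7);    // "\u0BB7", because the argument is coerced to 16 bits
"\uD842\uDFB7".charCodeAt(0);    // 0xD842, only the high surrogate; never a value above 0xFFFF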

15.5.3.2 String.fromCodePoint ( [ cp0 [ , cp1 [ , ... ] ] ] )

Returns a String value containing as many code units as necessary to represent the code points given by the arguments. Each argument specifies one code point to be represented in the resulting String, with the first argument specifying the first code point, and so on, from beginning to end. The function behaves as if it were defined as:

String.fromCodePoint = function () {
    var chars = [], i;
    for (i = 0; i < arguments.length; i++) {
        var c = Number(arguments[i]);
        if (!isFinite(c) || c < 0 || c > 0x10FFFF || Math.floor(c) !== c) {
            throw new RangeError("Invalid code point " + c);
        }
        if (c < 0x10000) {
            chars.push(c);
        } else {
            // Encode a supplementary code point as a UTF-16 surrogate pair.
            c -= 0x10000;
            chars.push((c >> 10) + 0xD800);
            chars.push((c % 0x400) + 0xDC00);
        }
    }
    return String.fromCharCode.apply(undefined, chars);
};

The length property of the fromCodePoint function is 1.

15.5.4.6 String.prototype.codePointAt (pos)

Returns a Number (a nonnegative integer less than or equal to 0x10FFFF) representing the code point value of the code unit sequence starting at position pos in the String resulting from converting this object to a String. If there is no code unit at that position, the result is NaN. The function behaves as if it were defined as:

String.prototype.codePointAt = function (index) {
    var str = String(this);
    var first = str.charCodeAt(index);
    // If the code unit at index is a high surrogate followed by a low
    // surrogate, return the supplementary code point of the pair.
    if (first >= 0xD800 && first <= 0xDBFF && str.length > index + 1) {
        var second = str.charCodeAt(index + 1);
        if (second >= 0xDC00 && second <= 0xDFFF) {
            return ((first - 0xD800) << 10) + (second - 0xDC00) + 0x10000;
        }
    }
    // Otherwise return the value of the code unit itself (NaN if there is
    // no code unit at index).
    return first;
};

NOTE: The codePointAt function is intentionally generic; it does not require that its this value be a String object. Therefore it can be transferred to other kinds of objects for use as a method.
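A brief usage sketch based on the definitions above:

var s = String.fromCodePoint(0x20BB7);  // "\uD842\uDFB7", i.e. the character 𠮷
s.length;                               // 2, because the String holds two code units
s.codePointAt(0);                       // 0x20BB7, the supplementary code point
s.codePointAt(1);                       // 0xDFB7, the trailing surrogate viewed on its own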

What About new Unicode Escapes?

A new form of Unicode escapes has been proposed that allows the expression of code points rather than code units. For example, the character 𠮷 could be written as "\u{20BB7}" in addition to the existing forms "𠮷" and "\uD842\uDFB7". The new form is more readable than the old escape form (although the character itself of course is best in that respect) and easier to find in Unicode tables. I don’t consider this feature essential, but it’s nice to have, and can be integrated with the proposals above without too much difficulty.

Note that ECMAScript cannot add the new escape form to JSON, because JSON is defined by its own specification.

What About UTF-32?

There have been several proposals to change ECMAScript to use Unicode code points directly to represent source code and String values – since code points need at least 21 bits, and computers don’t have 21-bit integers, this effectively means UTF-32. By eliminating the code unit / code point duality, this would make text processing in ECMAScript easier to understand, and it would slightly simplify the work of developers implementing low-level string processing in ECMAScript.

However, this change would also create serious compatibility issues. Take the string “𠮷野家” (Yoshinoya), the name of a Japanese beef bowl restaurant chain. Since the first character is a supplementary character, the length of the string in a current ECMAScript implementation is 4, and the code points start at positions 0, 2, and 3. In a UTF-32 based implementation, the length of the string would be 3, and the code points would start at positions 0, 1, and 2. The two systems, and libraries and applications running on them, could not easily exchange length or position information about this string. Similarly, positions in ECMAScript strings would no longer match DOM offsets for the same strings, because DOM string offsets (e.g., in the CharacterData interface) are UTF-16 based. Maybe somebody can come up with a solution that solves these issues so that developers don’t have to worry about them, but I haven’t seen it yet (previous discussions on the ECMAScript mailing list started here and here, continued here).
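To make the mismatch concrete (the UTF-32 figures are of course hypothetical):

var s = "𠮷野家";       // U+20BB7 U+91CE U+5BB6
s.length;               // 4 in today's UTF-16 based implementations
s.charCodeAt(0);        // 0xD842, the high surrogate of U+20BB7
s.charCodeAt(2);        // 0x91CE, the second code point starts at index 2
// In a hypothetical UTF-32 based implementation, s.length would be 3 and the
// three code points would sit at indices 0, 1, and 2.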

In addition, code points are in many cases not the right abstraction either for text processing. Many operations have to operate on grapheme clusters, the substrings that a user would perceive as a character.

To really help developers, the focus shouldn’t be on access to individual code points. It should be on more and better functions to process text at higher levels of abstraction. Regular expressions with support for Unicode properties and grapheme clusters would be an excellent start.