Supplementary Characters for ECMAScript
March 15, 2012
- Terms and Definitions
- Text Interpretation
- Regular Expressions
- Other Text Processing Functions
- Upgrade to Unicode with Supplementary Characters
- Code Point Based String Accessors
- What About new Unicode Escapes?
- What About UTF-32?
ECMAScript, the standard underlying JavaScript, has so far been ambivalent about supporting supplementary characters, those Unicode characters that are outside the 16-bit code space originally envisioned for Unicode and require more than 16 bits for their encoding. The ECMAScript standard specifies UTF-16, the Unicode encoding form designed to extend the original 16-bit encoding to support supplementary characters, as the encoding of source text and strings, and implementations generally allow supplementary characters to be used. Browsers have surrounded ECMAScript implementations with text input, text rendering, DOM APIs, and XMLHttpRequest with full Unicode support, and generally use full UTF-16 to exchange text with their ECMAScript subsystem. Developers have used this to build applications that support supplementary characters. However, some text processing functionality in ECMAScript itself is defined to operate on individual code units, so that it cannot correctly interpret supplementary characters.
This article proposes a set of ECMAScript specification changes that enable correct processing of supplementary characters while maintaining compatibility with existing applications. The basic idea is to keep the existing text representation, but change operations that interpret text to be based on Unicode code points. Supplementary characters are then interpreted as atomic entities with their correct Unicode semantics, just like characters in the Basic Multilingual Plane. This is similar to the design of supplementary character support in Java. The changes are presented in order of priority from an application development point of view.
All changes described are relative to the ECMAScript Language Specification, 5.1 edition.
Terms and Definitions
The ECMAScript Language Specification currently uses the term “character” throughout, but redefines it to actually mean “code unit”. In order to clearly describe a system that is based on 16-bit code units, but supports all Unicode characters, we need to be a bit more precise in our terminology. This proposal relies on the following three definitions, which would replace the definition of “String value” in section 4.3.16, and would largely remove the need to use the overloaded term “character”.
4.3.16 String value: primitive value that is a finite ordered sequence of zero or more code units
NOTE: A String value is a member of the String type. Where ECMAScript operations interpret code units, they are interpreted as UTF-16 code units. However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so it may be ill-formed when interpreted as a UTF-16 code unit sequence.
4.3.17 Code unit: unsigned 16-bit integer. All values from 0 to FFFF₁₆ are allowed. Code unit values are described by the string “0x” followed by four hexadecimal digits.
4.3.18 Code point: unsigned integer value in the range from 0 to 10FFFF₁₆. Code point values are described by the string “U+” followed by four to six hexadecimal digits.
NOTE: This definition follows that of the Unicode standard, so this specification may use “code point” and “Unicode code point” interchangeably.
NOTE: This specification may use the word “character” to refer to specific assigned code points in the Basic Multilingual Plane, where the distinction between code units and code points is irrelevant.
Text Interpretation
In order to support supplementary characters, some ECMAScript operations will have to interpret UTF-16 code unit sequences as code point sequences. This new section describes how. For compatibility with existing applications, it has to allow surrogate code points (code points between U+D800 and U+DFFF which can never represent characters).
5.3 Text Interpretation
Text is represented in ECMAScript as a sequence of UTF-16 code units, but sometimes interpreted as a sequence of code points. Interpretation is as follows:
- A code unit in the range 0 to 0xD7FF or in the range 0xE000 to 0xFFFF is interpreted as a code point with the same value.
- A sequence of two code units, where the first code unit c1 is in the range 0xD800 to 0xDBFF and the second code unit c2 is in the range 0xDC00 to 0xDFFF, is a surrogate pair and is interpreted as a code point with the value (c1 - 0xD800) * 0x400 + (c2 - 0xDC00) + 0x10000.
- A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value.
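The three rules above can be sketched in ECMAScript 5.1 itself. The helper name toCodePoints is hypothetical and not part of the proposal; it only illustrates how paired and unpaired surrogates are treated.

```javascript
// Hypothetical helper (not part of the proposal): interprets a String as a
// sequence of code points following the three rules of 5.3.
function toCodePoints(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var c1 = str.charCodeAt(i);
    if (c1 >= 0xD800 && c1 <= 0xDBFF && i + 1 < str.length) {
      var c2 = str.charCodeAt(i + 1);
      if (c2 >= 0xDC00 && c2 <= 0xDFFF) {
        // Surrogate pair: combine into one supplementary code point.
        result.push((c1 - 0xD800) * 0x400 + (c2 - 0xDC00) + 0x10000);
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    // BMP code unit or unpaired surrogate: code point with the same value.
    result.push(c1);
  }
  return result;
}

// "a", U+20BB7 as a surrogate pair, then an unpaired high surrogate.
var interpreted = toCodePoints("a\uD842\uDFB7\uD842");
// interpreted is [0x61, 0x20BB7, 0xD842]
```

The trailing unpaired surrogate is kept as a code point of its own rather than rejected, which is the compatibility behavior the section asks for.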
Regular Expressions
Regular expressions are the most important part of ECMAScript that needs to support supplementary characters. This section updates subclause 15.10 of the ECMAScript Language Specification, to interpret both patterns and the text to be matched as code points. It relies on the definitions and text interpretation provided above.
Benefits of this proposal include:
- /^.$/ now matches any Unicode code point; in 5.1, it matches only BMP characters.
- Supplementary characters can now be used to define the end points of a range; in 5.1, this results in a SyntaxError.
- If a Quantifier follows an Atom that consists of a supplementary code point, then it now applies to the complete code point, not just its second surrogate code unit.
- Supplementary characters are now treated as entities in a class; in 5.1, the code units representing them become separate class members.
- Case insensitive matching works for supplementary characters.
- An identity escape for a supplementary character now escapes the complete code point, not just its first code unit.
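For comparison, the code unit based behavior that these items refer to can be observed with regular expressions that do not opt into any code point mode. The following is a small sketch; the variable names are illustrative only.

```javascript
var supp = "\uD842\uDFB7"; // U+20BB7 as a surrogate pair

// "." consumes one code unit, so a lone supplementary character is not matched.
var dotMatches = /^.$/.test(supp);

// The quantifier binds to the low surrogate only, so a repeated low
// surrogate is accepted.
var quantifierOnLowSurrogate = /^\uD842\uDFB7+$/.test(supp + "\uDFB7");

// The class holds two separate code units, so a lone high surrogate matches.
var classSplitsPair = /^[\uD842\uDFB7]$/.test("\uD842");

// A range with supplementary end points is read as the code unit 0xD842,
// the range 0xDFB7-0xD842 (out of order), and the code unit 0xDFB8.
var rangeError = null;
try {
  new RegExp("[\uD842\uDFB7-\uD842\uDFB8]");
} catch (e) {
  rangeError = e;
}
// dotMatches is false; the next two are true; rangeError is a SyntaxError.
```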
Some libraries, such as this library to convert some regular expressions with supplementary characters and this function to detect the directionality of text, have used unpaired surrogate code units to work around the current inability of ECMAScript regular expressions to support supplementary characters. For compatibility with such libraries and applications using them, the proposal includes a transformation of such workaround regular expressions into equivalent regular expressions using well-formed supplementary characters. For \uwwww, \uxxxx high surrogates; \uyyyy, \uzzzz low surrogates:
- \uxxxx[\uyyyy-\uzzzz] is interpreted as [\uxxxx\uyyyy-\uxxxx\uzzzz]
- [\uwwww-\uxxxx][\uyyyy-\uzzzz] is interpreted as [\uwwww\uyyyy-\uxxxx\uzzzz]
This transformation is rather ugly, but I’m afraid it’s the price ECMAScript has to pay for being 12 years late in supporting supplementary characters. It may be possible to omit this step for regular expression literals or RegExp constructor calls that occur within modules.
Interpreting patterns and input text as sequences of code points may have a compatibility impact, which we’ll have to monitor and evaluate. For example, some applications might have processed gunk with regular expressions where neither the “characters” in the patterns nor the input to be matched are text. Others might be surprised that s.match(/^.$/)[0].length can now be 2. If this turns out to be a problem for existing applications, we may have to offer regular expressions in separate code unit and code point modes, with the latter enabled via a "u" flag.
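The length surprise can be reproduced in engines that implement the "u" flag as it was later standardized in ECMAScript 2015. Treating that flag as equivalent to the code point mode discussed here is an assumption of this sketch, not something this proposal specifies.

```javascript
var s = "\uD842\uDFB7"; // one supplementary character, two code units

// Code unit mode: "." covers one code unit, so the anchored pattern fails.
var codeUnitMatch = s.match(/^.$/);

// Code point mode (assumes an engine with the ES2015 "u" flag).
var codePointMatch = s.match(/^.$/u);
var matchedLength = codePointMatch[0].length;
// codeUnitMatch is null; matchedLength is 2.
```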
Details (skip)
Make the following substitutions throughout sections 15.10 and A.7:
- SourceCharacter → SourceCodePoint
- PatternCharacter → PatternCodePoint
- CharacterEscape → CodePointEscape
- CharacterClassEscape → CodePointClassEscape
- CharacterClass → CodePointClass
- CharSet → CodePointSet
- CharacterSetMatcher → CodePointSetMatcher
15.10 RegExp (Regular Expression) Objects
Before the period at the end of the first sentence, insert: “, after preprocessing the pattern as described in 15.10.4.1. When matching against the SourceCodePoint non-terminal, it interprets the pattern String as a sequence of code points, as described in 5.3”.
15.10.1 Patterns
At the end of 15.10.1, insert:
SourceCodePoint ::
any code point
15.10.2.1 Notation
Replace the first two bullet items in the first list with:
- Input is the String being matched by the regular expression pattern. Input is interpreted as a sequence of code points, as described in 5.3. The notation Input{n} means the code point starting at position n of Input, where n can range between 0 (inclusive) and InputLength (exclusive).
- InputLength is the number of code units in the Input String.
Replace the first bullet item in the second list with:
- A CodePointSet is a mathematical set of code points.
In the second bullet item in the same list, replace “the index of the last input character” with “the index of the last code unit of the last input code point”.
In the last bullet item in the same list, replace “a character or” with “a code point or”, and “a character ch means that the escape sequence is interpreted as the character ch” with “a code point cp means that the escape sequence is interpreted as the code point cp”.
15.10.2.6 Assertion
Replace “character Input[e–1]” with “the result of calling CodePointEndingAt with argument e-1”.
Replace “character Input[e]” with “code point Input{e}”.
Replace both occurrences of “IsWordChar(e–1)” with “IsWordChar(e, true)”, and both occurrences of “IsWordChar(e)” with “IsWordChar(e, false)”.
In the description of IsWordChar, after “an integer parameter e” insert “and a boolean parameter before”.
Replace the first two steps of the IsWordChar algorithm with:
- If before is true, then decrease e by 1.
- If e == –1 or e == InputLength, return false.
- If before is true, then let c be the result of calling CodePointEndingAt with argument e; else let c be the code point Input{e}.
Add the following abstract operation:
The abstract operation CodePointEndingAt takes an integer parameter e, which must be non-negative and less than InputLength, and performs the following:
- If e > 0 and Input{e-1} > U+FFFF then return Input{e-1}.
- Else return Input{e}.
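A minimal sketch of CodePointEndingAt, written against a plain String instead of the specification’s Input. The function names inputAt and codePointEndingAt are hypothetical; inputAt plays the role of the Input{n} notation.

```javascript
// Input{n}: the code point starting at code unit position n (see 5.3).
function inputAt(input, n) {
  var c1 = input.charCodeAt(n);
  if (c1 >= 0xD800 && c1 <= 0xDBFF && n + 1 < input.length) {
    var c2 = input.charCodeAt(n + 1);
    if (c2 >= 0xDC00 && c2 <= 0xDFFF) {
      return (c1 - 0xD800) * 0x400 + (c2 - 0xDC00) + 0x10000;
    }
  }
  return c1;
}

// The code point whose last code unit is at position e.
function codePointEndingAt(input, e) {
  if (e > 0 && inputAt(input, e - 1) > 0xFFFF) {
    // Position e holds the low surrogate of a pair starting at e - 1.
    return inputAt(input, e - 1);
  }
  return inputAt(input, e);
}

var sample = "a\uD842\uDFB7b";
var endingAt0 = codePointEndingAt(sample, 0); // "a"
var endingAt2 = codePointEndingAt(sample, 2); // the supplementary character
var endingAt3 = codePointEndingAt(sample, 3); // "b"
```

Looking one code unit back is enough because a supplementary code point always occupies exactly two code units.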
15.10.2.8 Atom
Replace the first two steps of the algorithm for Atom :: PatternCharacter with:
- Let cp be the code point represented by PatternCodePoint.
- Let A be a one-element CodePointSet containing the code point cp.
Replace the first step of the algorithm for Atom :: . with:
- Let A be the set of all code points except LineTerminator.
In step 3.1.4 of the algorithm for Atom :: ( Disjunction ), replace “characters are the characters” with “code units are the code units”.
Replace steps 1.3 and 1.4 of the algorithm for CodePointSetMatcher with:
- Let cp be the code point Input{e}.
- Let cc be the result of Canonicalize(cp).
Replace step 1.8 of the same algorithm with:
- If cp > 0xFFFF then increment e by 2; else increment e by 1.
- Let y be the State (e, cap).
Replace the abstract operation Canonicalize with:
The abstract operation Canonicalize takes a code point parameter cp and performs the following steps:
- If IgnoreCase is false, return cp.
- If the Unicode Character Database provides a context and language independent upper case equivalent for cp, and that equivalent consists of a single code point, then
- Let cu be that equivalent.
- If cp ≤ 0xFF or cu > 0xFF then return cu.
- Return cp.
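The Canonicalize steps above can be approximated as follows. This is a sketch under two assumptions: String.fromCodePoint and codePointAt are available (as in engines implementing ECMAScript 2015), and String.prototype.toUpperCase stands in for the Unicode Character Database lookup. IgnoreCase is passed as a parameter here for illustration.

```javascript
// Sketch of the proposed Canonicalize; toUpperCase is only a stand-in for
// the "context and language independent upper case equivalent".
function canonicalize(cp, ignoreCase) {
  if (!ignoreCase) {
    return cp;
  }
  var upper = String.fromCodePoint(cp).toUpperCase();
  var cu = upper.codePointAt(0);
  // Only accept an upper case equivalent that is a single code point.
  var isSingleCodePoint = upper.length === (cu > 0xFFFF ? 2 : 1);
  if (isSingleCodePoint && (cp <= 0xFF || cu > 0xFF)) {
    return cu;
  }
  return cp;
}

var plainA = canonicalize(0x61, true);     // "a" maps to "A"
var sharpS = canonicalize(0xDF, true);     // upper case is two code points
var dotlessI = canonicalize(0x131, true);  // would map into 0x00..0xFF
var untouched = canonicalize(0x61, false); // IgnoreCase is false
```

The last condition keeps code points above 0xFF from folding onto code points at or below 0xFF, which is why U+0131 is returned unchanged.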
In the note at the end of the section, replace “\u00DF” with “U+00DF”, “\u0131” with “"ı" (U+0131)”, “\u017F” with “"ſ" (U+017F)”, both occurrences of “ASCII character” with “code point in the Basic Latin block”, “ASCII letters” with “letters in the Basic Latin block”, and all remaining occurrences of “character” with “code point”.
15.10.2.9 AtomEscape
In the algorithm for AtomEscape :: DecimalEscape, replace step 2 with:
- If E is a code point, then
- Let cp be E’s code point.
- Let A be a one-element CodePointSet containing the code point cp.
- Call CodePointSetMatcher(A, false) and return its Matcher result.
In step 5.8 of the same algorithm, replace “Canonicalize(s[i]) is not the same character as Canonicalize(Input [e+i])” with “Canonicalize(s{i}) is not the same code point as Canonicalize(Input{e+i})”.
Replace the first two steps of the algorithm for AtomEscape :: CodePointEscape with:
- Evaluate CodePointEscape to obtain a code point cp.
- Let A be a one-element CodePointSet containing the code point cp.
15.10.2.10 CodePointEscape
In the first paragraph, replace “character” with “code point”. In the following table, replace “Code Unit” with “Code Point”, and replace all occurrences of “\u” with “U+”.
Replace the algorithm for CodePointEscape :: c ControlLetter with:
- Let cp be the code point represented by ControlLetter.
- Return the remainder of dividing cp by 32.
In the remaining three paragraphs, replace “character” with “code point”.
15.10.2.11 DecimalEscape
In step 2 of the algorithm, replace “a <NUL> character (Unicode value 0000)” with “the code point U+0000 (NULL)”.
15.10.2.12 CharacterClassEscape
Replace all occurrences of “characters” with “code points”.
15.10.2.15 NonemptyClassRanges
In step 4 of the algorithm for NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges, replace “CharacterRange” with “CodePointRange”.
Replace the description of the abstract operation CharacterRange with:
The abstract operation CodePointRange takes two CodePointSet parameters A and B and performs the following:
- If A does not contain exactly one code point or B does not contain exactly one code point then throw a SyntaxError exception.
- Let a be the one code point in CodePointSet A.
- Let b be the one code point in CodePointSet B.
- If a > b then throw a SyntaxError exception.
- Return the set containing all code points from a through b, inclusive.
In note 1, 2, and 3, replace all occurrences of “character” with “code point”.
In note 2, replace “ASCII letters” with “letters in the Basic Latin block”.
15.10.2.17 ClassAtom
Replace “character” with “code point”.
15.10.2.18 ClassAtomNoDash
Replace “character” with “code point”.
15.10.2.19 ClassEscape
Replace steps 2-4 of the algorithm for ClassEscape :: DecimalEscape with:
- If E is not a code point then throw a SyntaxError exception.
- Let cp be E’s code point.
- Return the one-element CodePointSet containing the code point cp.
Replace “the one character <BS> (Unicode value 0008)” with “the one code point U+0008 (BACKSPACE)”.
Replace all remaining occurrences of “character” with “code point”.
15.10.4.1 new RegExp(pattern, flags)
After the first paragraph of section 15.10.4.1, insert the description of two preprocessing steps. The first step removes the need for subsequent processing to worry about Unicode escape sequences while looking for supplementary code points or surrogate code units. The second step provides compatibility with the workarounds that some libraries have used to support supplementary characters in regular expressions.
P is preprocessed as follows:
- Replace all Unicode escape sequences representing code units above 0x00FF with the code units they represent (Unicode escape sequences representing code units up to 0x00FF are left unchanged to avoid confusion with syntactically significant characters).
- If P contains any unpaired surrogate values (a code unit between 0xD800 and 0xDBFF not immediately followed by a code unit between 0xDC00 and 0xDFFF, or a code unit between 0xDC00 and 0xDFFF not immediately preceded by a code unit between 0xD800 and 0xDBFF), then
- Replace any substring of P consisting of a code unit c1 between 0xD800 and 0xDBFF, followed by [, followed by a sequence of code units not including ], followed by ], with a substring omitting c1 before [, but inserting c1 before any code unit between 0xDC00 and 0xDFFF occurring between [ and ].
- Replace any substring of P consisting of [, followed by a code unit c1 between 0xD800 and 0xDBFF, followed by -, followed by a code unit c2 between 0xD800 and 0xDBFF greater than c1, followed by ], followed by [, followed by a code unit c3 between 0xDC00 and 0xDFFF, followed by -, followed by a code unit c4 between 0xDC00 and 0xDFFF greater than c3, followed by ], with a substring formed by [, followed by c1, followed by c3, followed by -, followed by c2, followed by c4, followed by ].
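The second preprocessing step can be sketched with ordinary code unit based string replacement. The function names are hypothetical, and the sketch assumes that Unicode escape sequences in the pattern have already been replaced by the code units they represent, as the first step requires.

```javascript
// True if p contains a surrogate code unit that is not part of a pair.
function hasUnpairedSurrogate(p) {
  for (var i = 0; i < p.length; i++) {
    var c = p.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {
      var next = p.charCodeAt(i + 1); // NaN past the end
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // well-formed pair, skip its low surrogate
      } else {
        return true;
      }
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
      return true; // low surrogate without a preceding high surrogate
    }
  }
  return false;
}

function rewriteWorkarounds(p) {
  if (!hasUnpairedSurrogate(p)) {
    return p;
  }
  // c1[...] becomes [...] with c1 inserted before each low surrogate.
  p = p.replace(/([\uD800-\uDBFF])\[([^\]]*)\]/g, function (m, c1, body) {
    return "[" + body.replace(/[\uDC00-\uDFFF]/g, function (low) {
      return c1 + low;
    }) + "]";
  });
  // [c1-c2][c3-c4] becomes [c1c3-c2c4].
  return p.replace(
    /\[([\uD800-\uDBFF])-([\uD800-\uDBFF])\]\[([\uDC00-\uDFFF])-([\uDC00-\uDFFF])\]/g,
    function (m, c1, c2, c3, c4) {
      return (c2 > c1 && c4 > c3) ? "[" + c1 + c3 + "-" + c2 + c4 + "]" : m;
    });
}

var rewritten = rewriteWorkarounds("\uD835[\uDC00-\uDC19]");
// rewritten is "[\uD835\uDC00-\uD835\uDC19]"
```

Patterns that contain only well-formed surrogate pairs are returned unchanged, so the rewrite only affects the workaround style.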
In the following paragraph, replace “the characters of P do” with “P interpreted as a code point sequence does”, and “the characters of P as” with “P interpreted as a code point sequence as”.
Other Text Processing Functions
As a general principle, all functions in ECMAScript that interpret strings have to recognize supplementary characters as atomic entities and interpret them according to their Unicode semantics. As it turns out, beyond the functions using regular expressions there aren’t many in ECMAScript 5.1 that violate this principle: The String case conversion functions (toLowerCase, toLocaleLowerCase, toUpperCase, toLocaleUpperCase) and the relational comparison for strings (11.8.5).
For the relational comparison for strings, stability and compatibility may matter more than semantic correctness – to sort strings for display to the user, you want to use the Collator objects of the upcoming ECMAScript Internationalization API anyway. I’m therefore not proposing to change this operation.
Changing the case conversion functions is theoretically also an incompatible change. However, the only affected code points are those in the Deseret block of Unicode, and it’s very unlikely that applications would depend on Deseret characters not being mapped to lower case while calling toLowerCase. Note that Safari is already following the Unicode standard rather than the ECMAScript standard in implementing these functions.
Details (skip)
15.5.4.16 String.prototype.toLowerCase ( )
Before the first sentence, insert: “This method interprets a string as a sequence of code points, as described in 5.3.”
Replace step 3 of the algorithm with:
- Let cpList be a new List.
- For each code point scp in S, from beginning to end, do
- If the Unicode Character Database provides a language insensitive lower case equivalent of scp, then append the code points provided to cpList; otherwise append scp itself.
- Let L be the result of calling String.fromCodePoint, providing the code points in cpList as arguments.
Remove the paragraph following the algorithm.
Remove “in Unicode 2.1.8 and later”.
In the first sentence of Note 1, replace both occurrences of “characters” with “code points”.
Upgrade to Unicode with Supplementary Characters
The previous sections were narrowly focused on functionality needed to process end user text. This section is the omnibus proposal for upgrading ECMAScript to a Unicode version released in this century, adding supplementary character support to identifiers, and removing the redefinition of characters as code units.
The baseline Unicode version is changed to 5.1, which was released in April 2008 and is the version supported by Windows 7 and the .NET framework 4, probably the oldest platforms on which somebody might implement ECMAScript 6 (implementations on older platforms would have to bring their own tables for identifier characters and case conversion). Unicode 5.1 includes supplementary characters, so all restrictions to the Basic Multilingual Plane are removed, along with the one and only reference to UCS-2 in the specification.
Identifiers are specified differently in 5.1 than in previous Unicode versions; the proposed definition of ECMAScript identifiers aligns more closely with the Unicode specification, and allows supplementary characters in identifiers. Unicode escapes in identifiers are handled as a separate preprocessing step to avoid additional complexity in the IdentifierName grammar.
The ECMAScript Language Specification, 5.1 edition, redefines characters as code units, creating endless confusion. This proposal removes most uses of “character” and uses the appropriate choice of “code unit” or “code point” instead.
In order to maintain compatibility with existing ECMAScript, the proposal does not change the representation of source text or String values. There’s more on this topic in the What About UTF-32? section below. Also for compatibility, ill-formed UTF-16 code sequences and surrogate code points are allowed.
Details (skip)
Global Substitutions
Replace any occurrence of the word “character” with “code unit”, and any occurrence of the word “characters” with “code units”, throughout the document, except in the following situations:
- A different change is described in this document.
- The occurrence is in one of the following sections:
- 7.1 Unicode Format-Control Characters
- 7.2 White Space
- 7.3 Line Terminators
- 15.10 RegExp (Regular Expression) Objects
- A.6 Universal Resource Identifier Character Classes
- A.7 Regular Expressions
- The word is used to refer to specific assigned code points in the Basic Multilingual Plane, where the distinction between code units and code points is irrelevant, e.g. “the comma character”.
Make the following substitutions throughout the document except section 15.10:
- SourceCharacter → SourceCodeUnit
- DoubleStringCharacters → DoubleStringCodeUnits
- DoubleStringCharacter → DoubleStringCodeUnit
- SingleStringCharacters → SingleStringCodeUnits
- SingleStringCharacter → SingleStringCodeUnit
- CharacterEscapeSequence → CodeUnitEscapeSequence
- NonEscapeCharacter → NonEscapeCodeUnit
- JSONStringCharacters → JSONStringCodeUnits
- JSONStringCharacter → JSONStringCodeUnit
Replace any occurrence of “\u” within a table under the column heading “code unit value” with “0x”. “\u” is a notation for code unit values in a few places in source text; this shouldn’t be confused with code unit values themselves.
2 Conformance
Replace the second paragraph of clause 2, Conformance, with:
A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 5.1.0 or later and ISO/IEC 10646 with UTF-16 as the adopted encoding form. If the adopted ISO/IEC 10646 subset is not otherwise specified, it is presumed to be the Unicode set, collection 10646.
3 Normative References
Replace the second entry of clause 3, Normative references, with references to Unicode 5.1 and the corresponding ISO 10646 version.
ISO/IEC 10646:2003: Information Technology – Universal Multiple-Octet Coded Character Set (UCS) plus Amendment 1:2005, Amendment 2:2006, Amendment 3:2008, and Amendment 4:2008, plus additional amendments and corrigenda, or successor
The Unicode Standard, Version 5.0, as amended by Unicode 5.1.0, or successor
Unicode Standard Annex #15, Unicode Normalization Forms, version Unicode 5.1.0, or successor
Unicode Standard Annex #31, Unicode Identifiers and Pattern Syntax, version Unicode 5.1.0, or successor.
5.1.2 The Lexical and RegExp Grammars
In the first paragraph, replace “characters (Unicode code units)” with “code units”.
5.1.6 Grammar Notation
Replace the last sentence of the first paragraph with:
All terminal symbol characters specified in this way are to be understood as the UTF-16 code units for the appropriate Unicode characters from the Basic Latin block, as opposed to any similar-looking characters from other Unicode blocks.
In the last paragraph, replace “any Unicode code unit” with “any code unit”.
6 Source Text
Replace clause 6, Source Text, with:
ECMAScript source text is assumed to be a (not necessarily well-formed) sequence of UTF-16 code units for the purposes of this specification. Source text encoded in other character encodings than UTF-16 must be processed as if it was first converted to UTF-16. The text is expected to have been normalised to Unicode Normalization Form C (Canonical Decomposition, followed by Canonical Composition), as described in Unicode Standard Annex #15. Conforming ECMAScript implementations are not required to perform any normalisation of text, or behave as though they were performing normalisation of text, themselves. The UTF-16 code unit sequence will be interpreted as Unicode code points, as described in 5.3, and according to the character properties defined by the Unicode Standard, version 5.1.0 or later.
Syntax
SourceCodeUnit ::
any code unit
In string literals, regular expression literals, and identifiers, any code unit may also be expressed as a Unicode escape sequence consisting of six characters, namely \u plus four hexadecimal digits. Within a string literal or regular expression literal, the Unicode escape sequence contributes one code unit to the value of the literal. Within an identifier, the escape sequence contributes one code unit to the identifier.
NOTE: ECMAScript differs from the Java programming language in the behaviour of Unicode escape sequences. In a Java program, if the Unicode escape sequence \u000A, for example, occurs within a single-line comment, it is interpreted as a line terminator (Unicode character U+000A is line feed) and therefore the next character is not part of the comment. Similarly, if the Unicode escape sequence \u000A occurs within a string literal in a Java program, it is likewise interpreted as a line terminator, which is not allowed within a string literal—one must write \n instead of \u000A to cause a line feed to be part of the string value of a string literal. In an ECMAScript program, a Unicode escape sequence occurring within a comment is never interpreted and therefore cannot contribute to termination of the comment. Similarly, a Unicode escape sequence occurring within a string literal in an ECMAScript program always contributes a character to the String value of the literal and is never interpreted as a line terminator or as a quote mark that might terminate the string literal.
While parsing ECMAScript source code, code unit sequences are commonly interpreted as code point sequences, e.g., to determine whether a code unit sequence can form an identifier. The interpretation is as described in section 5.3.
7.2 White Space
Replace “Unicode 3.0” with “Unicode 5.1.0”.
7.6 Identifier Names and Identifiers
Replace the first three paragraphs of 7.6 with:
Identifier Names are tokens that are interpreted according to the Default Identifier Syntax given in Unicode Standard Annex #31, Identifier and Pattern Syntax, with some small modifications. The Unicode identifier grammar is based on character properties specified by the Unicode Standard. The code points in the specified categories in version 5.1.0 of the Unicode standard must be treated as in those categories by all conforming ECMAScript implementations.
This standard specifies specific character additions: The dollar sign (U+0024) and the underscore (U+005F) are permitted anywhere in an IdentifierName, and the characters zero width non-joiner (U+200C) and zero width joiner (U+200D) can be used after the first code point.
While scanning identifiers, any Unicode escape sequences consisting of the character “\” followed by UnicodeEscapeSequence in the source text are replaced by the CV of the UnicodeEscapeSequence (see 7.8.4). The resulting code unit sequence is then interpreted as a sequence of code points, as described in 5.3.
An Identifier is an IdentifierName that is not a ReservedWord (see 7.6.1).
Replace the fifth paragraph of 7.6 with:
ECMAScript implementations may recognise identifier code points defined in later editions of the Unicode Standard. If portability is a concern, programmers should only employ identifier code points defined in Unicode 5.1.0.
Replace the productions starting with the one for IdentifierStart with:
IdentifierStart ::
UnicodeIDStart
$
_
IdentifierPart ::
UnicodeIDContinue
$
_
<ZWNJ>
<ZWJ>
UnicodeIDStart ::
any code point with the Unicode property “ID_Start”
UnicodeIDContinue ::
any code point with the Unicode property “ID_Continue”
7.8.3 Numeric Literals
Replace the sentence starting with “The source character immediately following...” with “The source code unit sequence immediately following a NumericLiteral must not start with an IdentifierStart or DecimalDigit.”
7.8.4 String Literals
Replace the first paragraph with:
A string literal is zero or more code units enclosed in the code units representing single (0x0027) or double (0x0022) quotes. Each code unit may be represented by an escape sequence. All code units may appear literally in a string literal except for the code units for the closing quote character, backslash (0x005C), carriage return (0x000D), line feed (0x000A), line separator (0x2028), and paragraph separator (0x2029). Any code unit may appear in the form of an escape sequence.
Replace “is a <NUL> character (Unicode value 0000)” with “is the code unit 0x0000, which represents the character NULL”.
Replace “is the character whose code unit value” with “is the code unit whose value”.
8.4 The String Type
In the first paragraph, remove “(see Clause 6)”. There’s nothing relevant to see there.
Replace the second paragraph with:
Where ECMAScript operations interpret String contents, each element is interpreted as a single UTF-16 code unit. However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so they may be ill-formed when interpreted as UTF-16 code unit sequences. Operations that do not interpret String contents treat them as sequences of undifferentiated 16-bit unsigned integers. No operations ensure that Strings are in normalized form. Only operations that are explicitly specified to be language or locale sensitive produce language-sensitive results.
11.8.5 The Abstract Relational Comparison Algorithm
Replace both occurrences of “integer that is the code unit value for the character” with “value of the code unit”. This leaves the algorithm using code unit semantics for compatibility with older versions of ECMAScript. For this kind of algorithm, stability is probably more important than semantic correctness.
15.1.3 URI Handling Function Properties
Replace “in Surrogates, section 3.7, of the Unicode Standard” with “in the description of UTF-16 in section 3.9, Unicode Encoding Forms, of the Unicode Standard”.
15.5.4.5 String.prototype.charCodeAt
Replace both occurrences of “the code unit value of the character” with “the value of the code unit”.
15.12.1.1 The JSON Lexical Grammar
Leave “the characters ‘JSON’” as is.
15.12.3 stringify ( value [ , replacer [ , space ] ] )
Replace step 2.c of the algorithm for the abstract operation Quote with:
- Else if C < 0x0020
Leave the word “character” in step 8.b.iii.1 of the abstract operation JO as is, but fix the preceding word.
In note 3, after “Control characters” insert “in the range U+0000 to U+001F”.
Code Point Based String Accessors
The following proposed functions would make it easier for developers to implement code point based functionality. They mirror the functionality of String.fromCharCode and String.prototype.charCodeAt. It’s unfortunately not possible to extend the existing functions to handle code points: fromCharCode currently coerces larger values into 16 bits by setting all unneeded bits to 0, and callers passing in larger values might be confused if their interpretation changed. charCodeAt is specified to return a value below 0x10000, and callers may not be able to handle larger values.
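The 16-bit limits of the existing functions are easy to observe. In this small sketch, the code point U+20BB7 loses its upper bits when passed to fromCharCode, and charCodeAt reports only the first code unit of the pair.

```javascript
// fromCharCode keeps only the low 16 bits: 0x20BB7 becomes 0x0BB7.
var truncated = String.fromCharCode(0x20BB7);
var truncatedUnit = truncated.charCodeAt(0);

// charCodeAt sees code units, not code points.
var firstUnit = "\uD842\uDFB7".charCodeAt(0);
// truncated.length is 1, truncatedUnit is 0x0BB7, firstUnit is 0xD842.
```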
15.5.3.2 String.fromCodePoint ( [ cp0 [ , cp1 [ , ... ] ] ] )
Returns a String value containing as many code units as necessary to represent the code points given by the arguments. Each argument specifies one code point to be represented in the resulting String, with the first argument specifying the first code point, and so on, from beginning to end. The function behaves as if it were defined as:
String.fromCodePoint = function () {
    var chars = [], i;
    for (i = 0; i < arguments.length; i++) {
        var c = Number(arguments[i]);
        // Reject anything that is not an integer in the range 0 to 0x10FFFF.
        if (!isFinite(c) || c < 0 || c > 0x10FFFF || Math.floor(c) !== c) {
            throw new RangeError("Invalid code point " + c);
        }
        if (c < 0x10000) {
            // BMP code point (including surrogate code points): one code unit.
            chars.push(c);
        } else {
            // Supplementary code point: split into a surrogate pair.
            c -= 0x10000;
            chars.push((c >> 10) + 0xD800);
            chars.push((c % 0x400) + 0xDC00);
        }
    }
    return String.fromCharCode.apply(undefined, chars);
};
The length property of the fromCodePoint function is 1.
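A short usage sketch, assuming an environment where String.fromCodePoint is available, either through a definition like the one above or natively (ECMAScript 2015 later added a function of this name).

```javascript
// One supplementary code point yields a surrogate pair.
var kichi = String.fromCodePoint(0x20BB7);

// BMP and supplementary code points can be mixed freely.
var mixed = String.fromCodePoint(0x61, 0x20BB7, 0x62);

// Values outside 0..0x10FFFF are rejected instead of being truncated.
var thrown = null;
try {
  String.fromCodePoint(0x110000);
} catch (e) {
  thrown = e;
}
// kichi is "\uD842\uDFB7", mixed.length is 4, thrown is a RangeError.
```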
15.5.4.6 String.prototype.codePointAt (pos)
Returns a Number (a nonnegative integer no greater than 0x10FFFF) representing the code point value of the code unit sequence starting at position pos in the String resulting from converting this object to a String. If there is no code unit at that position, the result is NaN. The function behaves as if it were defined as:
String.prototype.codePointAt = function (index) {
    var str = String(this);
    // charCodeAt returns NaN if there is no code unit at index.
    var first = str.charCodeAt(index);
    if (first >= 0xD800 && first <= 0xDBFF && str.length > index + 1) {
        var second = str.charCodeAt(index + 1);
        if (second >= 0xDC00 && second <= 0xDFFF) {
            // Well-formed surrogate pair: combine into one code point.
            return ((first - 0xD800) << 10) + (second - 0xDC00) + 0x10000;
        }
    }
    // BMP code unit, unpaired surrogate, or NaN.
    return first;
};
NOTE: The codePointAt function is intentionally generic; it does not require that its this value be a String object. Therefore it can be transferred to other kinds of objects for use as a method.
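A usage sketch, assuming codePointAt is available through a definition like the one above or natively. The loop stays within the bounds of the string, so it does not depend on what an implementation returns for positions past the end.

```javascript
var text = "\uD842\uDFB7\u91CE"; // U+20BB7 followed by U+91CE

var atPairStart = text.codePointAt(0); // the whole supplementary code point
var insidePair = text.codePointAt(1);  // the low surrogate on its own
var bmpUnit = text.codePointAt(2);     // a BMP code point

// Walk the string code point by code point, advancing by one or two
// code units depending on the value just read.
var collected = [];
for (var pos = 0; pos < text.length; ) {
  var cp = text.codePointAt(pos);
  collected.push(cp);
  pos += cp > 0xFFFF ? 2 : 1;
}
// collected is [0x20BB7, 0x91CE]
```

Positions remain code unit positions, so a caller that starts in the middle of a pair sees the low surrogate as a code point of its own.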
What About new Unicode Escapes?
A new form of Unicode escapes has been proposed that allows the expression of code points rather than code units. For example, the character 𠮷 could be written as "\u{20BB7}" in addition to the existing forms "𠮷" and "\uD842\uDFB7". The new form is more readable than the old escape form (although the character itself of course is best in that respect) and easier to find in Unicode tables. I don’t consider this feature essential, but it’s nice to have, and can be integrated with the proposals above without too much difficulty.
Note that ECMAScript cannot add the new escape form to JSON, because JSON is defined by its own specification.
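The equivalence of the escape forms can be checked in an engine that accepts the brace form, which is an assumption of this sketch (ECMAScript 2015 later adopted this syntax; an ECMAScript 5.1 engine rejects the first literal as a syntax error).

```javascript
var braceForm = "\u{20BB7}";     // code point escape
var pairForm = "\uD842\uDFB7";   // two code unit escapes
var sameString = braceForm === pairForm;
// sameString is true, and braceForm.length is 2.
```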
What About UTF-32?
There have been several proposals to change ECMAScript to use Unicode code points directly to represent source code and String values – since code points need at least 21 bits, and computers don’t have 21-bit integers, this effectively means UTF-32. By eliminating the duality code unit / code point, this would make text processing in ECMAScript easier to understand, and it would slightly simplify the work of developers implementing low-level string processing in ECMAScript.
However, this change would also create serious compatibility issues. Take the string “𠮷野家” (Yoshinoya), the name of a Japanese beef bowl restaurant chain. Since the first character is a supplementary character, the length of the string in a current ECMAScript implementation is 4, and the code points start at positions 0, 2, and 3. In a UTF-32 based implementation, the length of the string would be 3, and the code points would start at positions 0, 1, and 2. The two systems, and libraries and applications running on them, could not easily exchange length or position information about this string. Similarly, positions in ECMAScript strings would no longer match DOM offsets for the same strings, because DOM string offsets (e.g., in the CharacterData interface) are UTF-16 based. Maybe somebody can come up with a solution that solves these issues so that developers don’t have to worry about them, but I haven’t seen it yet (previous discussions on the ECMAScript mailing list started here and here, and continued here).
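The UTF-16 numbers in this example can be reproduced in ECMAScript 5.1. The helper name codePointStarts is hypothetical; it returns the code unit positions at which code points start.

```javascript
function codePointStarts(str) {
  var starts = [];
  for (var i = 0; i < str.length; i++) {
    starts.push(i);
    var c1 = str.charCodeAt(i);
    var c2 = str.charCodeAt(i + 1); // NaN past the end, so the test fails
    if (c1 >= 0xD800 && c1 <= 0xDBFF && c2 >= 0xDC00 && c2 <= 0xDFFF) {
      i++; // skip the low surrogate of a pair
    }
  }
  return starts;
}

var yoshinoya = "\uD842\uDFB7\u91CE\u5BB6";      // 𠮷野家
var utf16Length = yoshinoya.length;              // length in code units
var utf16Starts = codePointStarts(yoshinoya);    // code point start positions
var codePointCount = utf16Starts.length;         // length in code points
// utf16Length is 4, utf16Starts is [0, 2, 3], codePointCount is 3.
```

The code point count of 3 is the length a UTF-32 based implementation would report, which is the mismatch described above.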
In addition, code points are in many cases not the right abstraction for text processing either. Many operations have to operate on grapheme clusters, the substrings that a user would perceive as a character.
To really help developers, the focus shouldn’t be on access to individual code points. It should be on more and better functions to process text at higher levels of abstraction. Regular expressions with support for Unicode properties and grapheme clusters would be an excellent start.