Eliminating the Digital Divide in Java

June 28, 2005

Scott McNealy recently wrote about the community effort needed to eliminate the “digital divide” and will give a related presentation at JavaOne.

Software globalization is, of course, one of the critical pieces in this effort. A language barrier is a pretty effective divider. If software can’t render text and accept input in the user’s language(s), it’s not very useful. If the user can’t understand the user interface because it’s in a foreign language, her use of the software will be limited as well. And if the software doesn’t fit into the cultural, legal, and business environment of its intended users, it may not matter how cheap it is.

So I’d like to survey where we stand with Java internationalization and localization, how we enable the community to contribute, and what the remaining issues are.

One acronym you’ll see several times is “SPI,” for Service Provider Interface. By this we mean public interfaces in the Java platform APIs that let third-party developers extend the functionality of the Java runtime through new classes (and some identifying information) that are installed into the runtime’s extension directory. SPIs are one way to enable the community to provide support for additional languages.
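To make the idea of an SPI a bit more concrete, here is a minimal sketch built on the character encoding SPI, java.nio.charset.spi.CharsetProvider. The provider class, the nested charset class, and the charset name "X-DEMO-LATIN1" are invented for illustration; to keep things short the charset simply delegates to ISO-8859-1, which a real provider would replace with its own encoder and decoder.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical provider. A provider installed into a runtime would have to be
// a public class with a public no-argument constructor.
class DemoCharsetProvider extends CharsetProvider {

    // Hypothetical charset that borrows the ISO-8859-1 coders (an assumption
    // made only to keep the sketch short).
    static final class DemoCharset extends Charset {
        private final Charset delegate = Charset.forName("ISO-8859-1");

        DemoCharset() {
            super("X-DEMO-LATIN1", new String[0]);
        }

        @Override public boolean contains(Charset cs) { return delegate.contains(cs); }
        @Override public CharsetDecoder newDecoder() { return delegate.newDecoder(); }
        @Override public CharsetEncoder newEncoder() { return delegate.newEncoder(); }
    }

    private final Charset demo = new DemoCharset();

    // Lists every charset this provider contributes.
    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(demo).iterator();
    }

    // Returns the charset for a name, or null if this provider doesn't know it.
    @Override
    public Charset charsetForName(String name) {
        return demo.name().equalsIgnoreCase(name) ? demo : null;
    }

    public static void main(String[] args) {
        // Called directly here; an installed provider is found by the runtime.
        Charset cs = new DemoCharsetProvider().charsetForName("x-demo-latin1");
        byte[] data = { (byte) 0xE9 }; // e with acute accent in ISO-8859-1
        System.out.println(cs.name() + " decodes to " + cs.decode(ByteBuffer.wrap(data)));
    }
}
```

The “identifying information” for this SPI is a text file named META-INF/services/java.nio.charset.spi.CharsetProvider inside the extension’s JAR file, containing the fully qualified name of the provider class.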

Unicode

The foundation of Java globalization is the Unicode character set – the Java platform now supports Unicode 4.0. Unicode has always endeavored to include all languages used on planet Earth, but a number of writing systems still have not been encoded. Encoding a writing system requires detailed knowledge about its use in real life as well as about how it would be processed in software, so it’s often difficult work. Supporting the Script Encoding Initiative may be the best way to help.
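As a small illustration of what Unicode support looks like at the API level, the following sketch uses the code point methods that came with J2SE 5.0. The class and method names are my own; the Gothic letter is just a convenient example of a character outside the Basic Multilingual Plane, which takes two char values in a Java String.

```java
class UnicodeDemo {
    // Number of Unicode code points in a string, as opposed to UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // The Unicode block a code point belongs to, or null if it is unassigned.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        String ahsa = "\uD800\uDF30"; // U+10330 GOTHIC LETTER AHSA as a surrogate pair
        System.out.println("char values: " + ahsa.length());
        System.out.println("code points: " + codePoints(ahsa));
        System.out.println("block of U+10330: " + blockOf(ahsa.codePointAt(0)));
        System.out.println("block of U+0E01: " + blockOf(0x0E01));
    }
}
```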

Character Encoding Conversion

While the Java runtime uses Unicode (more precisely, UTF-16) internally, much of the world’s data is stored in other character encodings. The JRE already supports a long list of character encodings, but the Java platform also provides an SPI that lets developers add any other encoding that may be needed.
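For readers who haven’t used the conversion APIs, here is a minimal sketch of moving data between a legacy encoding and the runtime’s internal representation with java.nio.charset.Charset. The class and method names are made up for this example; ISO-8859-1 and UTF-8 are encodings that every Java runtime is required to support.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class EncodingDemo {
    // Decodes bytes in the named encoding into a String (UTF-16 internally).
    static String decode(byte[] data, String charsetName) {
        return Charset.forName(charsetName).decode(ByteBuffer.wrap(data)).toString();
    }

    // Encodes a String into bytes in the named encoding.
    static byte[] encode(String text, String charsetName) {
        ByteBuffer buffer = Charset.forName(charsetName).encode(text);
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };           // e with acute accent, one byte
        String text = decode(latin1, "ISO-8859-1");
        byte[] utf8 = encode(text, "UTF-8");       // same character, two bytes
        System.out.println(latin1.length + " byte(s) in ISO-8859-1, "
                + utf8.length + " byte(s) in UTF-8");
        System.out.println("Shift_JIS available: " + Charset.isSupported("Shift_JIS"));
    }
}
```

Charset.forName throws an UnsupportedCharsetException for encodings the runtime doesn’t know, which is the gap that a provider installed through the SPI would fill.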

Locale and Currency Identification

The Locale class is currently based on the ISO 639-1 standard for languages and ISO 3166 for countries. ISO 639-1 covers about 200 of the most important languages, but estimates for the total number of languages on planet Earth run into the thousands (many of them near extinction). There’s still plenty of work to do to support just the ISO 639-1 languages, but using the three-letter language codes of ISO 639-2 or an extensible standard such as RFC 3066 (or its successor) may eventually be necessary to enable even broader coverage. For countries and currencies, the situation is simpler: The JRE knows all of them.
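The following sketch shows how these identifiers surface in the API: a Locale built from a two-letter ISO 639-1 language code and a two-letter ISO 3166 country code, and the ISO 4217 currency that the JRE associates with that country. The class and method names are hypothetical. The two-argument Locale constructor is the one available in the releases discussed here; much later releases deprecate it in favor of factory methods.

```java
import java.util.Currency;
import java.util.Locale;

class LocaleDemo {
    // Looks up the ISO 4217 currency code for the country of a locale.
    static String currencyCodeFor(String language, String country) {
        Locale locale = new Locale(language, country);
        return Currency.getInstance(locale).getCurrencyCode();
    }

    public static void main(String[] args) {
        Locale locale = new Locale("fr", "CA");
        System.out.println("language:  " + locale.getLanguage());     // ISO 639-1
        System.out.println("country:   " + locale.getCountry());      // ISO 3166
        System.out.println("3-letter:  " + locale.getISO3Language()); // ISO 639-2
        System.out.println("currency:  " + currencyCodeFor("fr", "CA"));
    }
}
```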

Date and Time Handling

The Calendar class was intended to enable support for all calendars used in the world, but it turned out that its design was hard to understand, difficult to subclass correctly, and not extensible enough for complete coverage (the Balinese calendars, for example, just don’t fit into the mold). Support for the world’s calendars has therefore been slow in coming: The JRE provides only the Gregorian, Thai Buddhist, and – starting with JRE 6 – Japanese calendars. A complete solution will likely require a new API/SPI combination. In the meantime, the ICU4J library provides a separate, somewhat incompatible Calendar class with several additional calendar implementations.
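To show what the existing calendar support looks like in practice, here is a small sketch that asks the Calendar factory for the calendar of a locale and reads the year for a fixed instant. The class and method names are my own. It assumes a Sun-derived runtime, where the th_TH locale selects the Thai Buddhist calendar, whose year numbering is 543 ahead of the Gregorian one.

```java
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

class CalendarDemo {
    // Year of the given instant in the default calendar of the given locale,
    // evaluated in UTC so the result doesn't depend on the machine's time zone.
    static int yearAt(long epochMillis, Locale locale) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"), locale);
        cal.setTimeInMillis(epochMillis);
        return cal.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        long instant = 1120176000000L; // 2005-07-01T00:00:00Z
        System.out.println("US:       " + yearAt(instant, Locale.US));
        // Assumption: the runtime maps th_TH to its Thai Buddhist calendar.
        System.out.println("Thailand: " + yearAt(instant, new Locale("th", "TH")));
    }
}
```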

For time zones, the situation is simpler: The JRE knows all of them. However, there is a little problem with keeping the information up to date: Politicians in some countries like to tinker with the daylight saving time rules, often with little advance notice, and so the time zone rules in the JRE don’t always match reality. More frequent updates of the JRE time zone data may be one solution; productizing the tool that we use to update the data may be another.
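A quick way to see which rules a given JRE believes in is to query them directly. The sketch below, with hypothetical class and method names, asks for the UTC offset and daylight saving status of a zone at a fixed instant; if the politicians have changed the rules since the JRE’s data was built, this is where the mismatch shows up.

```java
import java.util.Date;
import java.util.TimeZone;

class TimeZoneDemo {
    // UTC offset in whole hours at the given instant, including daylight saving.
    // Note: TimeZone.getTimeZone silently returns GMT for IDs it doesn't know.
    static int offsetHours(String zoneId, long epochMillis) {
        return TimeZone.getTimeZone(zoneId).getOffset(epochMillis) / 3600000;
    }

    // Whether the runtime's rules put the instant in daylight saving time.
    static boolean inDaylightTime(String zoneId, long epochMillis) {
        return TimeZone.getTimeZone(zoneId).inDaylightTime(new Date(epochMillis));
    }

    public static void main(String[] args) {
        long january = 1104537600000L; // 2005-01-01T00:00:00Z
        long july = 1120176000000L;    // 2005-07-01T00:00:00Z
        String zone = "America/Los_Angeles";
        System.out.println("January offset: " + offsetHours(zone, january));
        System.out.println("July offset:    " + offsetHours(zone, july));
        System.out.println("July in DST:    " + inDaylightTime(zone, july));
    }
}
```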

Names of Languages, Countries, Time Zones, and Currencies

The JRE has traditionally provided complete sets of these names and symbols for about 10 languages, and smaller sets for another 30 or so languages. A new SPI in Java SE 6 enables third parties to provide more. As I mentioned in my blog about this SPI, there’s the idea of creating an extension that uses the SPI to support all locales that the Unicode Consortium’s Common Locale Data Repository provides but the JRE doesn’t. This means that community members who want to extend the set of supported locales may have two ways to contribute: Directly by implementing an extension using the SPI, or indirectly by contributing data to the CLDR.
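The names in question are the ones returned by the getDisplay... methods and by Currency.getSymbol. Here is a minimal sketch, with made-up class and method names, that asks for the name of one locale in the language of another. When the runtime has no data for the requested display language, these methods fall back to other names or to the raw codes, which is exactly the gap that additional locale data would close.

```java
import java.util.Currency;
import java.util.Locale;

class DisplayNamesDemo {
    // Language and country name of 'subject', written in the language of 'display'.
    static String describe(Locale subject, Locale display) {
        return subject.getDisplayLanguage(display)
                + " (" + subject.getDisplayCountry(display) + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(Locale.GERMANY, Locale.ENGLISH));
        System.out.println(describe(Locale.GERMANY, Locale.FRENCH));
        System.out.println(describe(Locale.GERMANY, Locale.JAPANESE));
        // Currency symbols are localized too; outside its home locale a
        // currency is typically shown by its ISO 4217 code.
        Currency usd = Currency.getInstance("USD");
        System.out.println(usd.getSymbol(Locale.US) + " vs. " + usd.getSymbol(Locale.GERMANY));
    }
}
```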

Text Processing

The text processing functionality in the java.text package supports about 100 locales (more in JRE 6). The discussion of the SPI and CLDR in the previous section applies to this functionality as well.
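Two typical examples of this locale-sensitive functionality are number formatting and collation. The sketch below uses hypothetical class and method names, but only standard java.text calls; it formats the same number for two locales and contrasts locale-aware comparison with the plain code unit comparison of String.compareTo.

```java
import java.text.Collator;
import java.text.NumberFormat;
import java.util.Locale;

class TextDemo {
    // Formats a number using the grouping and decimal conventions of a locale.
    static String format(double value, Locale locale) {
        return NumberFormat.getInstance(locale).format(value);
    }

    // Compares two strings by the collation rules of a locale.
    static int compare(String a, String b, Locale locale) {
        return Collator.getInstance(locale).compare(a, b);
    }

    public static void main(String[] args) {
        System.out.println(format(1234567.89, Locale.US));      // 1,234,567.89
        System.out.println(format(1234567.89, Locale.GERMANY)); // 1.234.567,89
        // Collation sorts "a" before "B"; raw code unit order does the opposite.
        System.out.println(compare("a", "B", Locale.US) < 0);   // true
        System.out.println("a".compareTo("B") < 0);             // false
    }
}
```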

Text Input

Text input in the Java platform typically relies on the host operating system. For cases where the host OS’s facilities are insufficient, the input method engine SPI can be used to implement both full-fledged input methods and simple keyboard remappers. The JRE itself comes with input methods for Thai and Devanagari, Naoto’s article provides a few more, and several third-party input methods exist.

Text Rendering and Editing

The JRE currently supports 11 of the Unicode standard’s 60 writing systems. Extending this set is where the really hard problems are. Text rendering and editing are complex processes that don’t lend themselves to the creation of SPIs. They also require extensive testing, which is hard to automate. Testing has been the main bottleneck so far. Some of the currently “supported” writing systems don’t receive any systematic testing. Several additional writing systems (including the ones supported by our Indic input method) are implemented, but we don’t document them as supported because they’re complex and the risk of failures is too high. Community help therefore would be most useful in the area of testing.

User Interface Localization

The JRE’s user interface is currently provided in ten languages, the JDK tools in three. To enable others to provide additional localizations, we’re currently evaluating whether we can document the main user interface resource bundles and how to create and install additional ones. This wouldn’t be the same as an SPI – resource bundles are easier to create, but there would be no guarantee of compatibility from release to release as there is for an SPI.
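For those who haven’t worked with resource bundles, here is a rough sketch of how localized strings are looked up and how a partial translation falls back to the base bundle. Everything in it is hypothetical: the bundle name "DemoMessages", the keys, and the in-memory storage are invented so that the example runs on its own, and none of it reflects the JRE’s actual user interface bundles. It uses ResourceBundle.Control, which is new in Java SE 6; a real localization would more likely be installed as .properties files or ListResourceBundle classes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class UiBundleDemo {
    // Hypothetical bundle contents in .properties syntax, keyed by bundle name.
    static final Map<String, String> SOURCES = new HashMap<String, String>();
    static {
        SOURCES.put("DemoMessages", "ok=OK\ncancel=Cancel\n"); // base bundle
        SOURCES.put("DemoMessages_de", "cancel=Abbrechen\n");  // partial German bundle
    }

    // Loads bundles from the map above instead of searching the class path.
    static final ResourceBundle.Control IN_MEMORY = new ResourceBundle.Control() {
        @Override
        public ResourceBundle newBundle(String baseName, Locale locale, String format,
                                        ClassLoader loader, boolean reload) throws IOException {
            String text = SOURCES.get(toBundleName(baseName, locale));
            return text == null ? null : new PropertyResourceBundle(new StringReader(text));
        }

        @Override
        public Locale getFallbackLocale(String baseName, Locale locale) {
            return null; // don't fall back to the JVM's default locale
        }
    };

    // Looks up a user interface string for a locale.
    static String label(String key, Locale locale) {
        return ResourceBundle.getBundle("DemoMessages", locale, IN_MEMORY).getString(key);
    }

    public static void main(String[] args) {
        System.out.println(label("cancel", Locale.GERMANY)); // found in DemoMessages_de
        System.out.println(label("ok", Locale.GERMANY));     // inherited from the base bundle
        System.out.println(label("cancel", Locale.JAPAN));   // no Japanese bundle, base bundle used
    }
}
```

The fallback behavior is what makes resource bundles easy to contribute incrementally: a translation can cover part of the keys and still produce a usable, if mixed-language, user interface.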

Documentation and Community Interaction

Sun currently provides documentation about the Java platform primarily in English, with significant amounts translated into Japanese, small amounts into Chinese and Korean, and nothing in other languages. Input from the community is pretty much only accepted if it comes in English. It’s been recognized that these are serious problems, and some efforts are underway to provide more documentation particularly in Chinese. However, much more is needed to engage developers worldwide and forge a global community without language barriers.