Open Communication Requires Unicode

February 27, 2005

At this week’s Sun Engineering Conference, my contribution was a call to use Unicode everywhere.

What’s new about that? Hasn’t Sun been using Unicode for years? Yes, of course we have, because Sun requires all revenue products to be internationalized, and using Unicode is usually the first step in internationalizing modern software. Java has been based on Unicode since version 1.0, Solaris offers a wide range of UTF-8 locales (UTF-8 is a Unicode character encoding), StarOffice uses Unicode for text processing, and so on.

The problem lies in the systems that we use to communicate with customers and other partners but that aren’t considered products. These software tools and web applications often don’t use Unicode and so impose arbitrary restrictions on the languages that can be used. That’s bad, because Sun has partners worldwide and needs to speak and listen in their languages.

One negative example is our bug tracking system. This is not a creaky relic from the last millennium, but a brand-new system, developed within Sun, deployed in summer 2004, and using the whole range of modern software technologies. The main front end for Sun-internal use is a Java Web Start-deployed Swing application, which of course lets you input characters in any of the 14 writing systems supported by the Java platform. But if you try to save a bug report that contains, say, Chinese characters, you get this lovely alert:

Alert: One or more characters in the field Note: Description are not in the extended-ASCII character range. Please remove those characters.

The reason is that the back end system has been configured to use the ISO 8859-1 character encoding, which restricts all text to English and a few other western European languages. Text in any other language cannot be stored in any text field. The only workaround is to store it as a binary attachment, which makes it inaccessible to search and difficult to access in general.
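To make the restriction concrete, here is a minimal sketch of the kind of check that could sit behind such an alert. It is not the bug tracking system's actual code: the class name `Latin1Check` and the helper `fitsLatin1` are made up for illustration, and the sketch uses `StandardCharsets`, which arrived in Java releases later than the ones current when this was written. Non-ASCII characters are written as Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Latin1Check {
    // Hypothetical helper: true if every character in the text
    // can be represented in ISO 8859-1.
    static boolean fitsLatin1(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // "résumé": e-acute (U+00E9) is part of ISO 8859-1.
        System.out.println(fitsLatin1("Crash on r\u00e9sum\u00e9 import"));
        // U+4E2D U+6587 is the Chinese word for "Chinese"; not in ISO 8859-1.
        System.out.println(fitsLatin1("Crash on \u4e2d\u6587 import"));
    }
}
```

The first call reports that the text fits, the second that it does not. With a UTF-8 back end no such check would be needed, because UTF-8 can encode every Unicode character.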

As part of opening up its development processes, Sun also offers a front end for public use, the bugs.sun.com web site, which allows anybody to submit and track bug reports against a number of Sun products. Following the lead of the bug database, this web site also uses ISO 8859-1 for all text. In this case, if a user includes non-Latin text in a bug report, she usually won’t even get an alert – the browser will just silently convert the text to question marks or, if we’re lucky, to numeric character references, making it rather difficult for engineers to understand what the bug is about. The web site does not accept or display attachments, so the workaround used with the internal front end is not available to public users.
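The two lossy outcomes described above, and the lossless alternative, can be imitated in a few lines of Java. This is only a sketch of the effect, not of what any particular browser or the bugs.sun.com site does internally; the class `LossyConversion` and its method names are hypothetical, and the code assumes Java 8 or later.

```java
import java.nio.charset.StandardCharsets;

public class LossyConversion {
    // Encode to ISO 8859-1 and decode again. String.getBytes replaces
    // each unmappable character with the charset's replacement byte, '?'.
    static String throughLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // The same round trip through UTF-8 preserves every character.
    static String throughUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Replace each code point outside ISO 8859-1 with a decimal
    // numeric character reference such as &#20013;.
    static String toNumericReferences(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp <= 0xFF) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String report = "Bug: \u4e2d\u6587"; // "Bug: " plus two Chinese characters
        System.out.println(throughLatin1(report));       // Bug: ??
        System.out.println(toNumericReferences(report)); // Bug: &#20013;&#25991;
        System.out.println(throughUtf8(report).equals(report)); // true
    }
}
```

The question marks are unrecoverable. The numeric character references can at least be decoded by someone who notices them, which is why they count as the lucky case.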

The reason given for the restriction to ISO 8859-1 is that all Sun employees speak English, and therefore internationalization isn’t necessary. This obviously ignores that a bug tracking system isn’t just about Sun employees communicating with each other; it’s about customers and developers communicating with Sun employees about problems that customers have when processing their data. The data can be in any language that customers use, and so the bug tracking system needs to be able to represent text in any language. Removing “those characters” may make it impossible to investigate and fix a bug.

The language in which customers, developers and Sun employees communicate about bugs is a separate issue from the data. Sun obviously prefers such communication to be in English, because it makes it easier for us to pass the information around within Sun. However, if a customer speaks only Thai and runs into a problem in processing Thai text using Sun products, wouldn’t we prefer that she submit a bug report in Thai rather than immediately switching to a competitor’s product? Most Sun engineers don’t speak Thai, but some do, so we can get help if necessary – if the text survives in the bug tracking system.

The bug tracking system isn’t the only tool that obstructs communication with unnecessary technical restrictions. Other examples include the software behind the Java developer forums, which corrupts non-ASCII text and so makes it difficult for developers to discuss internationalization problems, and the developer feedback forms, which use ISO 8859-1 and so block feedback in non-Latin languages.

Sun has realized that open communication with customers, developers, and other partners is critical to its future success. Limiting the communication to English or a small set of other western European languages means limiting our success worldwide. That’s why using Unicode everywhere is an important first step. The site on which this blog appears, blogs.sun.com, shows the way: it enables daily communication between Sun and the world in Chinese, English, and Japanese, and occasional blogs in French, Hungarian, German, and Korean show that more is possible. All components of blogs.sun.com, of course, use Unicode.