Yuccarama! II

Just the other day, we engaged in a glorious eyeglazer of an E-interview with Jukka Korpela on the topic of multilingual Web authoring. (Hot-hot-hot!)

We promised a continuation of the conversation, and by gar here it is.

NUblog: Purely from an issue of character compatibility (making the character you intend show up on absolutely everyone’s browser), what kind of improvements do we need in the technical infrastructure – everything from authoring programs to standards to fonts?

Korpela: I’d say the goal of making characters show up on absolutely everyone’s browser is fundamentally unrealistic; it needs an addition like “...or degrade gracefully.” The variation of browsing situations will increase, and quite a few of the new situations have fairly limited character display possibilities. The “old” character repertoire of mobile phones is not much larger than ASCII; as protocols and devices are developed, the situation becomes better, but during that process, some other browsing modes will emerge.

This is something that requires attention at the level of application protocols. If you use small images in HTML to present special characters, you can specify a textual alternative to be used when the image is not shown, e.g. <img src="Omega.gif" alt="ohms">, but if you use a notation that denotes a special character itself, like &Omega; or &#937;, then there’s nothing similar you could use. So graceful degradation is possible for a clumsy workaround but not for the real solution. This is basically a matter of markup, which means that it’s a tough question.
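To make the asymmetry concrete, here is a minimal Python sketch of the kind of fallback that markup does not offer for character references. It is not anything HTML or a browser provides: the FALLBACKS table and the degrade function are hypothetical names invented for illustration, and the sketch assumes the author knows in advance which character repertoire the target device can handle.

```python
# Hypothetical fallback table: nothing in HTML defines these mappings.
# The author would have to supply them, just as with alt text on images.
FALLBACKS = {
    "\u03a9": "ohms",   # GREEK CAPITAL LETTER OMEGA
    "\u00b5": "micro",  # MICRO SIGN
}


def degrade(text: str, encoding: str = "ascii") -> str:
    """Replace characters the target encoding cannot represent.

    Characters that encode cleanly are kept as they are. Anything else
    is swapped for its entry in FALLBACKS, or for "?" if there is none.
    """
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(FALLBACKS.get(ch, "?"))
    return "".join(out)


if __name__ == "__main__":
    print(degrade("10 \u03a9"))            # ASCII-only target
    print(degrade("10 \u03a9", "utf-8"))   # target that can show Omega
```

With an ASCII-only target the Omega is replaced by the word from the table; with a UTF-8 target the text passes through untouched. The point of the sketch is only that the fallback has to live somewhere outside the character reference itself.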

Otherwise, the standards are mostly not the problem in this area. The process of getting a character displayed is very complicated indeed, consisting of a long chain of things, and standards are perhaps the strongest part of the chain.

Considering the majority of users, the weakest part seems to be the font problem. If Microsoft just decided to include a rich set of fonts, including at least two fairly comprehensive Unicode fonts (monospace font and proportional font), into the normal Windows distributions, into the configuration that gets installed by default, this would probably have greater effect on the situation than any other action I can imagine. I really don’t know why Microsoft doesn’t do that; disk space cannot be the issue these days, and Microsoft already distributes a fairly comprehensive font (Arial Unicode MS) in a manner which is close to free distribution.

What needs to be done on the authoring side is basically the integration of existing technologies into mainstream software. There are several good Unicode editors as far as basic editing is concerned, such as UniPad. But such functionality should be part of the usual editors and other programs that people use.

NUblog: How do we explain the general ignorance on the part of English-speaking Web designers, authors, and sites – and the companies behind them – of localization and internationalization? Why are English-speakers generally clueless about the need to create pages in other languages, and why, when they give it a whirl, do they do such a bad job?

Korpela: All people are born ignorant, and we all remain ignorant in most matters. People who live in a culture where it is normal to speak one language only, especially if that language happens to be understood to some extent outside that culture too, will have difficulties in getting the big picture. Besides, the people we discuss here are experts in various fields and need to learn a lot of technical things. Understanding the variety of human culture isn’t part of their professional skills. One might say that they just haven’t got time to learn languages, except perhaps so-called computer languages, even at a primitive level. In fact, just knowing some other languages, such as French, Spanish, and German, doesn’t really paint the picture; in addition to being structurally Indo-European languages themselves, they use writing systems very similar to that of English and can, roughly speaking, be written using ISO 8859-1, which is the de facto default character code on the Web anyway.
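The “roughly speaking” about ISO 8859-1 can be checked with a few lines of Python. This is only an illustrative sketch: fits_latin1 is a made-up helper name, and the sample words are our own choices rather than anything from the interview.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Sample words chosen for illustration only.
    for word in ["Straße", "señor", "été", "œuvre", "Ελληνικά", "中文"]:
        print(word, fits_latin1(word))
```

German, Spanish, and most French text fits; the French ligature in “œuvre” does not, which is one reason for the hedge, and Greek or Chinese text falls outside the repertoire entirely.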

In fact, it is not uncommon to see that people outside the English-speaking world don’t understand the “internationalization” issues either. If you know how to deal with problems of using your own native language on the Web, you are probably not very interested in “internationalization.” After all, “internationalization” in this context relates to issues of using national languages! International Web pages, which are mostly written in English, are seldom an “internationalization” problem.

So when a situation arises where English-speaking people need to deal with pages in languages other than English, they typically have to do that without actually knowing much of the other languages. And they tend to underestimate the difficulties rather seriously; they’re looking for quick solutions, and they can’t really judge whether proposed solutions are actually correct. But since they do know how to tell a quick solution from a slow one, they pick something quick; hence all the fontistic hacks, the use of PDF format instead of HTML, and the use of Save As Web Page without understanding a bit of what’s going on or what will happen.

NUblog: What kind of future is in store for users of very rare or unusual languages on the Web? Think of a few circumstances – minority languages in a country where the language is endangered and ancestral (e.g., Irish); minority languages that are healthy but non-mainstream (e.g., Chinese in Canada, Korean in the U.S.); minority languages that are the national languages of small countries (e.g., Estonian, Turkish). Even the basic authoring tools are often poor for such speakers, unless we include Save As Web Page in Microsoft Word.

Korpela: (I don’t quite understand all the details of the question. I never thought of Turkish as a small language or Turkey as a small country. But I take the question at a general level, without going much into the examples.)

Well, the future is surely better than the past or the present. The Web as a medium is excellent for publishing material that interests a relatively small but active group of people. In particular, geographic limitations don’t matter much, unless reflected in connectivity. So especially for small languages spoken at different places, the Web could be the only feasible publication medium.

It is true that authoring tools don’t support small languages (or have such support added much later than support for big languages), for obvious reasons, but we can live with that.

Character problems are of practical importance, of course, for quite a few languages. It is not uncommon to see Greek people publish Web pages mainly for Greek people but in English, or even in a home-made transliteration of Greek writing into Latin letters. Chinese students in Western universities will encounter the problem that the university’s computers haven’t even got fonts with Chinese characters, not to mention software tools for typing Chinese. Some very small languages have particular problems with character codes: they might contain characters that do not belong even to Unicode, perhaps because some 19th-century missionary invented a letter for the language. (Actually, most such letters are combinations of basic Latin letters and some diacritic marks, so that the character can in principle be presented as the base letter followed by a non-spacing mark, and some browsers even have some simple support for this idea.)
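The “base letter followed by a non-spacing mark” idea can be tried out with Python’s standard unicodedata module. The sketch below is ours, not Korpela’s: describe is a hypothetical helper, and q with a tilde is just a convenient example of a letter that has no single precomposed code point in Unicode.

```python
import unicodedata


def describe(text: str) -> list:
    """List each code point and whether it is a combining mark."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch) != 0) for ch in text]


# "e" followed by COMBINING ACUTE ACCENT: Unicode also has a precomposed
# form, so NFC normalization folds the pair into one code point.
e_acute = unicodedata.normalize("NFC", "e\u0301")

# "q" followed by COMBINING TILDE: there is no precomposed form, so the
# letter can only be written as base letter plus non-spacing mark.
q_tilde = unicodedata.normalize("NFC", "q\u0303")

if __name__ == "__main__":
    print(describe(e_acute))
    print(describe(q_tilde))
```

Whether the second sequence actually displays as one letter with the tilde sitting on top depends on the font and the rendering software, which is the “simple support” caveat in the answer above.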

More with Korpela eventually, after we think of some further suitable questions. It is rather like interviewing Anthony Burgess at times.

Posted on 2001-07-12