Multilingualism amateur hour

Dedicated NUblog readers will be all too familiar with our persistent yammering about localization and multilingualism. It may come as a surprise to hear that this little Weblog is now a leading source for information on the topic. This says nothing about the completeness and breadth of our own offerings and everything about the paucity of knowledge available elsewhere. As with other forms of accessibility and as with the field of translation itself, pretty much everyone is home-schooled, and there is no systematic knowledge of practical, real-world ways to translate a Web site or run a site in multiple languages.

We will not be adding “Developing Website for Users of Languages Other Than English” by Nagia M. Ghanem to the list of credible non-NUblog sources anytime soon. The title itself exemplifies a pitfall of substandard localization: slightly off, nonnative phraseology. It goes downhill from there.

This academic paper is all too reminiscent of treatments by amateurs (on any number of our perennial obsessions) who pat themselves on the back for muttering something or other about a particular issue, all the while oblivious to the fact that they don’t know a damned thing about it. It is a form of name-dropping: “Look what I noticed is a problem. Men, don’t let this happen to you!”

You would never put up with this in, say, plumbing or getting your brakes fixed. You may have heard of snakes and know that some pipes are now plastic, and you have probably poured drain cleaner down your sink, but at some point you admit what you don’t know and call in an expert. You may be aware that your car has front disc and rear drum brakes (what are “drum brakes,” exactly?), but you wouldn’t dream of fixing them yourself.

You use your sink and your car every day. You use the Web every day. What makes you think you are an expert on the back-office programming of the Web – particularly for something as obscure as localization?

Example:

The English character set consists of only 26 letters, so using the ASCII code is sufficient to encode the English alphabet or the Latin alphabet in general. However, some languages are ideographic and use unique character to represent each different idea. The number of characters that comprise ideographic languages, several thousands, is far greater than the number of unique eight-bit combinations that can represent these characters. So, to be able to encode web pages in these languages or web pages containing multiple languages, another encoding method than ASCII should be used. This encoding method should have the ability to represent multilingual plain text to overcome the difficulty of exchanging text files internationally.

We suppose this is the CNN® Headline News® version of the character-set issue, but – surprise! – this is the Web and we have unlimited space to expound in any necessary detail, so where’s the tofu?

First, even the English language cannot be written in 26 characters. There are actually 52 letters (26 uppercase, 26 lowercase), ten digits, ten mandatory punctuation marks (comma–period–slash–question mark–colon–semicolon–exclamation point–parentheses–hyphen), two to six quotation marks, two dashes (more in mathematics), and a vast range of punctuation marks which, while rare, are legal in English (brackets–@–#–$–¥–%–&–*–™–®–£–©–¢–§–¶–•). Gerard Unger runs a hilarious slideshow illustrating this point.
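For the skeptics, here is a minimal Python sketch of the point. It takes the rare-but-legal marks listed above and checks which of them fall outside 7-bit ASCII. The helper name `non_ascii` is ours, invented for illustration; nothing here comes from Ghanem's paper.

```python
# The rare-but-legal marks listed in the paragraph above.
marks = "@#$¥%&*™®£©¢§¶•"


def non_ascii(text):
    """Return the characters in `text` whose code points lie outside 7-bit ASCII."""
    return [ch for ch in text if ord(ch) > 127]


outside = non_ascii(marks)

# Which marks cannot be written in ASCII at all?
print("".join(outside))

# How many bytes does each one need in UTF-8?
for ch in outside:
    print(f"U+{ord(ch):04X}", ch, len(ch.encode("utf-8")), "bytes")
```

More than half of the marks on that list fail the ASCII test, and that is before we get to curly quotation marks and dashes.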

Where, then, is the advice on how to solve the problem? It comes later, and boils down to a vague admonition to use Unicode.

Wow. Thanks for all the help.

It’s just one thing after another with this paper.

Amazon is bruited as a well-designed international interface. As if!
“To display these different text directions, a method should be used to specify the direction. This may be in the form of a HTML tag or style.” OK. Which HTML tag? What “style”? (A skimpy explanation comes up later. But nothing is simple in right-to-left text.)
“It is a very common misconception that if one can simply map one-to-one from a character code to a glyph image, nothing more is required. This is not true: in some languages, a single character may have multiple glyph images even within the roman languages.” Um, for the punters out there, what is a glyph? It’s a specific visual representation of a character. In English, fi is usually two glyphs, or one if you use a ligature. Greek lowercase sigma has two glyphs, one of them (ς) restricted to the ends of words. If we can sum it up compactly, why can’t Ghanem?
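The character-versus-glyph distinction can be poked at from the encoded-text side with Python's standard `unicodedata` module. This is only a sketch, with one caveat worth stating loudly: real glyph selection happens in the font and the rendering engine, not in the string. What Unicode does offer is a compatibility code point for the fi ligature and, as it happens, two separate code points for the two sigmas.

```python
import unicodedata

FI_LIGATURE = "\ufb01"   # one code point that draws as a single "fi" glyph
MEDIAL_SIGMA = "\u03c3"  # the sigma used inside a word
FINAL_SIGMA = "\u03c2"   # the sigma used at the end of a word

# Compatibility normalization (NFKC) folds the ligature back into two letters.
folded = unicodedata.normalize("NFKC", FI_LIGATURE)
print(unicodedata.name(FI_LIGATURE), "->", folded, len(folded))

# Unicode treats the two sigmas as distinct characters with distinct names.
print(unicodedata.name(MEDIAL_SIGMA))
print(unicodedata.name(FINAL_SIGMA))

# Lowercasing an uppercase Greek word has to pick the right sigma by position.
word = "\u039f\u0394\u039f\u03a3"  # uppercase omicron, delta, omicron, sigma
lowered = word.lower()
print(lowered, lowered.endswith(FINAL_SIGMA))
```

Even this toy shows why a one-to-one mapping from character code to glyph image does not hold up: the same two letters can be stored as one code point or two, and the same uppercase sigma lowers to different forms depending on where it sits in the word.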

In a pro forma concession to fairness, we note that the advice is not all bad. The following is reasonable: “English is a very compact language, and text almost inevitably expands in translation. Running text may expand by around 30% on average in European languages. Short labels or single words, however, can easily expand by 200%–300%. Designers need to consider, early in the design phase, how they will deal with text expansion.” Ghanem offers some techniques to fix that problem.
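To put numbers on the quoted figures, here is a rough Python sketch of a text-expansion budget. The function name `expansion_budget` and the sample strings are hypothetical, and the percentages are simply the ones quoted above, not measurements of our own.

```python
def expansion_budget(english_chars, percent):
    """Characters to reserve if an English string grows by `percent` percent.

    Uses integer arithmetic and rounds up, so the budget is never too small.
    """
    return (english_chars * (100 + percent) + 99) // 100


# Running text: the quoted average of about 30% for European languages.
paragraph_chars = 1000
print(expansion_budget(paragraph_chars, 30))

# Short labels: the quoted 200% to 300% range, applied to a six-letter button.
label = "Search"
print(expansion_budget(len(label), 200), expansion_budget(len(label), 300))
```

A six-letter button label may need room for 18 to 24 characters by this reckoning, which is the kind of thing a designer wants to know before the layout is frozen.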

Yet we feel all sticky with filth after reading this treatise. It’s as if we had given inadvertent credence to some “analyst” who spent a week of billable hours surfing the Web on a specific topic and proudly presented her (sic) half-arsed research, making her a de facto expert in the eyes of gullible clients and reporters.

And is the piece entirely original? Ghanem cites but does not quote a much more colloquial paper by Andrew Cunningham that succeeds as a result of its modest “Here’s how you can learn from our experience” style. Compare and contrast!

Ghanem: “Restrict font solutions to those available for free from the Internet. If you use a solution that requires your target audience to obtain commercial software, then it is not likely that your Web site will be used.” (Bumbling malapropisms continue from there.)
Cunningham: “On community sites I’d restrict myself to font solutions that are available for free from the Internet. If you use a solution that requires your target audience to obtain commercial software, then its not likely that your Web site will be used.”

Coincidence... or Chariots of the Gods?

When it comes to multilingual Web content, we at NUblog are proud to stand head and shoulders above dreck like this. Arrogant of us? Yes. (“Self-important,” shurely?!) But all the evidence shows we can walk the walk.

Posted on 2001-05-31