Joe Clark: Accessibility ¶ Design ¶ Writing

Borked Unicode: Tips for journalists on writing clean copy

Good journalists file clean copy. But what does that mean? This article gives journalists the basics they need to know to ensure that every character, word, sentence, and paragraph they intended to write gets correctly saved and reproduced on computer systems, and ultimately online and in print. You’ll learn enough to avoid the dreaded borked Unicode.

Writing clean copy

Clean copy is an established concept in print journalism. Conservatively, the term refers to text that is spelled and punctuated correctly and makes sense. As an editor who was formative in my development, the late Sid Adilman of the Toronto Star, put it, the goal is to write an article that “reads well.” Clean copy is a prerequisite for that.

But I want to expand the definition to encompass character encoding. Your copy can’t be considered “clean” unless and until it is stored and reproduced correctly. Getting character encoding right is an absolute necessity for working print journalists, which is all well and good except for the fact that nobody has ever bothered to teach journalists what character encoding is.

What you’re going to learn

By the time you’re done reading this article, you’ll have the knowledge of character encoding that gives you confidence you can write clean copy, plus the beginnings of the muscle memory and good habits you need to actually do it.


Basic facts

The concepts involved in producing clean copy can extend way out to the horizon, but journalists don’t need to worry about expert-level details. Here is the complete list of facts you need to know.

There. Now you’re up to speed. For the working hack, it really is that simple.

Just type the character

The most important advice is the easiest: Just type the character you want. You may need to learn how to type it, but I’m going to teach you how. You may need to copy and paste it from another source. But the point is to use the character you want right in your document. And do that everywhere – hed, dek, body copy, in RSS, on Twitter.

There are rare exceptions to this rule. When the character your system displays can be confused with something else or is simply invisible, as in the case of whitespace or non-breaking hyphen (see below), you need to enter a character entity, which uses a sequence of other characters to escape the character you actually want. In these cases, you’re specifying the character by an agreed-upon name or by its Unicode number. You do that by starting the name or number with an ampersand and ending it with a semicolon. Some examples, purely for illustration purposes:

You have to know this troublesome implementation detail because it is the only way to reliably enter and edit the few characters that demand this approach. Absolutely do not use this method to enter what you think are “special characters” in the day-to-day run of your work as writer or editor.
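To make the entity syntax concrete, here is a minimal Python sketch using the standard library’s html.unescape. The choice of em dash and non-breaking space as sample characters is mine, purely for illustration. It shows a named entity, a decimal number, and a hexadecimal number all resolving to the same character.

```python
import html

# Three spellings of the same character, an em dash (U+2014):
# a named entity, a decimal numeric entity, and a hexadecimal numeric entity.
forms = ["&mdash;", "&#8212;", "&#x2014;"]

decoded = [html.unescape(form) for form in forms]

# Every form decodes to the one character you could simply have typed.
for form, char in zip(forms, decoded):
    print(f"{form:10} is U+{ord(char):04X}")

# Invisible characters are where entities earn their keep:
# a non-breaking space looks like a plain space until you escape it.
nbsp = html.unescape("&nbsp;")
print(f"&nbsp; is U+{ord(nbsp):04X}")
```

The em dash is in this sketch only because it is easy to see. In real copy you would type it directly and save entities for the invisible cases.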

What can go wrong?

I want to make sure you know what I’m talking about, so I’m going to show you a few errors of character encoding. After you finish this article, you will be in a position to avoid borked Unicode like this for the rest of your career.
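For the curious, here is a hedged Python sketch of one common way copy gets borked: text saved as UTF-8 is later read as if it were Windows-1252. The function name bork and the sample words are my own, and real systems can mangle text in other ways too.

```python
def bork(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


# One apostrophe turns into three junk characters.
print(bork("don\u2019t"))

# Each accented letter turns into two.
print(bork("r\u00e9sum\u00e9"))
```

If you have ever seen a capital A with a tilde or a stray euro sign in the middle of a word, this kind of mismatch is a likely culprit.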

If you need a poster child for character encoding, this is it

In the English language, the giveaway character that can conclusively prove your copy is dirty is this: ‘ (opening single quotation mark). Why?

If you can’t get opening single quotation mark right, you probably can’t get anything right that isn’t a nice easy letter or a number.

Start using a good editor

You can bang out copy in Microsoft Word if you want. With modern versions of MS Word, that copy will always use Unicode. Problem solved? No, because Word is borderline useless when it comes to fixing someone else’s copy.

At a minimum, you need to be able to do all of the following:

You can gin up macros to do most of these things in MS Word, but you have better things to do.

Maybe you associate writing on a computer with “word processing,” and associate word processing with the market leader, MS Word. But here I am strongly recommending you update your workflow to use a real text editor. You may not even know this category of software exists, but it does, and it is mature and can do everything you need.

Unreadable onscreen type leads to copy errors. Always use nice big fonts (16 pixels minimum in typical cases), and unless you know you need to, don’t use monospaced fonts, especially not Courier.

What characters do people get wrong?

Apostrophes and quotation marks

Neutral versions of either of those (' and ") do not cut it anywhere, at any time, outside of programming and markup.

When quotation marks follow each other, or when an apostrophe sits inside a quotation mark, you have to use at least a thin space to separate them, as described below.
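If you want a machine to point out neutral quotes in prose before you fix them by hand, a small sketch like this one would do. The function name find_neutral_quotes is hypothetical, and it assumes you are feeding it prose, not programming code or markup, where neutral quotes are legitimate.

```python
import re

# The two neutral characters: apostrophe (U+0027) and quotation mark (U+0022).
NEUTRAL_QUOTES = re.compile(r"['\"]")


def find_neutral_quotes(text: str) -> list[tuple[int, str]]:
    """Return (offset, character) for every neutral quote found in text."""
    return [(m.start(), m.group()) for m in NEUTRAL_QUOTES.finditer(text)]


sample = 'He said "no" and didn\'t budge.'
for offset, char in find_neutral_quotes(sample):
    print(f"neutral {char} at offset {offset}")
```

The sketch only finds the problem. Deciding whether each one should be an opening mark, a closing mark, or an apostrophe is still your job.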

Accents and diacritics

By consensus, we don’t have to write foreign words in their own script if it isn’t Latin script. We don’t have to write Москва for Moscow and 日本 for Japan. Foreign proper nouns with accepted English spellings, like Cologne (for Köln, Germany), don’t have to be changed.

But one source of unclean copy is the belief that a word that contains accents or diacritics is weird and foreign and that the accents are optional. Wrong. Accents or diacritics are intrinsic to correct spelling. Just as cant and wont are words that differ from can’t and won’t, resume and résumé are two different words.

In case I’m not making myself clear, leaving out accents means you’ve misspelled the word.

To write clean copy, you have to know what accents are called. You can’t use fake French names for accents; all of them already have English names. You cannot use vague, impressionistic, guaranteed-to-be-misunderstood descriptors like “the one that goes from left to right.”

Why learn the right terminology? You can’t work in ignorance, first of all, but more importantly, you have to be able to correct someone’s copy, or instruct someone on how to enter text correctly, over the phone, in person, or via chat – anywhere you can’t actually draw a character on a printout. (“No, the first letter is cap E acute. You sent me grave accent. Fix it.”)
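While you are still learning the names, you can ask the computer what a character is called. This is a minimal Python sketch using the standard unicodedata module. The helper name describe is my own. Unicode’s official names are wordier than what you would say over the phone, but the accent name is right there at the end.

```python
import unicodedata


def describe(char: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


print(describe("\u00c9"))  # cap E acute
print(describe("\u00c8"))  # cap E grave
```

Treat this as training wheels. The goal is still to name the character on sight.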

Important single letters from other languages:

That isn’t the full list of diacritics or letters, but the point is you must be able to name these letters and accents on sight, cold, without a cheatsheet. Do you think they’ll never come up in the copy you write? Well, maybe they won’t, but they’ll come up in copy you read and possibly in copy you edit.

The idea that accents are not really an integral and mandatory component of good copy – it’s just so passé.

Hyphens and dashes

You need to know when to use and how to enter an en dash (between these letters: a–b) and an em dash (between these letters: a—b).


Whitespace

The standard wordspace (what you get when you press the spacebar on your computer) is noncontroversial in the hack context.

Among the many available whitespace characters:

These space characters are tricky enough that you should use character entities for them. Why? So you can identify them with certainty in source code. Two normal wordspaces look a lot like an em space. Of course no one writes two consecutive spaces in normal English prose, but HTML source code and many other contexts permit two consecutive wordspaces, collapsing them to a single wordspace. To tell those apart from an intended em or en space, use a character entity for them.

For regular Web pages, just use &emsp; and &ensp; for em and en space. Thin space is, not surprisingly, &thinsp;. Non-breaking space is &nbsp;.
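If somebody hands you text where these spaces were typed as literal characters, a sketch like the following makes them visible by swapping in the standard HTML entity names for em space, en space, thin space, and non-breaking space. The table name SPACE_ENTITIES and the function name reveal_spaces are hypothetical, and the table covers only those four characters.

```python
# Literal whitespace characters mapped to their HTML named entities.
SPACE_ENTITIES = {
    "\u2003": "&emsp;",    # em space
    "\u2002": "&ensp;",    # en space
    "\u2009": "&thinsp;",  # thin space
    "\u00a0": "&nbsp;",    # non-breaking space
}

_TABLE = str.maketrans(SPACE_ENTITIES)


def reveal_spaces(text: str) -> str:
    """Replace the four special spaces with entities; leave wordspaces alone."""
    return text.translate(_TABLE)


print(reveal_spaces("a\u2003b\u00a0c d"))
```

Ordinary wordspaces pass through untouched, so anything that still looks like a gap in the output really is a plain space.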


Arrows

Dozens of arrow characters exist in Unicode and you should never use sequences of punctuation in their place. To show four of many options:
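As a hedged illustration of cleaning up faked arrows, here is a small Python sketch. The four arrows chosen, the list name ARROW_FAKES, and the function name fix_arrows are my own picks. It assumes prose, not programming code, where a sequence like -> may be meaningful and must be left alone.

```python
# Punctuation fakes and the real characters that replace them.
# The longest fake comes first so it is not split by a shorter match.
ARROW_FAKES = [
    ("<->", "\u2194"),  # LEFT RIGHT ARROW
    ("->", "\u2192"),   # RIGHTWARDS ARROW
    ("<-", "\u2190"),   # LEFTWARDS ARROW
    ("=>", "\u21d2"),   # RIGHTWARDS DOUBLE ARROW
]


def fix_arrows(text: str) -> str:
    """Replace common punctuation stand-ins with real arrow characters."""
    for fake, real in ARROW_FAKES:
        text = text.replace(fake, real)
    return text


print(fix_arrows("A -> B <- C <-> D => E"))
```

Longer fakes with repeated hyphens are not handled here, so check the result by eye.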


Fractions

Many commonly-used fractions are available as predefined Unicode characters. There is no reason to write something like 3 1/4 ever again. (Also inadvisable because, somewhere along the way, a line will break between the integer and the fraction.)
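Here is a sketch of how you might clean up typed-out fractions in existing copy. It handles only quarters and halves, and the names VULGAR_FRACTIONS and fix_fractions are hypothetical. It also closes up the space after the integer, which removes the place where a line could break.

```python
import re

# Only three of the predefined fraction characters, for illustration.
VULGAR_FRACTIONS = {
    "1/4": "\u00bc",
    "1/2": "\u00bd",
    "3/4": "\u00be",
}

# An optional integer plus one space, then a fraction that is not
# part of a longer number or a slash-separated date.
PATTERN = re.compile(r"(?<![\d/])(?:(\d+) )?([13]/[24])(?![\d/])")


def fix_fractions(text: str) -> str:
    """Replace typed fractions like '3 1/4' with single fraction characters."""
    def swap(match: re.Match) -> str:
        whole, fraction = match.group(1), match.group(2)
        char = VULGAR_FRACTIONS.get(fraction)
        if char is None:  # e.g. 3/2, which has no predefined character here
            return match.group(0)
        return (whole or "") + char

    return PATTERN.sub(swap, text)


print(fix_fractions("Add 3 1/4 cups of flour and 1/2 cup of sugar."))
```

Dates such as 11/12/2011 are deliberately left alone, but any automated pass like this deserves a read-through afterward.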

Subscripts and superscripts

Do not try to fake subscript or superscript numerals by using smaller font sizes. Also don’t give up and just write a normal number inside parentheses or brackets. Use the actual Unicode numerals:

Superscribed and subscribed letters and other characters are barely available in Unicode. This is another way of saying you can rely only on sub/super numerals, not letters or anything else.
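Because only the numerals are dependable, a conversion helper can be very small. This Python sketch maps the ten ordinary digits to their superscript and subscript counterparts. The function names superscript and subscript are my own.

```python
# Ordinary digits mapped to Unicode superscript and subscript digits.
SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)
SUBSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)


def superscript(number: int) -> str:
    """Return the number written in superscript digits."""
    return str(number).translate(SUPERSCRIPT_DIGITS)


def subscript(number: int) -> str:
    """Return the number written in subscript digits."""
    return str(number).translate(SUBSCRIPT_DIGITS)


print("E = mc" + superscript(2))
print("H" + subscript(2) + "O")
```

Note that superscript one, two, and three sit in a different part of Unicode from the other seven, which is why the table is spelled out instead of computed.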

Basic symbols

It shouldn’t surprise you that basic symbols exist in Unicode. As with arrows, you shouldn’t use other characters to fake them.

Muscle memory and habits

You can easily find guides on the Web about typing “special characters.” These won’t help you at all. True, you do need to learn the technical basics about how to enter a character that isn’t printed on your keyboard. But what all those online guides don’t tell you is you need to develop muscle memory and habits.

To form muscle memory and habits, you need practice, practice, practice. The way to do this is to write a lot of copy using characters that aren’t simple letters and numbers. Don’t do that often? Then put some time into your professional development and carry out exercises like these.

At all times use correct quotation marks and em and en dashes even if not in the original (a skill to develop in itself). These exercises will begin the process of instilling the muscle memory of typing the right characters.

If you aren’t a ten-finger touch-typist, do the same exercises. Even if you hunt and peck, you have your own muscle memory to develop.

Three ways to enter a character

Excluding speech recognition and other edge cases, you have three options for entering any character.

And here’s how you absolutely are not going to enter a character:

This is not a time for “platform equivalency”

Typically but not universally, print publications are all-Mac shops. Online publications, and freelancers and independent bloggers, very often use Macs.

These people have nothing to worry about. It is dead simple to write, save, and transmit clean copy on Macs. Important keystrokes on Mac have not changed since the introduction of the Macintosh in 1984 and it is readily possible for a ten-finger typist to touch-type them.

The problem here is Windows. You need to have a lot of Unicode knowledge even to debate the issue, but just take my word for it on two counts here, and accept that I am not saying this just because I am a Macintosh supremacist.

One explanation for why you see so many incorrect characters online is simple: The writers are on Windows, and even if they know the right character, they pretty much can’t figure out how to type it.

So: This is not a time for platform equivalency. Use of Microsoft Windows prevents people from producing clean copy in the real world. Hence they over-rely on smart quotes, but that just means the system beats you up twice – once in your inability to type a character, once more when the computer guesses wrong.

Solution for Windows users

There is a reasonable fix for professional writers using Windows: Turn on the U.S. International keyboard layout. Do that and all of a sudden you have the equivalent of a Mac keyboard layout and nearly everything just works.

Common errors you need to fix

When somebody hands you copy, you need to do all the following to ensure it is clean:

Exporting from Word and PDF

You will receive countless MS Word files and PDFs. These are a leading cause of borked Unicode, especially when block quotations from these sources are plunked inside correctly-encoded text.
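Before you paste a quotation from one of these sources into clean text, a quick scan can catch the most obvious damage. The following Python sketch is a rough heuristic of my own, not a complete detector. The names SUSPECTS and flag_suspects are hypothetical, and the capital A with tilde check will raise false alarms on legitimate text in languages that use that letter.

```python
# Telltale sequences and a plain-language guess at what each one means.
SUSPECTS = {
    "\ufffd": "replacement character: the original bytes could not be decoded",
    "\u00e2\u20ac": "probably a curly quote or dash read in the wrong encoding",
    "\u00c3": "probably an accented letter read in the wrong encoding",
}


def flag_suspects(text: str) -> list[tuple[int, str]]:
    """Return (line number, reason) for each line containing a suspect sequence."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for marker, reason in SUSPECTS.items():
            if marker in line:
                findings.append((lineno, reason))
    return findings


pasted = "A clean first line.\nIt\u00e2\u20ac\u2122s a borked second line."
for lineno, reason in flag_suspects(pasted):
    print(f"line {lineno}: {reason}")
```

A clean report does not prove the copy is clean. It only means none of these three patterns turned up.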

What goes wrong?

The following workflow avoids most mis-encodings from Word and PDF.

Common errors on specific platforms

Policy changes

I believe every publication, including every individual blogger, should specifically invite corrections of typos and copy errors. A lot of sites do that already, but I think it should be universal. And the invitation to submit corrections should specifically mention that readers can report characters that don’t show up properly. In other words, you or your publication should have a stated policy inviting people to report character-encoding errors in the same way they’d report any other copy error.

Published 2011.11.14 ¶ Updated 2018.07.23
