Common issues in SGML for access
by Joe Clark
- Let's consider the interplay of audio, video, and text in access technologies:
- In captioning, we start with an audio stream and turn it into text by transcribing it. We then modify that text to a presentable form based on conventions and rules and produce a final textual output on-screen (or in some other place). Thus captioning starts with speech, works with text at the intermediate stage, and outputs text.
- In subtitling, we start with an audio stream and turn it into text by transcribing it (typically), then, at the intermediate stage, translate that written text into another written text in the target language (typically). We then modify that text to a presentable and linguistically optimized form and produce a final textual output on-screen. Thus subtitling starts with speech, works with text in two languages at the intermediate stage, and outputs text.
- In audio description, we start with an audio-visual stream to which written annotations are added. That text is checked, timed, refined, clarified, and so on, turning it into a script. During a recording session, the script is turned into audible words, which are its final form. Thus A.D. starts with audio, works with text at the intermediate stage, and outputs audio.
- In dubbing, we start with an audio-visual stream (though the visuals are somewhat less important than in A.D.) and turn the audio into text by transcribing it (typically). We translate that text, then modify and optimize it for clarity and suitability in the target language, which turns the text into a script. The script is then recorded with new actors voicing the parts of the original actors. Thus dubbing starts with audio, works with text in two languages at the intermediate stage, and outputs audio.
- TV closed-captioning of prerecorded programs in North America is generally done using any of several rather primitive DOS programs.
- Real-time captioning of live programs typically uses the same software and hardware with the addition of a very skilled court reporter who enters dialogue into a stenotype machine (along with other annotations necessary to captioning). (That's stenotypy, not typing on a QWERTY keyboard. Stenotypists use compact keyboards that require depressing groups of keys simultaneously to produce a word or part thereof. For further information, look at Gary Robson's captioning FAQ.) Those entries are in shorthand; they are translated into actual words via lookup tables and then sent out for display on a decoder-equipped TV. (This means that homonyms like "four," "for," "fore," "IV," and "4" require distinct keystrokes. It's not exactly easy keeping track of all those keystrokes, which number in the hundreds, any of which could come up at any time in the dialogue being captioned.)
- Closed-captioning in North America is encoded on Line 21 of the vertical blanking interval. The VBI is a narrow band of picture lines, all of which are normally invisible, positioned between the bottom of one frame of the TV picture and the top of the next. (That's not a totally accurate description, but if you have a TV with a vertical-hold control, you can set the picture rolling slowly and see the VBI as a mostly-black bar between the top and bottom of the picture.) North American TV signals are made up of 525 lines (again, not totally accurate); the top 21.5 lines are in the VBI and are ordinarily invisible. (They're not magic. They're perfectly visible if you look for them. It's just that new TV sets are adjusted to keep the VBI out of sight.) Captions are encoded on line 21 of those 21.5 lines. The caption codes take the overt visible form of relatively wide rectangles of light that flit back and forth. Home VCRs have no trouble recording and playing back those signals.
- CC in PAL-standard countries like most of Europe and Australia comes about as an offshoot of the World System Teletext technology. You just tune to a certain page of teletext (888, usually) and you suddenly see captions on any captioned show. This technology uses several lines of the VBI; all the encoding takes the overt visible form of tiny dots in the VBI which are too small for anything but Super-VHS VCRs to record. This is a severe limitation, but there are some provisos to it.
- In all closed-captioning systems where prerecorded programs are concerned, we start with a master tape (or some other medium-- though tape is still used in virtually every case). That master is duplicated twice in a form that contains an explicit timecode (usually SMPTE timecode) for every frame of the video. Captioners use one of those dubs, typically with timecodes visible or burned into the image so the timecodes will be readily available. (Hardware and software are also able to read the timecodes directly; the visible timecodes are something of a backup.) The captioning process takes place on a microcomputer or, particularly in PAL countries, a dedicated workstation. The final result is a text or other computer file replete with timecode markers (for when captions appear and disappear, for example), the actual text of the captions, formatting codes, and the like. This file is mated with the other timecoded master tape, which never has visible timecodes, in a process called encoding, which does exactly what the name says: records the caption codes into the appropriate line(s) of the VBI. The result is an encoded master tape.
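As a rough illustration-- the layout is hypothetical, since every captioning package uses its own proprietary format-- a single entry in such a caption file might pair an in-time and an out-time with the caption text and a formatting note:

    01:02:10:15   01:02:13:02   [pop-on, bottom centre]
    >> ANNOUNCER: We'll be right
    back after these messages.

Multiply that by several hundred entries per program and you have the file the encoder reads against the timecoded master.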
Nonlinear video-editing systems could obviously dispense with the need to duplicate the original master tape, and could conceivably handle the encoding step as well. But there are enough makes and models of captioning software and encoding software and hardware that universal compatibility doesn't exist: Your captioning software has to be compatible with your postproduction house's encoding software and hardware, for example.
- Transcoding a caption file from, say, Line 21 to PAL isn't remotely easy. The number of frames per second differs (30 in NTSC, 25 in PAL and SECAM), and the available character sets and conventions differ. Captioners in PAL countries rarely use uppercase only, for example, so transcoding the U.S. captions of The Simpsons for Australia will produce a tape that clearly betrays its heritage in another country. (Similar issues come up in the oddball U.K. Line 22 system, which I have written about elsewhere.)
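To see part of the problem in numbers: a caption cued exactly one minute into an NTSC program falls at frame 1800 (60 seconds × 30 frames per second), while the same instant in a PAL version falls at frame 1500 (60 × 25). Every timecode in the file therefore has to be recomputed, not just copied across, before the character-set and case conventions even come into play.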
- Typography in both the Line 21 and WST systems is crap. Megacrap, even. Fonts are not under the control of anyone with typographic knowledge or training. Characters are generated only in the decoder or decoder chip, and decoder or chip manufacturers decide what the font will look like. There is no known case in which TV or chip manufacturers have contracted with qualified type designers to create caption fonts.
In all cases, we are talking about fonts reminiscent of dot-matrix printers circa 1982. Most fonts in Line 21 systems do not offer descenders on the lowercase g, y, p, q, and j, making the lowercase so poorly readable that, since Day 1 of Line 21 closed-captioning, captioners have used uppercase for nearly all text even though uppercase is also hard to read. In Line 21, activating or deactivating italics, underlining, or the like inserts a space. Italics simply are not available in PAL World System Teletext captioning. Alignment in Line 21 systems is poor, but by industry agreement captioners will have new codes available to them by 2002 that permit niceties like true centering and right justification. (For further information on this topic, which I should really write a full treatise about, check my article "Typography and TV Captioning," Print, January/February 1989. Also look at the bibliography of captioning articles I've written.)
- Captioning is a huge industry. Effectively all prime-time shows on all U.S. and Canadian networks, virtually everything remotely resembling a newscast, many daytime shows, thousands of home videos, most national commercials, lots of music videos, training tapes, and more are captioned. This is a source of money and a source of intellectual property. But the tools being used for captioning are primitive. (Also, caption quality is generally poor. Don't let anyone tell you otherwise.)
- Audio description on TV is relatively rare. PBS is the biggest source of A.D.; described programs carry a mix of descriptions plus main audio in the Second Audio Program subchannel of stereo TV. (If you have a stereo TV-- most midrange to high-end models are stereo-- you can set it to SAP. That won't do you much good for everyday programs that don't use audio description, though-- only a few stations broadcast in stereo and virtually none use SAP.) The descriptions, then, are "closed": You needn't be bothered with them unless you want to be. Unfortunately, while all TV signals have a VBI, not all have SAP, so audio description is not a ubiquitous medium the way CC is.
- WGBH, the Boston PBS Überstation, is a dynamo in access technology. It is home to the Caption Center (oldest captioner on earth, and the best, though their standards are slipping), the Descriptive Video Service (does A.D. for PBS and other clients, and also sells a small home-video line of movies with always-audible descriptions; DVS is generally quite reliable and often produces stunning work), and the National Center for Accessible Media, which researches new technologies, like Web captioning and captioning in movie houses. Even these people aren't really thinking all that broadly about the potential and future of access technologies, though again that has many provisos.
- To caption a prerecorded program, you transcribe it. Usually the captions are an edited version of that transcript-- reading is slower than speaking, and there are speed limits on caption transmission-- but if you retained a verbatim transcript with all the proper annotations of sound effects (phone ringing, thunder, etc.), speaker identification, and other structural features clearly amenable to SGML encoding, suddenly you have a viable text-only analogue of an audiovisual program.
- It gets better: Audio description typically happens during pauses in dialogue. A.D. scripts, then, are quite short-- up to 100 or 200 bursts of narration. However, it's possible to describe a whole program nonstop, and in fact one project I'm working on will do just that. If you unite either or both of these A.D. scripts (i.e., conventional and continuous description scripts) with the CC script, suddenly you have a rich and complete text-only approximation of an audiovisual program.
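To make that concrete, here is a minimal sketch of what one segment of such a combined document might look like. Every element and attribute name here (segment, caption, sfx, description, speaker, begin, end) is invented purely for illustration; no such DTD actually exists yet.

    <segment>
      <caption begin="01:02:10:15" end="01:02:13:02">
        <speaker>ANNOUNCER</speaker>We'll be right back.
      </caption>
      <sfx begin="01:02:13:10">phone ringing</sfx>
      <description begin="01:02:14:00" end="01:02:17:20">
        She picks up the receiver and frowns.
      </description>
    </segment>

The same timecoded structure could carry the conventional pause-filling descriptions, a continuous description track, or both.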
- What can you do with that information? Archive it, either on the Web or your own computer or elsewhere. Monitor it continuously for keywords. (It is believed that the NSA has done exactly that for years.) Use it for people who don't want to wait 20 minutes to download a choppy videoclip from a Web site. And, of course, use it for its intended purpose, access.
Where research is needed:
- SGML. Markup is needed for everything from what takes the overt form of italics (which have reserved functions in captioning along with all the regular uses of italics in print) to speaker IDs to caption-on and -off times to various annotations for A.D. tracks. How is this useful? Really sophisticated captioning/A.D. software could be developed. More relevantly, existing nonlinear digital video-editing systems (à la Avid, Scitex, and Media 100) and programs like Premiere and Acrobat could be extended to understand SGMLified access codes. This same development process would have to encompass subtitling and dubbing, too, which I am not talking a whole lot about here.
Also, if captions were stored as part of an SGML structure, they could be automatically reformatted in real time for different display devices, like an LED screen (with a character set different from TV's and/or inverted for viewing in a mirrored display), TV pop-up captions, TV scroll-up captions, a continuous text-only stream without paragraph and caption breaks destined for computers, or an offscreen large-print display for visually impaired viewers. Or captions created with one software package could be read and understood by another-- or by another country's system. Right now it is quite tedious to reformat Line 21 CC for PAL CC, and various typographic issues come up here as well.
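As a very rough sketch of the captioning side, a first-draft DTD fragment covering the kind of instance shown earlier might look something like this; every name is provisional, and a real working group would have to settle questions (positioning, colour, and so on) that I'm glossing over:

    <!ELEMENT program     - - (segment+)                     >
    <!ELEMENT segment     - - (caption | description | sfx)+ >
    <!ELEMENT caption     - - (speaker?, (#PCDATA | emph)*)  >
    <!ELEMENT description - - (#PCDATA)                      >
    <!ELEMENT sfx         - - (#PCDATA)                      >
    <!ELEMENT speaker     - - (#PCDATA)                      >
    <!ELEMENT emph        - - (#PCDATA)                      >
    <!ATTLIST caption
              begin     CDATA   #REQUIRED  -- timecode in  --
              end       CDATA   #REQUIRED  -- timecode out --
              position  CDATA   #IMPLIED   -- screen placement -->

(Description and sfx would carry begin/end attributes of their own.) With something like this in place, generating pop-up captions, scroll-up captions, a plain text stream, or a large-print rendering becomes a matter of applying a different transformation to the same data.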
- Web access. Trying to educate Webmasters that the WWW is not an excuse to post pretty pictures is a battle we've already lost. But making those graphics accessible is possible; the WGBH site shows some preliminary techniques, particularly the use of offboard descriptions (look for the D links). It's also possible, though currently difficult, to make Web-based audioclips and videoclips accessible. However, in the absence of a standard like SGML (which may not be sufficient by itself), there is no way to define the data types necessary for access and no way to make such data interoperable and readily translatable. Worse, most software used in the creation and playback or display of graphics, audioclips, and videoclips offers no provisions at all for access technologies, a fact about which most software makers are either ignorant or unrepentant.
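On the graphics front, the D-link convention is simple enough to sketch in ordinary HTML-- a one-character link beside the image leads to a separate page (or anchor) containing a prose description of the picture. The filenames and chart here are placeholders, not WGBH's actual ones:

    <IMG SRC="chart.gif" ALT="Bar chart of captioned programming by network">
    <A HREF="chart-d.html">[D]</A>

Audioclips and videoclips need the equivalent: a transcript or described soundtrack linked from, or bundled with, the clip itself.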
- Subtitling and dubbing are the norm outside English-speaking countries and are not unheard-of within them. Both can be found in the same movie; it is then possible to caption subtitled and/or dubbed movies, and also to audio-describe them. But subtitling and dubbing rely on analogue techniques, like title cameras, typescript, and recording-studio sessions. Both techniques are badly in need of automation; beyond that, if SGML DTDs existed for subtitling and dubbing, it would be easier to create derivative versions and to archive and otherwise make use of the resulting data.
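Purely as a hypothetical sketch, a subtitling DTD might keep the source-language transcription and the target-language text together in a single timecoded element, so one file could drive subtitling, captioning, or a later re-translation:

    <subtitle begin="00:12:04:10" end="00:12:07:05">
      <source lang="fr">Je ne peux pas rester.</source>
      <target lang="en">I can't stay.</target>
    </subtitle>

A dubbing DTD would look much the same, with the target text serving as the recording script rather than as on-screen titles.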
Where things stand
First, look at the detailed discussions I've written of possible uses of SGML in captioning and in audio description (with example).
I am interested in setting up a working group to create SGML structures for only the four access technologies I mentioned. SoftQuad isn't interested. Is anyone else? Let me know. With sufficient interest, I may set up a mailing list to work on these topics; in the interim, consider subscribing to the Media Access mailing list, where we discuss all manner of topics related to captioning, audio description, and other means of making media and information accessible.
I've had a few nibbles from researchers, which I will merely place pointers to here. My mind is aswim with information, and I'm not even remotely close to a point at which I could summarize it for you. Best wade in yourself.
Your options now are:
- Zip directly to HyTime, which can create correspondences between time-based information like music and digital representations of music scores. (HyTime was also used to create the Standard Music Description Language, details of which are available only in a PDF that's 49 pages long-- too lengthy to read on-screen, too lengthy to justify printing out. No easy summary of SMDL appears to exist.)
- Read an E-mail from John Lowe about <timeline>, which uses components of the Text Encoding Initiative for "time-aligned" text and sound.
All four links on the SGML for Access homepage lead here. You can use your browser's back command to retrace your steps, or go back to the Joe Clark homepage or the SGML for Access homepage.