
Reformatting and reuse

The TILE project is concerned with the creation of accessible learning objects, which may be stored in a repository. Video can be a learning object, and the way to make video accessible to deaf people is to caption it. (Sign language is another option in some cases.)

TILE’s Learning Object Repository aims to make it possible to reuse and modify learning objects according to instructor and learner needs, which may include accessibility.

Giving new life to captioned video

Online video with captions could be useful in a number of ways, some of them hypothetical at the moment:

The captions could be used to create a transcript for presentation in a Web browser or for other reuse. Certain people with disabilities (like deaf-blind people) benefit from transcripts, and transcripts are also useful as feedstock for search engines. They can even be resold and republished.
The captions can be used to develop a master list for subtitling or dubbing. Such a caption script, with timecodes included, is especially helpful in those fields, and is sometimes useful in audio description.
Near-verbatim captions could be edited to easy-reader captions.
Captions could be transcoded to another format, like DVD or television.

Structure and metadata

What’s standing in the way of the easy reuse of captioned video as a learning object is the incompatibility of data formats.

Source files

Original caption files take a number of forms, a fact that is unlikely to change in the foreseeable future.

The original transcription that is the basis for a segment of captioned video may be plain text – in any of several encodings – or created in a word processor (invariably Microsoft Word).
For a transcription to function as captions, structure must be added: timing, positioning, chunking, editing, speaker IDs, and non-speech information (NSI), among other things.
Two structured file formats for online captioning are SMIL and SAMI. More standards are on the horizon, including Timed Text.
But many other captioning and subtitling formats are in use – so many that any high-volume captioning project is sure to come up against format incompatibilities.
Even data formats ostensibly based on the same language – like SMIL and XHTML, both of which are also XML – can be incompatible. (SAMI is merely XML-like.) XML is so generalized that one form does not immediately transform into another.

Even if every set of captions used exactly the same file format, the form and content of different sets might still be incompatible.

Some captions are customized for the intended display device:
- Line 21 captions can have italics and differentiate speakers by position and explicit speaker ID (itself differentiable from other text by punctuation or case); 32 monospaced characters appear in a maximum of four lines out of a possible 15, among many other constraints.
- Rear Window® captions are offscreen, have no italics, and are limited to three lines. Vertical positioning is all but nonexistent.
- DVD subpictures may actually be same-language subtitles, with invariant bottom-centre positioning, no speaker IDs or NSI, and many uncaptioned words and phrases.
- SAMI captions may assume that an explicit speaker ID will be displayed with every utterance by that speaker.
Most captioners edit verbatim captions at least some of the time. Some captioners (and, in some countries, most captioners) heavily edit the verbatim speech. Your options to transform a caption file will depend on which caption file you have to work with. (Edited captions make for poor transcripts.)
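
The device constraints listed above can be checked mechanically before a caption is sent to a given display. What follows is a minimal sketch in Python, not a description of any real captioning tool. The function name fits_device and the LIMITS table are hypothetical. The numbers come from the list above (32 characters and four lines for Line 21, three lines for Rear Window); no line length is given for Rear Window, so none is enforced.

```python
# Hypothetical limits table, filled in only from the figures stated in the text.
LIMITS = {
    "line21": {"max_lines": 4, "max_chars": 32},
    # The text gives no line length for Rear Window, so that check is skipped.
    "rear_window": {"max_lines": 3, "max_chars": None},
}


def fits_device(caption, device):
    """Return True if a caption (lines separated by newlines) fits the device."""
    limits = LIMITS[device]
    lines = caption.split("\n")
    if len(lines) > limits["max_lines"]:
        return False
    max_chars = limits["max_chars"]
    return max_chars is None or all(len(line) <= max_chars for line in lines)


four_lines = "One\nTwo\nThree\nFour"
print(fits_device(four_lines, "line21"))       # four lines fit Line 21
print(fits_device(four_lines, "rear_window"))  # but exceed Rear Window's three
```

A check like this only says whether a caption fits. Deciding how to rechunk or edit a caption that does not fit is an editorial decision the sketch does not attempt.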

It appears that reuse and reformatting of captioned video may not be straightforward for the Repository.

Target files

The forms into which online captions are translated will add further constraints all by themselves.

Plain-text transcripts can be seen as the simplest possible transformation. However, linebreaks and character encoding are important technical obstacles. Many typographic settings (including line length and chunking into paragraphs) can make a text file easier or harder to read.
Plain text is accepted as an unstructured format. You have no choice but to add structure to it if you want to transform it further.
The upside of plain text is that it’s easy to write. A growing number of blogging tools (e.g., Markdown) have been developed to transform easy-to-write plain text that uses certain conventions into clean semantic (X)HTML. (The separate format=flowed specification is a kind of semistructured plain text.)
Converting caption files into full transcripts for use on the Web requires a knowledge of semantic HTML, though the conversion itself can at least be semi-automated.
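
To make the plain-text case concrete, here is a minimal sketch in Python of the simplest transformation: joining caption chunks into wrapped paragraphs, one per speaker turn, with an explicit choice of linebreak and character encoding. It assumes the captions have already been reduced to (speaker, text) pairs; that input shape, and the names cues_to_plain_text and to_bytes, are illustrative assumptions rather than part of any caption format.

```python
import textwrap


def cues_to_plain_text(cues, width=72):
    """Join (speaker, text) caption chunks into wrapped paragraphs.

    A new paragraph starts whenever the speaker changes. A speaker of
    None means no speaker ID was available for that chunk.
    """
    paragraphs = []
    current = object()  # sentinel that never equals a real speaker
    for speaker, text in cues:
        if speaker != current:
            paragraphs.append((speaker, []))
            current = speaker
        paragraphs[-1][1].append(text.strip())

    blocks = []
    for speaker, chunks in paragraphs:
        body = " ".join(chunks)
        if speaker:
            body = f"{speaker}: {body}"
        blocks.append(textwrap.fill(body, width=width))
    return "\n\n".join(blocks) + "\n"


def to_bytes(text, newline="\n"):
    """Encode as UTF-8 with one explicit linebreak convention."""
    return text.replace("\n", newline).encode("utf-8")


cues = [("Chris", "Hello there."), ("Chris", "How are you?"), ("Pat", "Fine.")]
print(cues_to_plain_text(cues))
```

The line width of 72 and the choice of UTF-8 are arbitrary defaults here. The point is only that line length, paragraphing, linebreaks, and encoding all have to be decided by someone, even for plain text.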

Case study: Full-text transcripts

By far the highest-profile transformation is turning captions into full-text transcripts, since transcripts are permitted as an accessibility measure under WCAG 1.0. Many sites with captioned video also provide transcripts, and some sites provide transcripts instead of captioning.

Readability

We draw a distinction between two kinds of transcripts.

Machine-readable: A benefit of a transcript file is searchability. Any file in a format understandable by search engines (which really means Google) will function as a searchable transcript. (Google’s file-reading capabilities alone mean nearly every common format can be used, including many word-processing files, PDF, and PostScript.) No structure whatsoever is actually needed.
Human-usable: These transcripts are intended for human readers, who may avail themselves of assistive technology like screen readers. Typography and presentation become important, but underlying structure becomes even more so.

A human-usable transcript will usually be machine-readable, unless it’s presented in a format machines cannot understand (e.g., a PDF made from scanned pages or faxes).

Most machine-readable transcripts can also be read by people, but that doesn’t mean it’s a pleasant experience – and for learning-disabled persons, screenfuls of plain text are quite inaccessible. Simple text dumps of Line 21 real-time caption files are difficult to read, since they’re usually set in all-capitals with certain artifacts of closed captioning intact (as, for example, the use of >> to denote a speaker change). These would not be considered native Web documents, given their perfunctory use of HTML and poor typographic presentation.
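
As an illustration of the artifacts just described, the following Python sketch splits a Line 21 real-time text dump into speaker turns at each >> marker. It is a rough sketch under simple assumptions: the dump is a single string, every >> marks a speaker change, and caption linebreaks carry no meaning in a transcript. The function name split_speaker_turns is hypothetical.

```python
def split_speaker_turns(dump):
    """Split a real-time caption text dump into turns at each '>>' marker."""
    # Collapse the caption linebreaks, which reflect the display, not the prose.
    text = " ".join(dump.split())
    turns = (turn.strip() for turn in text.split(">>"))
    # Drop the empty piece that precedes a leading '>>'.
    return [turn for turn in turns if turn]


dump = ">> GOOD EVENING AND\nWELCOME.\n>> THANK YOU."
for turn in split_speaker_turns(dump):
    print(turn)
```

The result is still all-capitals text with no speaker names attached, so it is machine-readable at best. Making it human-usable is a separate step.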

Other accessibility requirements

Alternatives to inaccessible Web content – e.g., transcripts of uncaptioned video – also have to meet accessibility standards. Web Content Accessibility Guidelines Priority 2 requires valid, semantic markup. That applies equally to transcripts created from captions.

Semantic markup is a term that has grown in currency among standards-compliant Web developers, who are now quite numerous. (Wayne Burkett provides a good introduction and links.) Paul Prescod defines “semantic markup” thus:

It is markup which captures sufficient amounts of the structure of the document to allow a reasonably broad set of automated processes to do their jobs.

For Web sites, semantic markup requires authors to use the actual and most correct element or attribute for the content they’re marking up. Not every chunk of text is a paragraph, for example; some text is an address, a heading, a citation, a quotation, or a list item. Since HTML has a limited range of elements and attributes, sometimes an author will have to make a best guess based on the definitions in the HTML specification.

For XHTML documents, things are more complex. XHTML is HTML and XML at the same time. XML, by definition, is eXtensible. In theory, you can define your own elements, though in practice few do. (One counterexample is custom-hacking a DTD to make the embed element legal.)

Also in theory, different flavours of XML should be translatable. In online captioning, in principle a device should be able to translate SMIL into XHTML or to XHTML+SMIL, since all of those are XML. (At least one such converter has been published, though at time of writing the actual utility was unavailable on the Web.) Worse yet, the World Wide Web Consortium hasn’t even published a DTD for XHTML+SMIL!

But these converters are unlikely to understand the semantics of the underlying caption file. The XHTML they produce may be minimal or semantically incorrect (or faux-generic, as with the use of p or pre elements throughout). The converters will use XHTML elements, but won’t necessarily use the correct elements. Still, it is possible to list some requirements for a future conversion system.

Functional spec

A system that converts from a SMIL caption file into XHTML should meet these requirements:

Comply fully with the Authoring Tool Accessibility Guidelines.
Detect the programming genre the SMIL file represents. From that data, select a particular XHTML document style:
- For lectures or similar videos with few speakers with extended utterances, headings may suffice for speaker names and paragraphs for the words spoken.
- For presentations with several speakers and/or a great deal of non-speech information, the definition list is the correct markup by spec, with speaker IDs and NSI marked up as dt and utterances as dd.
- Tables may be used, particularly in XHTML documents that unite caption and audio-description text. Speakers can be marked up as th scope="row", utterances as td, and description text as multicolumn td colspan="2". This verbose option, which can be tedious to handle in screen readers, should be used when no other format fits the document structure.
The system may ask the author to select from the above templates, or it may also use logic to auto-sense the correct document type.
Optionally convert all-capitals text (in languages with such a thing as all capitals) into mixed case, taking care of common exceptions in that language (English I; French adjective/noun pairs like anglais/Anglais).
Interpret text-only speaker IDs and non-speech information and provide markup, possibly with author intervention. For example, >> in Line 21 real-time captioning indicates a speaker change, and words immediately following those characters may be the name of the speaker. The system may need to ask for human interpretation of structurally similar phrases like [ Laughs ] and [ Chris ].
Generate valid code by default even if the program makes occasional mistakes in semantic interpretation. The system should make a best guess for utterances, IDs, NSI, or passages and permit authors to override them.
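
Three of the requirements above can be sketched in a few lines of Python: case conversion, interpretation of bracketed text, and definition-list output. This is a sketch under stated assumptions, not a working converter. All function names are hypothetical, the caption file is assumed to have been reduced already to (label, utterance) pairs, and the lists of known speakers and known non-speech terms are assumed to be supplied by the author.

```python
import re
from html import escape


def to_mixed_case(text):
    """Convert all-capitals English text to sentence case.

    Only the English pronoun "I" is restored; proper nouns are lost
    and would still need a human or a dictionary.
    """
    lowered = text.lower()
    sentence_case = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )
    return re.sub(r"\bi\b", "I", sentence_case)


def classify_bracketed(token, known_speakers, known_nsi):
    """Classify text like '[ Chris ]' or '[ Laughs ]'.

    Returns 'speaker', 'nsi', or 'ask_author' when the system cannot tell.
    Both lookup sets are expected to hold lowercase strings.
    """
    label = token.strip("[] ").lower()
    if label in known_speakers:
        return "speaker"
    if label in known_nsi:
        return "nsi"
    return "ask_author"


def turns_to_definition_list(turns):
    """Render (label, utterance) pairs as dt/dd; NSI has no utterance."""
    lines = ["<dl>"]
    for label, utterance in turns:
        lines.append(f"  <dt>{escape(label)}</dt>")
        if utterance is not None:
            lines.append(f"  <dd>{escape(utterance)}</dd>")
    lines.append("</dl>")
    return "\n".join(lines)


print(to_mixed_case("HELLO. I THINK I'M LATE."))
print(classify_bracketed("[ Pat ]", {"chris"}, {"laughs"}))
print(turns_to_definition_list([("Chris", "Tom & Jerry"), ("[ Laughs ]", None)]))
```

Escaping every string on output keeps the fragment well-formed even when the semantic guess is wrong, which is the spirit of the last requirement. The 'ask_author' result is the hook for human intervention.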

Not a simple transformation

Turning captions into transcripts has been presented as an easy and beneficial consequence of doing captioning in the first place. It’s certainly beneficial, but doing it right is not easy.

It may be too much to expect amateur captioners to produce valid SMIL and valid, semantic XHTML transcripts. To make the process easy for authors, a write-once/use-many approach is what we need. We could imagine a transformation utility like Markdown that lets people transcribe audio in a lightly modified plain text that could be published as-is and also autoconverted to SMIL and XHTML. This is, however, merely hypothetical.
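
To show what such a utility might accept as input, here is a purely hypothetical plain-text convention and a Python sketch that parses it. The convention (one cue per line, a timecode in hours:minutes:seconds, an optional "Speaker:" prefix, then the text) is invented for this illustration and is not an existing format. The names Cue and parse_light_transcript are likewise assumptions.

```python
import re
from typing import NamedTuple, Optional


class Cue(NamedTuple):
    start: int              # seconds from the start of the video
    speaker: Optional[str]  # None when the line has no speaker prefix
    text: str


# Invented convention: "HH:MM:SS Speaker: text" or "HH:MM:SS text".
LINE = re.compile(r"^(\d+):(\d{2}):(\d{2})\s+(?:([^:]+):\s+)?(.+)$")


def parse_light_transcript(source):
    """Parse the hypothetical plain-text convention into a list of cues."""
    cues = []
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are allowed for readability
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {raw!r}")
        hours, minutes, seconds, speaker, text = match.groups()
        start = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        cues.append(Cue(start, speaker, text))
    return cues


sample = "00:00:05 Chris: Hello there.\n\n00:01:09 [ Laughs ]\n"
for cue in parse_light_transcript(sample):
    print(cue)
```

Once the text is in a structure like this, timing is available for a captioning format and speaker and text for a transcript. One known weakness of the sketch is that any utterance beginning with a word followed by a colon and a space would be misread as a speaker ID, which is the kind of ambiguity a real convention would have to settle.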

Implications for the Learning Object Repository

The Learning Object Repository should always attempt to retain an original file format and any helpful transformations. In this case, original caption files in any format (SMIL, SAMI, or whatever else) should always be retained, with due care to maintain binary files.

But when it comes to transformations, a repository of caption files can use any format that machines can understand. If “machines” means search engines in this case, and “search engines” means Google, nearly any common format could be stored – plain text, Word or PDF files, XHTML. Nonetheless, for human readability and longevity of the saved transcripts, the Repository should prefer valid, semantic XHTML caption transcripts even if they are presently difficult to create.

Closed vs. open captioning

It may seem as though captions displayed using players’ own functions are more desirable to the Learning Object Repository than burned-in open captions. That’s because the players’ files are composed of actual characters rather than pictures of characters; real characters are easier to transform.

But as we have shown, what’s actually important is valid, semantic XHTML transcripts, or, under battle conditions, plain text or some other form. Burned-in open captions also begin their lives as computer files and are not necessarily any harder to convert to desirable forms. There is no reason related to longevity and reuse to prefer player closed captions over burned-in open captions.