
Reformatting and reuse

The TILE project is concerned with the creation of accessible learning objects, which may be stored in a repository. Video can be a learning object, and the way to make video accessible to deaf people is to caption it. (Sign language is another option in some cases.)

TILE’s Learning Object Repository aims to make it possible to reuse and modify learning objects according to instructor and learner needs, which may include accessibility.

Giving new life to captioned video

Online video with captions could be useful in a number of ways, some of them hypothetical at the moment:

  1. The captions could be used to create a transcript for presentation in a Web browser or for other reuse. Certain people with disabilities (like deaf-blind people) can benefit from transcripts, and transcripts are also useful as feedstock for search engines. They can even be resold and republished.
  2. The captions can be used to develop a master list for subtitling or dubbing. Such a caption script, with timecodes included, is especially helpful in those fields, and is sometimes useful in audio description.
  3. Near-verbatim captions could be edited to easy-reader captions.
  4. Captions could be transcoded to another format, like DVD or television.

Structure and metadata

What’s standing in the way of the easy reuse of captioned video as a learning object is the incompatibility of data formats.

Source files

Original caption files take a number of forms, a fact that is unlikely to change in the foreseeable future.

Even if every set of captions used exactly the same file format, the form and content of different sets might still be incompatible.

It appears that reuse and reformatting of captioned video may not be straightforward for the Repository.

Target files

The forms into which online captions are translated will add further constraints all by themselves.

  1. Plain-text transcripts can be seen as the simplest possible transformation. However, linebreaks and character encoding are important technical obstacles. Many typographic settings (including line length and chunking into paragraphs) can make a text file easier or harder to read.
  2. Plain text is accepted as an unstructured format. You have no choice but to add structure to it if you want to transform it further.
  3. The upside of plain text is that it’s easy to write. A growing number of blogging tools (e.g., Markdown) have been developed to transform easy-to-write plain text that uses certain conventions into clean semantic (X)HTML. (The separate format=flowed specification is a kind of semistructured plain text.)
  4. Converting caption files into full transcripts for use on the Web requires a knowledge of semantic HTML, which can at least be semi-automated.
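
The linebreak and encoding obstacles in the list above can be illustrated with a minimal Python sketch. It assumes the caption text arrives as raw bytes in a known encoding (UTF-8 here, purely as an assumption) and that blank lines separate paragraphs. The function names and the 66-character line length are illustrative choices, not part of any caption standard.

```python
import re
import textwrap


def normalize_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes and convert CRLF and bare CR line endings to LF.

    Undecodable bytes are replaced rather than raising, which is an
    assumption that suits a rough first pass, not archival work.
    """
    text = raw.decode(encoding, errors="replace")
    return text.replace("\r\n", "\n").replace("\r", "\n")


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and collapse whitespace inside each paragraph."""
    chunks = re.split(r"\n\s*\n", text)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def rewrap(text: str, width: int = 66) -> str:
    """Re-flow every paragraph to a fixed line length (hypothetical default)."""
    return "\n\n".join(
        textwrap.fill(paragraph, width=width)
        for paragraph in split_paragraphs(text)
    )


raw = b"First caption line\r\nsecond caption line\r\n\r\nA new paragraph\r"
print(rewrap(normalize_text(raw)))
```

Even this small amount of cleanup only recovers paragraphs. Headings, speakers, and quotations still have to be identified some other way before the text can be transformed further.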

Case study: Full-text transcripts

By far the highest-profile transformation is turning captions into full-text transcripts, since transcripts are permitted as an accessibility measure under WCAG 1.0. Many sites with captioned video also provide transcripts, and some sites provide transcripts instead of captioning.


We draw a distinction between two kinds of transcripts.

Machine-readable transcripts: A benefit of a transcript file is searchability. Any file in a format understandable by search engines (which really means Google) will function as a searchable transcript. (Google’s file-reading capabilities alone mean nearly every common format can be used, including many word-processing files, PDF, and PostScript.) No structure whatsoever is actually needed.

Human-usable transcripts: These transcripts are intended for human readers, who may avail themselves of assistive technology like screen readers. Typography and presentation become important, but underlying structure becomes even more so.

A human-usable transcript will usually be machine-readable, unless it’s presented in a format machines cannot understand (e.g., a PDF made from scanned pages or faxes).

Most machine-readable transcripts can also be read by people, but that doesn’t mean it’s a pleasant experience – and for learning-disabled persons, screenfuls of plain text are quite inaccessible. Simple text dumps of Line 21 real-time caption files are difficult to read, since they’re usually set in all-capitals with certain artifacts of closed captioning intact (as, for example, the use of >> to denote a speaker change). These would not be considered native Web documents, given their perfunctory use of HTML and poor typographic presentation.
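
As a rough illustration of what cleaning up such a text dump involves, the Python sketch below splits a dump into speaker turns at each chevron marker and converts all-capitals text to sentence case. The function names are hypothetical, and the approach is deliberately naive: it treats any run of two or more chevrons as a turn boundary, and it cannot restore capitals on proper nouns or the pronoun "I".

```python
import re


def split_speaker_turns(dump: str) -> list[str]:
    """Split a caption text dump into turns at each run of two or more '>'."""
    flat = " ".join(dump.split())  # undo caption-width line breaks
    turns = [turn.strip() for turn in re.split(r">{2,}", flat)]
    return [turn for turn in turns if turn]


def sentence_case(text: str) -> str:
    """Lowercase the text, then capitalise the first letter of each sentence."""
    lowered = text.lower()
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda match: match.group(1) + match.group(2).upper(),
        lowered,
    )


dump = ">> GOOD EVENING. HERE IS THE NEWS.\n>> THANK YOU, PETER."
for turn in split_speaker_turns(dump):
    print(sentence_case(turn))
```

The second turn comes out as "Thank you, peter.", which shows why automated case conversion still needs a human editor or a list of known names.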

Other accessibility requirements

Alternatives to inaccessible Web content – e.g., transcripts of uncaptioned video – also have to meet accessibility standards. Web Content Accessibility Guidelines Priority 2 requires valid, semantic markup. That applies equally to transcripts created from captions.

Semantic markup is a term that has grown in currency among standards-compliant Web developers, who are now quite numerous. (Wayne Burkett provides a good introduction and links.) Paul Prescod defines “semantic markup” thus:

It is markup which captures sufficient amounts of the structure of the document to allow a reasonably broad set of automated processes to do their jobs.

For Web sites, semantic markup requires authors to use the actual and most correct element or attribute for the content they’re marking up. Not every chunk of text is a paragraph, for example; some text is an address, a heading, a citation, a quotation, or a list item. Since HTML has a limited range of elements and attributes, sometimes an author will have to make a best guess based on the definitions in the HTML specification.
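
To make the idea of choosing elements by meaning concrete, here is a minimal Python sketch that marks up spoken dialogue as a definition list, an application the HTML 4 specification itself suggests (each dt names a speaker, each dd contains that speaker's words). The function name, the class value, and the choice of an h2 for the title are assumptions made for illustration. It is one defensible best guess, not the only correct markup.

```python
import xml.etree.ElementTree as ET


def transcript_xhtml(title: str, turns: list[tuple[str, str]]) -> str:
    """Build a well-formed XHTML fragment: a heading plus a dl of speaker turns."""
    root = ET.Element("div", {"class": "transcript"})
    ET.SubElement(root, "h2").text = title
    dialogue = ET.SubElement(root, "dl")
    for speaker, words in turns:
        ET.SubElement(dialogue, "dt").text = speaker  # who is speaking
        ET.SubElement(dialogue, "dd").text = words    # what was said
    # ElementTree escapes characters such as & and < in text content.
    return ET.tostring(root, encoding="unicode")


print(transcript_xhtml(
    "Interview",
    [("Host", "Welcome & hello."), ("Guest", "Thanks.")],
))
```

Building the fragment with an XML library rather than string concatenation keeps the output well-formed, which is a precondition for validity though not a guarantee of it.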

For XHTML documents, things are more complex. XHTML is HTML and XML at the same time. XML, by definition, is eXtensible. In theory, you can define your own elements, though in practice few do. (One counterexample is custom-hacking a DTD to make the embed element legal.)

Also in theory, different flavours of XML should be translatable. In online captioning, in principle a device should be able to translate SMIL into XHTML or to XHTML+SMIL, since all of those are XML. (At least one such converter has been published, though at time of writing the actual utility was unavailable on the Web.) Worse yet, the World Wide Web Consortium hasn’t even published a DTD for XHTML+SMIL!

But these converters are unlikely to understand the semantics of the underlying caption file. The XHTML they produce may be minimal or semantically incorrect (or faux-generic, as with the use of p or pre elements throughout). The converters will use XHTML elements, but won’t necessarily use the correct elements. Still, it is possible to list some requirements for a future conversion system.

Functional spec

A system that converts from a SMIL caption file into XHTML should meet these requirements:

Not a simple transformation

Turning captions into transcripts has been presented as an easy and beneficial consequence of doing captioning in the first place. It’s certainly beneficial, but doing it right is not easy.

It may be too much to expect amateur captioners to produce valid SMIL and valid, semantic XHTML transcripts. To make the process easy for authors, a write-once/use-many approach is what we need. We could imagine a transformation utility like Markdown that lets people transcribe audio in a lightly-modified plain text that could be published as-is and also autoconverted to SMIL and XHTML. This is, however, merely hypothetical.
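
A purely hypothetical sketch of the write-once/use-many idea follows, in Python. It assumes an invented plain-text convention in which each line reads `[minutes:seconds] Speaker: words`; no existing tool or standard defines this format. One parse feeds two outputs: timed cues (the raw material a real system would serialise to SMIL, SAMI, or another caption format) and an XHTML definition list for the transcript.

```python
import html
import re
from typing import NamedTuple


class Turn(NamedTuple):
    seconds: int   # start time, in seconds from the beginning
    speaker: str
    words: str


# Hypothetical line convention: "[1:10] Guest: Glad to be here."
LINE = re.compile(r"^\[(\d+):(\d{2})\]\s*([^:]+):\s*(.+)$")


def parse(source: str) -> list[Turn]:
    """Read the plain-text source, skipping lines that do not fit the convention."""
    turns = []
    for line in source.splitlines():
        match = LINE.match(line.strip())
        if match:
            minutes, secs, speaker, words = match.groups()
            start = int(minutes) * 60 + int(secs)
            turns.append(Turn(start, speaker.strip(), words.strip()))
    return turns


def to_cues(turns: list[Turn]) -> list[tuple[int, str]]:
    """Timed cues: (start time in seconds, caption text)."""
    return [(turn.seconds, f"{turn.speaker}: {turn.words}") for turn in turns]


def to_xhtml(turns: list[Turn]) -> str:
    """Transcript as a definition list, with text content escaped."""
    items = "".join(
        f"<dt>{html.escape(turn.speaker)}</dt><dd>{html.escape(turn.words)}</dd>"
        for turn in turns
    )
    return f"<dl>{items}</dl>"


source = "[0:05] Host: Welcome back.\n[1:10] Guest: Glad to be here."
turns = parse(source)
print(to_cues(turns))
print(to_xhtml(turns))
```

The source text stays readable and publishable as-is, and the author never has to touch SMIL or XHTML directly. Caption end times, line breaks within captions, and non-speech information are all left out of this sketch.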

Implications for the Learning Object Repository

The Learning Object Repository should always attempt to retain an original file format and any helpful transformations. In this case, original caption files in any format (SMIL, SAMI, or whatever else) should always be retained, with due care to preserve binary files intact.

But when it comes to transformations, a repository of caption files can use any format that machines can understand. If “machines” means search engines in this case, and “search engines” means Google, nearly any common format could be stored – plain text, Word or PDF files, XHTML. Nonetheless, for human readability and longevity of the saved transcripts, the Repository should prefer valid, semantic XHTML caption transcripts even if they are presently difficult to create.

Closed vs. open captioning

It may seem as though captions displayed using players’ own functions are more desirable to the Learning Object Repository than burned-in open captions. That’s because the players’ files are composed of actual characters rather than pictures of characters; real characters are easier to transform.

But as we have shown, what’s actually important is valid, semantic XHTML transcripts, or, under battle conditions, plain text or some other form. Burned-in open captions also begin their lives as computer files and are not necessarily any harder to convert to desirable forms. There is no reason related to longevity and reuse to prefer player closed captions over burned-in open captions.