Joe Clark: Media access

Updated 2003.08.28

Major errors in WGBH guidelines for accessible (talking) DVD/STB menus

Where to find the original

Read WGBH’s “Developer’s Guide to Creating Talking Menus for Set-Top Boxes and DVDs.”

Errors and comments

Talking menus tied to description

The guidelines fail to point out an obvious fact: Don’t bother making DVD talking menus if the contents of the DVD aren’t accessible to the blind. In other words, the only DVDs that require audiovisual menus are those whose feature video carries audio description.

Since there are so few discs with description that I can manually maintain the list, I wish the guidelines had made it clear that the accessibility feature of talking menus is contingent on another accessibility feature, audio description.

The first task, then, is to get more DVDs released with description (especially movies that were described for first run). Then developers can start to worry about talking menus, which these selfsame guidelines do not equip them to create.

More accessible DVDs than WGBH admits

The guidelines continually state that a mere four DVD sets have been produced with audiovisual menus. That’s false and belittles even WGBH’s own achievements, let alone everybody else’s.

Yes, Lincolns, Garvey, Chicago: City of the Century, and Partners of the Heart have audiovisual menus that were designed, at least in part, by GBH. But so does The Grinch, a disc with audio descriptions, audiovisual menus, and (some) captioning by WGBH.

Meanwhile, the German DVD of Dancer in the Dark has an “Audiovisuell Menu-System,” as does an upcoming line of instructional DVDs by DeKalb Video Productions.

Hence, DVDs with talking menus are still rare, but not as rare as the guidelines suggest. The full list of known DVDs with accessible navigation could easily have been given.

Getting lost

The guidelines don’t explain what an author should do, apart from providing a repeat function, to keep users from getting lost. But the repeat function itself is a problem: The author would have to program a keystroke to load not only the graphic menu but probably an entirely different audio stream, since the guideline calls for the name of the menu and the current message to be revoiced. In all likelihood, the last thing the user heard was merely the message, not message plus name of screen.

(Recall that, as the guidelines explain, you can attach only a single sound file to each menu. To simulate the voicing of multiple menu items, you must clone the graphical menu and attach a different voice stream to each one.)
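The cloning trick can be pictured with a small sketch. This is plain Python for illustration, not DVD authoring code; the names `MenuClone` and `build_clones`, the menu labels, and the audio filenames are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MenuClone:
    """One copy of the graphical menu carrying its single permitted sound file."""
    highlighted: str  # the item shown as selected on this copy
    audio_clip: str   # hypothetical filename of the one attached clip


def build_clones(screen_name, items):
    """Return one menu clone per item, since each menu can hold only one clip."""
    return [
        MenuClone(highlighted=item, audio_clip=f"{screen_name}_{index}.ac3")
        for index, item in enumerate(items, start=1)
    ]


clones = build_clones("main", ["Play Program", "Chapter Selections", "Accessibility"])
for clone in clones:
    print(clone.highlighted, "->", clone.audio_clip)
```

Three visible items already mean three copies of the same picture, each differing only in its highlight and its sound file.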

The user probably knows what screen he or she is looking at. A really good help system would use two stages:

  1. Hit the Repeat key (if such a thing were possible) to restate the previous message. (That should already be the sound file that was loaded with the duplicate of the graphical menu.)
  2. Hit the key again to restate the name of the screen.

The permutations involved in creating this kind of programming would be unwieldy. For each menu region on a screen, you’d need to program two other states – reiteration of the current audio stream (the spoken name of the menu item) plus another stream that articulates merely the name of the screen. Nobody’s gonna bother.
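To put a rough number on that, here is a back-of-envelope count, assuming each menu item needs its base state plus the two extra states just described. The function name `repeat_states` and the screen sizes are hypothetical, chosen only for illustration.

```python
def repeat_states(items_per_screen):
    """Return (base, extra) state counts for a disc.

    base  = one state per menu item (the ordinary voiced clone)
    extra = two more per item: repeat-the-message and name-the-screen
    """
    base = sum(items_per_screen)
    extra = 2 * base
    return base, extra


# Hypothetical disc with three screens of 6, 18, and 4 items.
base, extra = repeat_states([6, 18, 4])
print(f"{base} ordinary states, {extra} extra, {base + extra} in total")
```

Under these assumptions the two-stage repeat triples the number of states an author has to build and test.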

Moreover, a separate guideline advises:

Repeat the navigation instructions as often as seems appropriate for the expected use pattern. For example, a DVD that someone might pick up only occasionally should frequently remind the user of what he or she needs to do next.

Since we’re also told to remind people what screen they’re on even while they’re on it, I do not understand what kind of disc might be a heavy-use title that could get by with fewer reminders.

The whole point is moot if the disc explains how to use itself once at the opening menu.

Few-switch operation

How to avoid getting the user lost impinges on another issue: Few-switch operation.

The guidelines spend a lot of time discussing operation of DVDs or STBs with few switches – just a couple of the buttons on a remote control. But some remote controls have many buttons that could be used. For STBs, often the subscriber receives a custom remote control. Mine has 23 keys other than arrows, numbers, Select, and Power, and at least seven of those are remappable function keys.

The guidelines state:

But Joe can interrupt the sequence at any time simply by pressing one of the arrow keys on the remote. A single down-arrow press will interrupt the audio output and put the STB talking menu into a suspended state. Another single down-arrow will read the next channel entry. Two down-arrows in quick succession will skip to the next time slot. Three down-arrows in quick succession will skip to the next day.

While this is similar to clicking, double-clicking, and triple-clicking in a graphical user interface, it is confusing, to say the least, to use a key that implies motion (down-arrow) to pause. It would be easy for a novice user to press down-arrow intending to go to the next item, only to find that the system has paused. Sure, pressing the key again will resume the system, but it’s too confusing. (Keys that do opposite things depending on whether they’re pressed an even or odd number of times are a known usability problem.)
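A small model makes the overloading visible. This is only a sketch of the quoted behaviour: the 0.4-second threshold for “quick succession” is my assumption, as is the idea that the menu stays suspended after the first press, since the guidelines specify neither. The names `group_presses` and `interpret` are hypothetical.

```python
def group_presses(timestamps, gap=0.4):
    """Split sorted key-press times (seconds) into bursts of quick presses."""
    bursts = []
    for t in timestamps:
        if bursts and t - bursts[-1][-1] <= gap:
            bursts[-1].append(t)   # close enough to the previous press
        else:
            bursts.append([t])     # start a new burst
    return bursts


def interpret(timestamps, gap=0.4):
    """Map bursts of down-arrow presses to actions, per the quoted passage."""
    suspended = False
    actions = []
    for burst in group_presses(timestamps, gap):
        count = len(burst)
        if count == 1 and not suspended:
            suspended = True               # first lone press only pauses
            actions.append("suspend")
        elif count == 1:
            actions.append("next entry")   # same key, different result
        elif count == 2:
            actions.append("next time slot")
        else:
            actions.append("next day")
    return actions


print(interpret([0.0, 2.0]))       # two lone presses, two different meanings
print(interpret([0.0, 0.2, 0.4]))  # one burst of three
```

In this model, what a single down-arrow does depends on hidden state (whether the menu is already suspended), which is exactly what a novice cannot see or hear.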

And at any rate, the guidelines contradict themselves later:

Using the remote, he steps through the list and selects Movies. Once again, he listens as the talking menu reads down the list of current programs. One choice sounds intriguing and he pauses to learn more. Using the remote he highlights the title of the movie and presses the Select key on the remote. The system responds by giving him a detailed description of the film.

If pausing “will interrupt the audio output,” how can Joe “pause to learn more”? Here he can presumably “highlight” the movie because he already heard its name, so he can up-arrow to reselect it. Or is the cursor on the movie he’s interested in? After all, he paused. Why should Select provide a description rather than selecting the film – that is, tuning it in?

But the guidelines also tell us:

Provide an easy-to-use repeat function on the remote control. The user should be able to ask the system to repeat the previous message. Repeat the name of the current screen when repeating the previous message, as a means of preventing the user from becoming lost in the menu tree.

If we’re doubling up functions on arrow keys, why are we assigning another key the function of a repeat sequence? What other key will that be, especially on a DVD player where the spec requires a minimum range of keys all of which are already spoken for?

The issue continues:

Questions, then:

Haven’t we run out of keys on the remote by now? The guidelines suggest keystrokes for:

  1. Voice navigation on/off at startup
  2. Voice navigation on/off at any time during disc play
  3. Pause/resume
  4. Next entry
  5. Next timeslot
  6. Next day
  7. Information (modeled by WGBH as Select)
  8. Repeat

Yet after listing, in bits and pieces throughout the document, a new set of control keys that needs to be devised, the guidelines also tell us:

Make key sequences on the remote control as simple as possible. Try to achieve all necessary functionality using only the arrow keys and the Select key.

Audio feedback

Always respond to a user choice. When moving from menu item to menu item, the interface should announce each item. When selecting an item from a menu, the interface should acknowledge that choice before taking action.

How does the system “acknowledge that choice”? With a beep? Won’t that get annoying?

Why do none of the WGBH-produced accessible DVDs I have actually acknowledge choices?

This seems to be an issue of keystroke feedback. That’s the responsibility of the hardware manufacturer. I doubt that blind viewers need voices, beeps, and their own sensation of detents or clicking in their remote controls all at once.

Conclusions on wayfinding and keystroke usage

The issue of keeping a user from getting lost, and using as few or as many keys as necessary to do it, impinges on a kind of audio wayfinding. But our hands are tied due to the hardware and software specs.

It seems unrealistic to assume that audiovisual DVDs can be fully used with just the arrow keys, the Select key, and maybe the number keys (or just the number 1 key). Here, “fully used” means “with all the error-correction and user-interface features we’d like.”

We find the same problem in screen readers, which must double up on an already-overloaded QWERTY keyboard for operation. Really, in both these cases we need a separate keypad.

One model is that of the Kurzweil Reading Machine, that wondrous old device that scanned and read print out loud. Its control keypad used a Nominator key to identify every key on the keyboard. Press Nominator and then another key and the machine would say the name of the second key. (Yes, you could press Nominator twice.)

In a similar vein, DVD authors could hack together a script that turned, say, the little-used, little-understood Return key on a DVD remote into a kind of Help key. Press it once to repeat the current message, press it and up-arrow to explain the screen you’re on. That sort of thing.

Since the guidelines mention (barely) WGBH’s own convention of using the key sequence 1, Select to turn audio navigation on or off, we’re already talking about remapping key sequences here. My solution may not be the greatest, but it is at least possible; the guidelines’ own suggestions are unworkably vague.

STB manufacturers are much better off.

The problem seems much easier to solve in STB than in DVD. But the guidelines do a poor job explaining the problem.

Age, gender, race

WGBH has generally good taste in selecting narrators for description (with one glaring counterexample). It is apparent that GBH uses male and female narrators, not all of whom, I infer from their names, are white. None of them sounds particularly old.

Yet it has never gotten its act together on the topic of describing the race of characters. Even so, the guidelines list the voice characteristics of narrators for talking menus variously as:

Some questions:

  1. In what specific and detailed way should producers handle the issue of “race” in selecting narrators, apart from compliance with human-rights and employment legislation?
  2. How are “races” distinguishable by voice?
  3. How can a speech synthesizer produce the “characteristics” of a “race”?

Repetition and looping

The guidelines give inconsistent advice on repeating or looping an audio prompt if the user has made no action. It’s also inconsistent with the behaviour of visual menus in many cases.

Every time Joe makes a selection, the talking menu announces any updated information. It also repeats the basic navigation instructions for using the up, down and Select keys on each menu, though Joe can interrupt it at any time by scrolling to the first menu item.

But elsewhere, we are warned, in contradictory sections:

And, later:

As the user scrolls down the list, successive screens, with the appropriate selection highlighted, appear as the appropriate clip plays. It is important to note that when the screen is currently showing M6 as the selection and the user presses the down button, the most logical behavior is to return to M1 and to play A1.

Remarks, then:

Conscious involvement

He scrolls through choices and when he hears “Play Program” he presses the Select key. Joe settles back to enjoy the film, impressed by the simple but elegant design of the audio navigation feature.

Actually, if Joe did that, he’d encompass the tone of the guidelines themselves – impressed by their simple but elegant design. Aren’t we clever, in other words?

This is really not a complicated issue, but it’s not well discussed in the media-access demimonde. We can’t expect users of captioning, description, subtitling, or dubbing never to be aware they are watching a production with those features, but most of the time we want the accessibility to be immediately and transparently understood without conscious knowledge.

The same holds true for audiovisual navigation. It’s rather new and unexpected, but even within a disc or disc set, after a while the audio navigation should become old hat. You shouldn’t have to be “impressed”; you should just be able to navigate.

Similarly, the following blandishment is not borne out in the real world:

Navigating between and within talking menus should be straightforward. Sighted users expect a graphic interface to be simple, elegant, intuitive and responsive. Users who are blind deserve no less.

In fact, it seems that DVD viewers are tacky as hell and work contrary to their own interests. Why else are so many commercial DVDs saddled with clumsy, repetitive menu systems, complete with full-motion video and sounds you can’t shut off?

It’s not as though the problem is unknown; Don Norman complained about it two years ago.

It’s possible that experts in usability or wayfinding or graphic design or DVD authoring may “expect a graphic interface to be simple, elegant, intuitive and responsive,” but people buying $80 DVD players to watch Jackass: The Movie probably couldn’t spell any of the multi-syllable words in that quoted sentence.

(In the interests of disclosure, I watch Jackass: The Movie on a $240 player.)

Still, an annoying audio interface is probably worse than an annoying graphical interface, if only because audio can bother more people, takes up more time, and is harder to scan. If immediate and transparent usage of a DVD is important for anyone, it’s more important for the blind than for the sighted.

Underestimating the role of words

A well-designed audio-navigation system may not always adhere to the design concept underlying the graphic interface. The audio-navigation interface must take into account the realities of human auditory processing abilities, which are distinct from the means by which we comprehend visual or even tactile information.

In this guideline, WGBH overlooks the fact that the basis of DVD navigation is the word. We’re not “comprehend[ing] visual... information,” like recognizing a face in a crowd or distinguishing a birch from a maple tree. In the majority of cases, we’re reading words and numbers.

Those same words and numbers, perhaps with small modifications, can be read aloud. Then you start to consider the sequential nature of audio interfaces, the difficulty in remembering many articulated choices, and similar differences from visual interfaces.

Linear thinking

Similarly, the guidelines are at pains to emphasize the importance of linear thinking in audio navigation:

The guidelines would have been much clearer if they emphasized functional groupings. Any DVD menu designer will understand the concept: Functions that go together stay together.

Designers know this already. A typical menu will provide various functions for its own purpose (chapter selection, audio or subtitle options, bonus features) plus subnavigation to other parts of the disc. Designers know all about structure. They may not be really good at it all the time, but they know what it is.

It’s a much simpler concept than the guidelines pretend. In fact, WGBH is positively muddled on this topic. Simple advice like “Think structure, not appearance” is documented in a meandering way:

The third step is to create the spoken-menu structure based on the obvious or not-so-obvious list structure embedded in the graphic. For example, imagine a graphic screen with six menu items. What if three of the items are rendered in one color and relate to choice of programming, while the other three are in another color and relate to system preferences? The implication is that this list of six items is actually two lists of three items each. If the audio-navigation system simply allows the user to move from item to item, confusion may set in unless some additional audio cue lets the user know that the six items are actually two groups of three.

We need concrete examples to work with, and they’re readily available in WGBH’s own portfolio. Think of how much stronger the guidelines would have been with fully-explained real-world examples.

The active user

The guidelines seem to suggest that you the developer devise a huge range of new keystrokes, but also assume that listening to voice prompts is inherently taxing for the blind viewer.

As already discussed, people process audio information very differently than they process visual information. The graphic field is two-dimensional – imagine a plane with aspects of height and width. The eye naturally skims freely across the entire visual field, processing both dimensions simultaneously and looking instinctively for intersections between those two dimensions. This is why the eye can take in a grid, such as a program guide, and quickly find the intersection between the time of day and a given channel.

The guidelines go on at great length about “linear lists,” that is, realizing the structure of menu information and adapting it to voice. As the guidelines elsewhere admit, blind viewers are not actually trying to decode a grid visually. By definition we are simplifying the interface for listening.

But the viewer isn’t just sitting there waiting to be spoon-fed. Viewers can do things. DVDs and STBs are interactive. As one example, up-arrow and down-arrow can step through one set of onscreen options, while left-arrow and right-arrow step through a different set from the same screen. It’s so simple it’s actually been done.

Herein lies a crucial difference between the visual and auditory fields. Specifically, the auditory field is nearly one-dimensional.

But, even by using just the arrow keys, the auditory navigation takes on two, three, or four dimensions, not “nearly one.”

The true features of accessible DVD

By far the most severe failing of the guidelines is their failure to be upfront and truthful about what WGBH’s accessible DVDs actually say out loud.

As an example, the guidelines give a nice screenshot of the Chapter Selections menu from Abraham and Mary Lincoln: A House Divided, Disc One, Section I. The guidelines wax poetic about how extremely difficult it is to map two-dimensional graphical menu structures to “nearly-one-dimensional” voice.

[Screenshot: ‘Lincolns’ menu, with chapter selections and global menu]

But it isn’t hard at all. That disc did a good job of it. Throughout Lincolns, the up-arrow and down-arrow keys move you through the menu options of that particular screen, while left-arrow and right-arrow step through the disc’s global options.

Here’s what the disc actually says out loud when you load the pictured menu:

Chapter selections. Part One: Ambition, 1809–1842.

Use the up- and down-arrow keys to scroll through the 18 chapters on this disc. To begin the film from any chapter, use the Select key.

Use the left- and right-arrow keys to scroll through the disc features.

But what do the guidelines say about this actual example?

To the eye, [the options] readily group into two categories of action: the first, seven selections to explore various chapters; the second, four selections at the bottom of the screen that allow the user to switch to different menus, including a return to the main menu. This contextual information is readily available to a sighted user at a glance and it adds, in effect, a second dimension to the list of eleven items....

To build a conceptual understanding of the difference between the first seven items and the last four items, the user who is blind will have to pay very careful attention. And it may be unreasonable for a developer to assume that the user will do so.

In order to mimic visually embedded contextual information in an auditory list, aspects of speech would have to change from one item to the next – timing, volume, emphasis, speed, language and so on. The listener would have to concentrate with great determination in order to connect items based on their spoken attributes, and if the list was long and complex, the listener could easily be overwhelmed.

Actually, as the disc itself proves, it’s not a difficult task. It doesn’t require changing “timing, volume, emphasis, speed, language and so on.” (Note the dodge of “and so on.”) You just have to tell people that two arrow keys work on one set of options and two other keys work on a separate set.

Why are the guidelines unable to accurately describe what WGBH itself has achieved on its own accessible discs? The two-directions technique (up/down vs. left/right) works nicely. It’s quite a success, actually, and shows just how simple it usually is to group menu items for voice prompting.
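The two-directions technique is simple enough to model in a few lines. The sketch below is illustrative Python, not a description of how the Lincolns disc was authored. The class name `TwoAxisMenu`, the item labels, and the wrap-around at the ends of each list are all assumptions.

```python
class TwoAxisMenu:
    """Up/down steps through this screen's items; left/right through global ones."""

    def __init__(self, local_items, global_items):
        self.local_items = local_items
        self.global_items = global_items
        self.local_index = 0
        self.global_index = 0

    def press(self, key):
        """Move along one of the two lists and return the text to be spoken."""
        if key in ("up", "down"):
            step = 1 if key == "down" else -1
            self.local_index = (self.local_index + step) % len(self.local_items)
            return self.local_items[self.local_index]
        if key in ("left", "right"):
            step = 1 if key == "right" else -1
            self.global_index = (self.global_index + step) % len(self.global_items)
            return self.global_items[self.global_index]
        raise ValueError(f"unmapped key: {key}")


# Hypothetical labels: seven chapter entries and four disc-wide features.
menu = TwoAxisMenu(
    [f"Chapter {n}" for n in range(1, 8)],
    ["Main Menu", "Chapter Selections", "Special Features", "Accessibility"],
)
print(menu.press("down"))   # moves within the chapter list
print(menu.press("right"))  # moves within the disc features, chapter position kept
```

Each pair of arrow keys owns one list, so the listener never has to infer the grouping from changes in voice, timing, or volume.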

It’s a success story, but not one that WGBH wants to talk about.

(Interestingly, the menu structures on The Grinch, despite its being a high-profile commercial disc, are even simpler. Could it be that historical documentaries of this sort are the hardest to make work in voice? But even they are not really all that difficult. Is there much to worry about?)

Use the DVD spec

The DVD spec is a serious impediment when it comes to making accessible discs. But some features can be used right away for accessibility.

Consider numbering menu items and announcing at the top how many items are in the list. This strategy helps the user remain oriented.

You actually can number menu items already, though not many authors do it, and even among that minority, only a handful of developers will actually typeset the number of the menu item right there on the graphical menu. In essence, you have to guess.

(Jim Taylor, DVD Demystified, p. 284: “[A] button can be activated by pressing the corresponding number keys on the remote control. Some remotes activate buttons 1 through 9 with a single keypress; others require multiple keypresses.”)
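The guideline itself (number the items and announce the count up front) amounts to very little work on the scripting side. A minimal sketch follows, assuming the prompts are generated from a list of item labels; the function name `spoken_menu` and the labels are hypothetical.

```python
def spoken_menu(title, items):
    """Return the prompt lines: screen title with item count, then numbered items."""
    lines = [f"{title}. {len(items)} items."]
    lines += [f"{number}. {item}" for number, item in enumerate(items, start=1)]
    return lines


for line in spoken_menu("Main menu", ["Play Program", "Chapter Selections", "Accessibility"]):
    print(line)
```

If the spoken number matches the button number, a viewer whose remote supports numeric button activation could jump straight to an item after hearing the list once.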

First-run options

The guidelines are inconsistent about options for the first run or insertion of a disc. (What happens in that case can be separately controlled by the DVD author. You the viewer need to see the copyright warning only once per insertion, for example, if the author set it up that way.)

Some questions:

Accessibility menu

I certainly wish DVD authors would begin to recognize that captions, subtitles, audio descriptions, and dubbing tracks are accessibility features. We are living with inconsistent and unstandardized nomenclature for menus that turn such features on and off.

The WGBH guidelines occasionally suggest a separate screen for accessibility provisions:

One of the most important interface decisions to be made in the early stages of development is how to toggle the audio navigation feature on and off. It’s not a trivial matter, because if the mechanism for doing so is awkward or counter-intuitive, audio navigation may become an irritant rather than an aid.

Generally speaking, there are two ways of enabling or disabling the feature. Developers should always include an enable/disable feature within the menu tree, as shown below in the accessibility options screen.

I note that WGBH’s DVDs for PBS Home Video use an accessibility menu, but of those discs that I own, none contains a foreign-language track; they’re limited to captions, descriptions, and the talking menus. Thus, this is “classical” accessibility. I wonder if WGBH would countenance including subtitles and dubbing tracks (the latter annoyingly euphemized as “languages”) in an Accessibility menu. These specific recommendations need better real-world testing with commercial discs, not just educational or public-broadcasting discs.

Provide a menu choice, on every screen, that allows the user to enter an accessibility preference and settings area. In this area, provide an enable/disable toggle and, if possible, speaking rate and verbosity settings. Also if possible, allow users to save multiple settings, catering to the needs of multiple members of the same household.

I don’t see why it’s necessary for every single screen to include a link to accessibility options. Experienced DVD viewers (sighted ones, at least) are accustomed to seeing such a link on the main menu. Description and talking menus are not so novel and important as to require a new technique.

In any event, the above paragraph contradicts the advice to “include an enable/disable feature within the menu tree, as [in an] accessibility options screen.”

Synthesized voices

For over ten years, WGBH has been talking about using synthesized voice for audio description, and now for audiovisual menus. Their recommendations could use improvement, but are not bad.

The disadvantages of using synthesized voices include...

that people like them less. We’ll always choose a human voice over a machine’s.

Because so much of the relevant technology is currently in development or transition, there is no simple answer to the question of whether or not to use synthesized voices.

I don’t know what that “technology” might be. Mac OS and Windows both include speech, though it’s not very good. Synthesizers are easy to find, and two technologies with surprisingly good voices – Rhetorical and AT&T – could be licensed.

For example, DVD designers seeking to reduce costs could create synthesized voices at their own workstations, and then embed them, as they would a human voice, in the audio-navigation feature included on the disk. This would eliminate the cost of the speaker and of the recording studio. It would also allow the developer to change the content of the audio menus as easily as he or she could alter visual menus and other graphical content.

But the guidelines don’t give you any information at all on how to actually use your own computer-synthesized voice. Through this omission, the guidelines make it seem as though it’s a straightforward thing. But I rather doubt it is. And I would venture that many DVD authors, being a bit geeky, would judge acceptable many kinds of synthetic speech that real users, particularly low-vision people who don’t use speech output in their daily lives, would find hard to understand. (I’m thinking of annoying little mispronunciations and poor prosody, the overall stress or intonation pattern of a sentence.)

[T]he following recommendations can help smooth the transition when it comes to the exclusive use of synthesized voices.

Developers should follow these recommendations when creating audio-navigation prompts and responses:

Note that four of the five recommendations have nothing to do with “the exclusive use of synthesized voices” and only concern human narrators. And the fifth one gives no examples of “design or coding choices that might impede the eventual change to synthesized voices.” I certainly can’t think of any.

Screen readers

The guidelines attempt to provide suggestions for the case where digital video feeds are sent directly to a computer.

Here the guidelines give us a watered-down version of the existing published guidelines for Web accessibility (e.g., Web Content Accessibility Guidelines), which are already so hard to understand I wrote a book about them.

The guidelines’ recommendations are much too sweeping and undetailed. Also, I don’t know of any prototypes of accessible computer interfaces for set-top boxes. (The Telly product lets you control a personal video recorder with a Web browser – the screenshot even shows Mozilla – but I don’t know how accessible it is.)

There’s so much wrong with these recommendations, and so many exceptions to the rule, that one could, in fact, write another book about it. (I have opted not to rebut them in detail.) This stuff is not going to be easy for an interactive-TV developer. Essentially, we’re asking them to become Web developers, too.

Program guide developers who seek to make their products accessible to screen-reading software should follow these rules:

Obviously, because some of these recommendations involve embedding “hidden” text for the benefit of the screen reader – and because these considerations are likely new to the developers who create program guides – some reverse engineering of the basic program guide architecture may be necessary to accommodate them.

“Some reverse engineering... may be necessary”? That’s never a good sign.

I particularly object to the blanket advice to caption, transcribe, and describe multimedia. You don’t need transcripts if you’ve got captions. (It’s easy to publish transcripts from the captions, but they’re not necessary. They aren’t even desirable: Captioning is the correct way to make multimedia accessible to the deaf.) These are not simple processes and even the pros flub them much of the time.

(Also, quickie question: I know we’re actually trying to make the entire interface accessible, but if we’re concentrating on the blind, why are we worrying about captioning?)

Why don’t the guidelines explain how to put all these recommendations into force rather than telling us that they need to be done?

Bit budgets

Here’s a bit of a howler.

Managing the bit budget is a straightforward affair for DVD developers, made only slightly more complicated by the inclusion of an audio-navigation feature.

In fact, in some discs, the bit budget is the biggest headache in the production process.

Remember, there’s no reason to provide audiovisual menus if the rest of the disc doesn’t have audio description. I can imagine a very few exceptions (the video on the disc is of the rare kind that doesn’t need description), but I’ve never encountered such a case. You won’t go to the trouble to create talking menus if the rest of the disc isn’t accessible to blind viewers.

That means the true issue in calculating bit budget is not only the talking menus but the size of the description track. Adding a full stereo audio track is nontrivial in a typical commercial disc. You may very well have to take something else out – if only through a lowered sampling rate for video – to make it fit.
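The arithmetic is easy to run for any given disc. The figures below are assumptions for illustration only: a 192 kbit/s stereo track, a 120-minute feature, and a nominal single-layer capacity of 4.7 billion bytes. The function name `track_bytes` is hypothetical.

```python
def track_bytes(kbit_per_second, minutes):
    """Return the size in bytes of a constant-bitrate audio track."""
    bits = kbit_per_second * 1000 * minutes * 60
    return bits // 8


DVD5_BYTES = 4_700_000_000  # nominal single-layer capacity (assumed figure)

description = track_bytes(192, 120)
share = description / DVD5_BYTES
print(f"{description / 1_000_000:.1f} MB, or {share:.1%} of a single-layer disc")
```

Whether that share is affordable depends entirely on how full the disc already is; on a disc with no headroom, those same bits have to come out of the video or some other feature.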

The bit budget is not “a straightforward affair.” In fact, in the case of the rerelease of T2, the bit budget was a confirmed issue that prevented developers from including the description track from the original disc.

Currently, DVDs come in two sizes, 5 gigabytes and 9 gigabytes. In general, commercial media, such as film and television programs, are released on 9-gigabyte discs while 5-gigabyte discs are more popular among developers of educational and institutional titles. One reason for the difference is that 5-gigabyte discs can be burned at the desktop while 9-gigabyte discs can only be created using more expensive mastering and duplication equipment. They are therefore less practical for small developers. However, either disc size is certainly adequate for accommodating audio navigation, both for the digitized audio clips and for the additional menus and code required to implement the system. [...]

The conclusion is clear: adding a talking menu does increase the total size of the menu system dramatically, but it shouldn’t come close to squeezing out the intended content of the disc. It therefore presents no real barrier to implementation.

In fact, there are so many variations of DVD layers and sides that DVD Demystified uses four pages to describe them (pp. 183–186). Ralph LaBarge, in DVD Authoring & Production, also needs four pages (pp. 10–13). The main variations listed are, in fact, DVD-5, -10, -9 (common and uncommon kinds), -14, and -18. The numbers refer to the gigabytes on the disc (rounded off).

While DVD-5s and -9s are most common in commercial DVD titles, the guidelines have no business categorically stating that “either disc size is certainly adequate for accommodating audio navigation.” Every disc is different, and a lot of them are packed right to the walls with features that authors, or their bosses, absolutely will not sacrifice. And the “conclusion” the guidelines draw is patently false, and stated so forcefully that WGBH must have been overcompensating.
