Copyright © 2021 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
This document summarizes relevant research, then outlines accessibility-related user needs and associated requirements for the synchronization of audio and visual media. The scope of the discussion includes synchronization of accessibility-related components of multimedia, such as captions, sign language interpretation, and descriptions. The requirements identified herein are applicable to multimedia content in general, as well as real-time communication applications and media occurring in immersive environments.
The purpose of this document is to identify and to characterize synchronization-related needs. It does not constitute normative guidance. It may, nevertheless, influence the further development of W3C specifications, including accessibility guidelines and media-related technologies. It may also be applied to the development of multimedia content and applications to enhance accessibility.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This document was published by the Accessible Platform Architectures Working Group as a First Public Working Draft.
To comment, file an issue in the W3C APA GitHub repository. If this is not feasible, send email to public-apa@w3.org. Comments are requested by 5th November 2021.
Publication as a First Public Working Draft does not imply endorsement by the W3C Membership.
This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 15 September 2020 W3C Process Document.
In accessible multimedia content, a variety of resources may be presented concurrently. These resources can include a video track, an audio track, captions, video descriptions, and sign language interpretation of the audio track. To ensure equality of access for all users, including those with a variety of disabilities and associated needs, these concurrent resources should be appropriately synchronized. For example, adequate synchronization of the audio and video tracks is necessary to support users who are hard of hearing, and who rely on lip reading to understand spoken content. Users who have difficulty hearing for situational reasons (e.g., due to a noisy environment) also benefit.
This document addresses the question of what qualifies as sufficient synchronization of the different media resources that may be used in accessible content. The considerations that bear on this question are different depending on the resources involved (audio and video tracks, captions, sign language interpretation, etc.). Likewise, the applicable constraints vary according to whether the multimedia content is presented in real time (as in a video conference or a live event), or prerecorded.
Adequate synchronization benefits all users. Consequently, the research surveyed in this document is of general importance to the quality of multimedia for all user populations. It is especially significant, however, to users with disabilities who need alternative media resources, such as captions, descriptions, and sign language interpretation.
The adequacy of media synchronization can significantly affect the accessibility of content. For example, for a person who uses captions to follow the progression of a video successfully, a correspondence should be maintained between the captions and the visual track (both of which the user is watching concurrently). This can be accomplished by limiting the delay between the spoken dialogue and the presentation of the captions. The issue of what delay should be regarded as acceptable in such a case is addressed in this document with respect to a variety of media resources.
More precisely, one media track can be ahead of or behind another media track by a specific time interval. The problem of adequate synchronization can be understood as that of defining the appropriate tolerances for the relationship between different kinds of media resources, such as audio and video tracks. The purpose of limiting the acceptable time window is to facilitate comprehension of the material. Insufficient synchronization leads to a corresponding loss in comprehension.
Issues of media synchronization are relevant to multiple aspects of Web technology. These aspects include the design and implementation of Web standards for media synchronization (e.g., Timed Text), the authoring tools with which accessible multimedia content is created, and the Web applications through which it is presented to the user. This document can serve as a point of reference for the development of each of these technologies. It can also inform the further evolution of standards for Web accessibility, and indeed for multimedia accessibility in general.
This document is closely related to other publications developed by the W3C's Web Accessibility Initiative. Normative guidance concerning the accessibility of multimedia is given in Web Content Accessibility Guidelines (WCAG) 2.1 [wcag21]. Detailed, non-normative guidance to the accessibility-related aspects of multimedia content is presented in the Media Accessibility User Requirements (MAUR) [media-accessibility-reqs], a document that identifies users' needs and associated solutions.
Synchronized media can occur in immersive environments, including virtual reality and augmented reality. The XR Accessibility User Requirements (XAUR) [xaur] should be consulted for guidance concerning the accessibility of these technologies. Similarly, synchronized media can arise in real-time communication applications, such as remote meeting environments. The accessibility-related user needs and associated system requirements applicable to these applications are considered in RTC Accessibility User Requirements (RAUR) [raur]. The present document should be regarded as complementing each of these publications by examining a specific aspect of media quality and accessibility.
Research in the field of human speech perception underscores the fact that speech perception is routinely bimodal in nature, and depends not only on acoustic cues in human speech but also on visual cues such as lip movements and facial expressions. Due to this bimodality in speech perception, audio-visual interaction becomes an important design factor for multimodal communication systems, such as video telephony and video conferencing (Chen & Rao, 1998).
It has been observed that humans use their sight to assist in aural communication. This has been found to be especially true in helping to separate speech from background noise by providing a supplemental visual information source, which is useful when the listener has trouble comprehending the acoustic speech. Past research has shown that in such situations, even people who are not hard of hearing depend upon such visual cues to some extent (Summerfield, 1992). Access to robust visual speech information has been shown to lead to significant improvement in speech recognition in noisy environments. For instance, one study found that when only acoustic access to speech was available, auditory recognition was near 100% at 0 dB of signal-to-noise ratio (SNR) but fell to under 20% at minus 30 dB SNR. However, this study found that when visual access to the speaker was included, recognition only dropped from 100% to 90% over the same range (Sumby & Pollack, 1954). More recent studies have shown that when sighted people attempt to listen to speech in high-noise audiovisual samples, they fixate more on the mouth of the speaker (Yi, Wong, & Eizenman, 2013) and show stronger synchronization between the auditory and visual motion/motor brain regions (Alho et al., 2014).
A similar reliance on visual cues to help decode speech may also be at work in other situations where the volume of the speaker's voice degrades, such as while listening to a lecture in a large hall. Because light travels much faster than sound, in a face-to-face setting a person will see a speaker’s lips and facial gestures before the sound of the speaker’s voice arrives. In a normal in-person conversation this difference is negligible. However, as the distance increases, as when a student listens to an instructor in a classroom, this time lag increases. For instance, at 22 feet, the difference is roughly 20 ms. At the same time, the perceived volume of a speaker’s voice drops with the distance traveled, which means the listener will rely more on visual cues. Indeed, experimental research has demonstrated that the ability to comprehend speech at increasing distances is improved when both audio and visual speech are available to the listener (Jordan & Sergeant, 2000). Such findings suggest that robust video, synchronized with the audio of speakers in virtual environments, is likely to increase speech comprehension for hard of hearing listeners.
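The figure of roughly 20 ms at 22 feet can be checked with a short calculation. The following is a minimal sketch, assuming a speed of sound of 343 m/s (dry air at about 20 °C) and treating the travel time of light as zero. The function name `acoustic_lag_ms` and both constants are illustrative and are not taken from the cited research.

```python
# Assumption: speed of sound in dry air at about 20 degrees Celsius.
SPEED_OF_SOUND_M_PER_S = 343.0
FEET_TO_METRES = 0.3048


def acoustic_lag_ms(distance_feet: float) -> float:
    """Return how long sound takes to travel distance_feet, in milliseconds.

    The travel time of light over the same distance is treated as zero,
    so the result approximates how far the voice trails the lip movements.
    """
    distance_m = distance_feet * FEET_TO_METRES
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0


# Conversation distance, a classroom, and a larger lecture hall.
for feet in (3, 22, 50):
    print(f"{feet:>3} ft: audio trails video by about {acoustic_lag_ms(feet):.1f} ms")
```

Under these assumptions the lag grows linearly with distance, which is why it is negligible in conversation but measurable across a lecture hall.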
One important concern in the audiovisual integration of speech is how closely the audio and visual events are synchronized. Given that the observable facial movement for phoneme production can precede acoustic information by 100–200 ms, the temporal order of both the sensory input and electrophysiological effects suggests that visual speech information may provide predictions about upcoming auditory input. This likely explains why research has found that test subjects are less likely to notice a minor asynchrony in the audiovisual presentation of human speech when the audio lags than when the audio signal arrives first (Peelle & Sommers, 2015). As a result, several standards bodies have attempted to set synchronization specifications for audiovisual broadcasting, which typically provide a +/- threshold in which the tolerance for audio lag is much less restrictive than that for video lag. Typically, these thresholds are more restrictive for higher quality signals, such as those for digital high definition television broadcasting. For example, the recognized industry standards adopted by the ATSC Implementation Subcommittee and the DSL Forum, as well as ITU-T Recommendation G.1080, all include an audio/video delay threshold between plus 15 ms and minus 45 ms (Staelens et al., 2012). This means that having the audio arrive up to 45 ms after the video is considered acceptable, but having the audio signal arrive more than 15 ms before the video is objectionable. Consistent with this approach, the EN 301 549 information and communication technology public procurement standard [en-301-549] specifies a maximum time difference of 100 ms between the audio and the video, noting that the decline in intelligibility is greater if the audio is ahead of, rather than behind, the video track.
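An asymmetric threshold of this kind can be expressed as a simple window test. The sketch below is illustrative only: the class `SyncWindow`, its field names, and the sign convention (positive offsets mean the audio is ahead of the video) are assumptions made for this example, and the window values are the plus 15 ms and minus 45 ms figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncWindow:
    """Asymmetric tolerance for the audio/video offset, in milliseconds."""

    max_audio_lead_ms: float  # audio ahead of video
    max_audio_lag_ms: float   # audio behind video

    def accepts(self, audio_offset_ms: float) -> bool:
        """Positive offset: audio ahead of video. Negative: audio behind."""
        return -self.max_audio_lag_ms <= audio_offset_ms <= self.max_audio_lead_ms


# Values from the broadcast threshold described in the text.
BROADCAST_WINDOW = SyncWindow(max_audio_lead_ms=15.0, max_audio_lag_ms=45.0)

for offset in (-60, -45, 0, 15, 30):
    verdict = "acceptable" if BROADCAST_WINDOW.accepts(offset) else "out of tolerance"
    print(f"{offset:+4d} ms: {verdict}")
```

Keeping the two limits as separate fields makes the asymmetry explicit, so other windows (for instance, the tolerances discussed for degraded audio) could be represented by the same structure with different values.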
However, it is important to note that most audiovisual media in everyday life does not meet the capabilities of high definition television. Further, most experimental studies that have examined the synchronization of lip video with audio have been conducted with standard video recording capabilities, which are typically limited to a frame rate of 25 frames per second (fps). At 25 fps, there is one frame every 40 ms, and as a result it becomes impossible to test synchronization errors below this time threshold (Ivanko et al., 2018). Studies using high quality audiovisual content at much faster frame rates have shown that a lip synchronization mismatch of 20 ms or less is imperceptible (Firestone, 2007). However, studies of speech intelligibility in which audio quality is degraded so as to simulate age-related hearing loss have shown that when the audio signal leads the video, intelligibility declines appreciably for even the shortest asynchrony of 40 ms, but when the video signal leads the audio, intelligibility remains relatively stable for onset asynchronies up to 160–200 ms (Grant & Greenberg, 2001). These findings suggest that, from an accessibility perspective, the audio signal should not be ahead of the video by more than 40 ms, and the video should not be ahead of the audio by more than 160 ms. However, an offset of less than 160 ms is desirable, because a delay of this size would be detectable and potentially objectionable to a percentage of the population, even though it would not present an accessibility barrier as such.
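The relationship between frame rate and the smallest testable synchronization error is simple arithmetic. The following sketch assumes, as the paragraph above does, that an offset shorter than one frame interval cannot be tested; the function names are hypothetical, and the two limits are the 40 ms and 160 ms values suggested above.

```python
# Limits suggested in the text from an accessibility perspective.
MAX_AUDIO_LEAD_MS = 40.0
MAX_VIDEO_LEAD_MS = 160.0


def frame_interval_ms(fps: float) -> float:
    """Duration of one video frame in milliseconds."""
    return 1000.0 / fps


def min_fps_to_resolve(offset_ms: float) -> float:
    """Lowest frame rate whose frame interval is no longer than offset_ms."""
    return 1000.0 / offset_ms


def frames_spanned(offset_ms: float, fps: float) -> float:
    """Express an offset as a number of frames at the given frame rate."""
    return offset_ms / frame_interval_ms(fps)


print(f"Frame interval at 25 fps: {frame_interval_ms(25):.0f} ms")
print(f"Frame rate needed to test a 20 ms offset: {min_fps_to_resolve(20):.0f} fps")
print(f"Audio lead limit at 25 fps: {frames_spanned(MAX_AUDIO_LEAD_MS, 25):.0f} frame(s)")
print(f"Video lead limit at 25 fps: {frames_spanned(MAX_VIDEO_LEAD_MS, 25):.0f} frame(s)")
```

At 25 fps the suggested audio lead limit corresponds to a single frame, which illustrates why material recorded at that rate cannot reveal whether a shorter audio lead would also reduce intelligibility.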
While the use of closed captions in both live and prerecorded video has become widespread, the use of a human signer to provide interpretation of spoken content in media is not nearly as prevalent. In some cases, broadcasters have argued that captioning is more cost effective and reaches a larger audience of users, such as hard-of-hearing and late-deafened individuals who are not literate in sign language. However, the Deaf community has long advocated for increased availability of sign language interpretation as better meeting their access needs (Bosch-Baliarda, Soler-Vilageliu & Orero, 2020). And while significant research and development work has been directed toward automated sign language translation using computer-generated signing avatars, this work is still behind the current state of automated speech recognition captioning technology (Bragg et al., 2019).
Because sign languages have their own grammars, which are not necessarily aligned to the written form of the associated spoken language, it is not possible to provide a word-by-word rendering as is done with captioning, and thus uniform synchronization with spoken audio is not possible. Indeed, in practice a sign language interpreter will often need to wait a few seconds to gain an understanding of more complete spoken phrasing before starting to interpret in sign. The amount of onset time lag may vary widely depending upon the particular spoken source language and the particular target sign language.
In a 1983 study by Cokely, researchers found that an increased lag time actually enhanced the overall comprehension of the spoken dialogue and allowed the sign language interpreter to convey a more accurate rendering of what was spoken (Cokely, 1986). In this study, it was found that the number of translation errors (i.e., various types of translation miscues) decreases as the lag time of the interpreters increases. For example, the interpreters in the study with a 2-second lag time had more than twice the total number of miscues of the interpreters with a 4-second lag, who in turn had almost twice as many miscues as those with a 6-second lag. The researchers cautioned, however, that this does not mean there is no upper limit to lag time, and reasoned that there is likely a lag time threshold beyond which the number of translation omissions would significantly increase because the threshold is at the upper limits of the individual's short-term working memory. Nonetheless, the findings of this study indicate that providing close synchronization of sign language interpretation to what is being spoken may be counterproductive. In this case, some users may prefer a balance between the need for immediacy in remote meetings and the need for accuracy, while others may prefer the greatest accuracy possible even at the expense of immediacy.
Video description (sometimes referred to as “audio description”) typically adds spoken narration of important visual elements in video streams such as TV programs and movies. Beginning in the early 1990s, the ability to transmit and receive audio descriptions in TV programming over a Separate Audio Program (SAP) channel became available (Cronin & King, 1990). Video description is most commonly applied to prerecorded media, although its application to live events (especially the performing arts) is also a common use case (Di Giovanni, 2018).
One of the most difficult considerations in creating video description is the need to avoid conflicts with the primary speech dialogue. In prerecorded media, a complete transcript of the spoken dialogue with timings is commonly loaded into video description editing software, although in some cases a simple spreadsheet may be used for smaller projects. The next step is typically the identification and inclusion of onscreen events, music and sound effect cues in the media time stream. Editing software may apply an algorithm that calculates the ideal duration of the description entered into the available open time slot, as well as the minimum and maximum tolerated deviation from the ideal reading rate. To do so, the algorithm must be given the reading rate values on which to base its calculation (Jankowska et al., 2017). In practice, the description of an on-scene event may need to begin several seconds before the event occurs to avoid audio conflicts.
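The kind of calculation described above can be illustrated with a minimal sketch. It is not the algorithm used by any particular editing tool: the names `ReadingRates`, `description_durations` and `fits_gap` are hypothetical, the rate is assumed to be measured in characters per second, and the values in `ASSUMED_RATES` are placeholders rather than figures from the cited study.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReadingRates:
    """Reading rates in characters per second (assumed unit)."""

    ideal_cps: float
    min_cps: float  # slowest tolerated delivery
    max_cps: float  # fastest tolerated delivery


# Placeholder values for illustration only.
ASSUMED_RATES = ReadingRates(ideal_cps=15.0, min_cps=12.0, max_cps=18.0)


def description_durations(text: str, rates: ReadingRates) -> Tuple[float, float, float]:
    """Return (shortest, ideal, longest) speaking time in seconds for text."""
    n = len(text)
    return n / rates.max_cps, n / rates.ideal_cps, n / rates.min_cps


def fits_gap(text: str, gap_seconds: float, rates: ReadingRates = ASSUMED_RATES) -> str:
    """Classify whether a description fits an open slot between dialogue."""
    shortest, ideal, _longest = description_durations(text, rates)
    if ideal <= gap_seconds:
        return "fits at ideal rate"
    if shortest <= gap_seconds:
        return "fits only if read faster than ideal"
    return "too long for this gap"


description = "She closes the door and looks out of the window."
for gap in (4.0, 3.0, 2.0):
    print(f"{gap:.1f} s gap: {fits_gap(description, gap)}")
```

A describer or editor could use such a check to decide whether to shorten the text or to start the description earlier, in a preceding open slot.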
Live events, however, are more difficult to manage than prerecorded media due to the spontaneity of an event happening in real time. While video description can often be scripted during rehearsals of performances and thus made available in real time during a performance, adding video description to a live event which is not rehearsed is much more difficult because there is no pre-event information as to the availability and duration of open slots in the live audio stream. One method used to address this scenario in the broadcasting of live events is “near real-time” video description (Boyce et al., n.d.). In near real-time broadcasting, the live event is recorded and transmission is typically delayed by 10 to 60 seconds. This allows time for the system to look ahead and analyze the upcoming portion of the video for silent periods, and provides a brief span of time for the describer to insert the narration. While near real-time video description may work well in mainstream media broadcasts, it would be impractical for online participatory events, meetings and group discussions. In such cases, the generally accepted best practice is for participants always to describe visual aspects of content, or on-camera actions they may be demonstrating, as closely in sync as possible with the visual information.
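The look-ahead step can be sketched as a search for quiet runs in the buffered audio. The example below is a simplified illustration, not a description of any broadcaster's system: it assumes the buffered audio has already been reduced to one loudness value per fixed-length frame, and the function name `find_silent_gaps`, the threshold, and the sample values are all hypothetical.

```python
from typing import List, Tuple


def find_silent_gaps(
    levels: List[float],
    frame_s: float,
    threshold: float,
    min_gap_s: float,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of quiet runs in a delay buffer.

    levels holds one loudness value per frame of frame_s seconds. A frame
    counts as quiet when its level is below threshold, and a run is reported
    only if it lasts at least min_gap_s seconds.
    """
    gaps: List[Tuple[float, float]] = []
    start = None
    # A non-quiet sentinel frame closes any run that reaches the buffer end.
    for i, level in enumerate(levels + [threshold]):
        quiet = level < threshold
        if quiet and start is None:
            start = i
        elif not quiet and start is not None:
            if (i - start) * frame_s >= min_gap_s:
                gaps.append((start * frame_s, i * frame_s))
            start = None
    return gaps


# Hypothetical buffer: one level per 0.5 s frame of delayed audio.
buffered = [0.9, 0.1, 0.1, 0.1, 0.8, 0.05, 0.7, 0.0, 0.0]
for begin, end in find_silent_gaps(buffered, frame_s=0.5, threshold=0.2, min_gap_s=1.0):
    print(f"open slot from {begin:.1f} s to {end:.1f} s")
```

The longer the transmission delay, the more frames the buffer holds and the more candidate slots the describer can choose from, at the cost of immediacy.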
Do users' needs and the acceptable synchronization tolerances applicable to immersive environments (e.g., virtual reality, augmented reality, and 360-degree video) differ from those encountered in multimedia in general? What research, if any, has been undertaken into such differences between synchronization in immersive and non-immersive media? If there are specific synchronization issues relevant to immersive environments, they will be documented in this section. The Research Questions Task Force and the Accessible Platform Architectures Working Group invite comments regarding these issues to inform further development of the draft.