Synchronization Accessibility User Requirements

Abstract

This document summarizes relevant research, then outlines accessibility-related user needs and associated requirements for the synchronization of audio and visual media. The scope of the discussion includes synchronization of accessibility-related components of multimedia, such as captions, sign language interpretation, and descriptions. The requirements identified herein are applicable to multimedia content in general, as well as real-time communication applications and media occurring in immersive environments.

The purpose of this document is to identify and to characterize synchronization-related needs. It does not constitute normative guidance. It may, nevertheless, influence the further development of W3C specifications, including accessibility guidelines and media-related technologies. It may also be applied to the development of multimedia content and applications to enhance accessibility.

Issues and Opportunities Identified in the Literature

Lip Reading Use Case Synchronization

Research in the field of human speech perception underscores the fact that speech perception is routinely bimodal in nature, and depends not only on acoustic cues in human speech but also on visual cues such as lip movements and facial expressions. Due to this bimodality in speech perception, audio-visual interaction becomes an important design factor for multimodal communication systems, such as video telephony and video conferencing. (Chen & Rao, 1998)

It has been observed that humans use their sight to assist in aural communication. This has been found to be especially true in helping to separate speech from background noise by supplying a supplemental visual information source, which is useful when the listener has trouble comprehending the acoustic speech. Past research has shown that in such situations, even people who are not hard of hearing depend upon such visual cues to some extent (Summerfield, 1992). Access to robust visual speech information has been shown to lead to significant improvement in speech recognition in noisy environments. For instance, one study found that when only acoustic access to speech was available, auditory recognition was near 100% at 0 dB of signal-to-noise ratio (SNR) but fell to under 20% at minus 30 dB SNR. However, this study found that when visual access to the speaker was included, recognition only dropped from 100% to 90% over the same range (Sumby & Pollack, 1954). More recent studies have shown that when sighted people are attempting to listen to speech in high noise audiovisual samples, the result is greater visual fixations on the mouth of the speaker (Yi, Wong, & Eizenman, 2013) and stronger synchronizations between the auditory and visual motion/motor brain regions (Alho et al., 2014).

A similar reliance on visual cues to help decode speech may also be at work in other instances where the volume of the speaker's voice begins to degrade, such as while listening to a lecture in a large hall. Because light travels at a much higher speed than sound, in a face-to-face setting a person will see a speaker's lips and facial gestures sooner than the sound of the speaker's voice arrives. In a normal in-person conversation this difference is negligible. However, as the distance increases, such as when a student listens to an instructor in a classroom, this time lag will increase. For instance, at 22 feet, this difference is roughly 20 ms. At the same time, the listener's perceived volume of a speaker's voice drops with the distance traveled, which means the listener will rely more on visual cues. Indeed, experimental research has demonstrated that the ability to comprehend speech at increasing distances is improved when both audio and visual speech are available to the listener (Jordan & Sergeant, 2000). Such findings suggest that robust synchronized video along with the audio of speakers in virtual environments is likely to increase speech comprehension for hard of hearing listeners.
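
The figure of roughly 20 ms at 22 feet can be checked with a short calculation. The Python sketch below assumes a speed of sound of about 343 m/s (dry air at roughly 20 °C), a value the text does not state, and treats the travel time of light as zero. The function and constant names are illustrative only.

```python
# Assumption: speed of sound in dry air at about 20 degrees Celsius.
SPEED_OF_SOUND_M_PER_S = 343.0
METERS_PER_FOOT = 0.3048


def acoustic_delay_ms(distance_feet: float) -> float:
    """Return how long sound takes to travel the distance, in milliseconds.

    Light covers the same distance in a negligible time, so this value
    approximates how much later the voice arrives than the lip movements.
    """
    distance_m = distance_feet * METERS_PER_FOOT
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0


if __name__ == "__main__":
    for feet in (3, 22, 60):
        print(f"{feet:>2} ft: {acoustic_delay_ms(feet):.1f} ms")
```

Under these assumptions the delay at 22 feet comes out just under 20 ms, which agrees with the figure quoted above.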

One important concern in audiovisual integration of speech audio and visual information is how closely these events are synchronized. Given that the observable facial movement for phoneme production can precede acoustic information by 100–200 ms, the temporal order of both the sensory input and electrophysiological effects suggests that visual speech information may provide predictions about upcoming auditory input. This fact likely explains why research has found that test subjects are less likely to notice minor auditory lags in audiovisual presentation of human speech than cases in which the audio signal arrives first (Peelle & Sommers, 2015). As a result, several standards bodies have attempted to set synchronization specifications for audiovisual broadcasting, which typically provide a +/- threshold where the limit on audio lag is much less restrictive than the limit on video lag. Typically, these thresholds are more restrictive for higher quality signals, such as those for digital high definition television broadcasting. Case in point, the recognized industry standards adopted by the ATSC Implementation Subcommittee and the DSL Forum, as well as ITU-T Recommendation G.1080, all include an audio/video delay threshold between plus 15 ms and minus 45 ms (Staelens et al., 2012). This means that having the audio arrive up to 45 ms after the video is considered acceptable, but having the audio signal arrive more than 15 ms before the video is objectionable. Consistent with this approach, the EN 301 549 information and communication technology public procurement standard [[en-301-549]] specifies a maximum time difference of 100 ms between the audio and the video, noting that the decline in intelligibility is greater if the audio is ahead of, rather than behind, the video track.
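
The asymmetric window described above can be expressed as a simple range check. The following Python sketch is illustrative only: the sign convention (positive values mean the audio arrives before the video) and the function name are assumptions made for the example, while the default limits are the plus 15 ms and minus 45 ms figures quoted from Staelens et al.

```python
def av_offset_acceptable(
    audio_lead_ms: float,
    max_audio_lead_ms: float = 15.0,
    max_audio_lag_ms: float = 45.0,
) -> bool:
    """Check an audio/video offset against an asymmetric tolerance window.

    Assumed convention: audio_lead_ms > 0 means the audio arrives before
    the video; audio_lead_ms < 0 means the audio arrives after the video.
    """
    return -max_audio_lag_ms <= audio_lead_ms <= max_audio_lead_ms


if __name__ == "__main__":
    for offset in (-60, -45, 0, 15, 20):
        print(offset, av_offset_acceptable(offset))
```

Passing different limits, for example a symmetric 100 ms on each side, lets the same check model other specifications without changing the logic.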

However, it is important to note that most audiovisual media in everyday life does not meet the capabilities of high definition television. Further, most experimental studies which have attempted to examine issues around lip video synchronization with audio have been conducted with standard video recording capabilities which are typically limited to a frame rate of 25 frames per second (fps). At 25 fps, there will be one frame every 40 ms, and as a result it becomes impossible to test synchronization errors below this time threshold (Ivanko et al, 2018). Studies using high quality audiovisual content at much faster frame rates have shown that lip synchronization mismatch of 20 ms or less is imperceptible (Firestone, 2007). However, studies conducted on speech intelligibility when audio quality is degraded in such a way as to simulate age-related hearing loss have shown that when the audio signal leads the video, intelligibility declines appreciably for even the shortest asynchrony of 40 ms, but when the video signal leads the audio, intelligibility remains relatively stable for onset asynchronies up to 160 - 200 ms (Grant & Greenberg, 2001). These findings suggest that, from an accessibility perspective, the audio signal should not be ahead of the video by more than 40 ms, and the video should not be ahead of the audio by more than 160 ms. However, less than 160 ms offset is desirable due to the fact that this much of a delay would be detectable and potentially objectionable to a percentage of the population, even though it would not present an accessibility barrier as such.
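
The relationship between frame rate and the smallest testable synchronization error is simple arithmetic, sketched below in Python. The function names are hypothetical, and the rule that an asynchrony must span at least one full frame interval to be testable is a simplification of the limitation described above.

```python
def frame_interval_ms(frames_per_second: float) -> float:
    """Return the time between consecutive video frames in milliseconds."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return 1000.0 / frames_per_second


def is_measurable(asynchrony_ms: float, frames_per_second: float) -> bool:
    """Simplifying assumption: an asynchrony shorter than one frame
    interval cannot be presented, and therefore cannot be tested."""
    return asynchrony_ms >= frame_interval_ms(frames_per_second)


if __name__ == "__main__":
    print(frame_interval_ms(25))   # one frame every 40 ms
    print(is_measurable(20, 25))   # a 20 ms mismatch is below that step
    print(is_measurable(20, 100))  # a faster frame rate can represent it
```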

Caption Synchronization

Captions used for accessibility purposes (more commonly known as "subtitles" in some countries) have been in common usage in the broadcast industry for several decades. Some of the critical issues related to captioning which have been examined in research include the caption rate, the quality of caption text (including aspects such as caption text accuracy, verbatim vs. edited captions, identification of multiple speakers, and the use of punctuation and capitalization), as well as the synchronization of caption text with audio and visual information.

Caption Rate

Caption rate has been a major topic for the broadcast industry. In a White Paper published by BBC Research & Development (Sandford, 2015), the author summarized the various guidelines in use among broadcasters, which often include both optimal and maximum rates for captions. Figures of approximately 140 words per minute (WPM) as the optimum subtitle (i.e., caption) rate, and around 180-200 WPM as the maximum rate, were found to be common. However, the conclusion of the author was that the guidelines examined "fail to cite research supporting these figures but justify them by stating that above these rates, subtitles will be difficult to follow." Sandford further noted that previous research has shown that reading comprehension of captions remains fairly stable up to at least a rate of 230 WPM, which seemed to call into question the maximum rates used in most guidelines to that point in time, and served as the impetus for new research conducted by the BBC.
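
For readers unfamiliar with the unit, the Python sketch below shows one way a caption rate in WPM might be computed. Counting whitespace-separated tokens as words is an assumption made for the example; the guidelines summarized above may count words differently, and the function name is hypothetical.

```python
def words_per_minute(caption_text: str, duration_seconds: float) -> float:
    """Return the caption rate in words per minute.

    Assumption: a "word" is any whitespace-separated token.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    word_count = len(caption_text.split())
    return word_count * 60.0 / duration_seconds


if __name__ == "__main__":
    clip = "word " * 90  # a stand-in for 90 words of caption text
    print(words_per_minute(clip, 30))  # 90 words in 30 seconds
```

A 30-second clip containing 90 words works out to 180 WPM, the lower end of the maximum range cited in the guidelines.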

The BBC research study on caption rates was conducted in two phases. The first phase of the study included video clips which were purposefully created for the study, where BBC reporters attempted to recreate broadcast quality news pieces on the same topic, but re-scripted each clip in the study so that it included more or fewer words spoken over the same 30 second period of time. In this way, a range of WPM caption rates was created while all other aspects of the clip remained the same, except that two types of captions were created, one with scrolling captions and one with block captions. This series of clips at different rates was then shown to test subjects, who were divided into two main groups. One group of testers included deaf and hard-of-hearing viewers who viewed the video clips with captions (both scrolling and block), while a comparison group of hearing viewers viewed the same series of clips without any captions at all. The purpose of the comparison group of hearing viewers was to help gauge how much of the impact on perceived good and bad rates of captions may be due to how quickly the speaker is talking, as opposed to how fast the words appear in captions.

Their results for this phase of the study showed that the range of rates between what subjects considered "too fast" and "too slow" was widest for block subtitles and narrowest for speech alone. The analysis revealed that:

The average rate of clips perceived as "slow" came in at 112 WPM for block captions, 115 WPM for scrolling captions, and 121 WPM for speech alone.
The optimal "good" rate averaged 177 WPM for block captions, 171 WPM for scrolling captions, and 170 WPM for speech alone.
The average rate of clips perceived as "fast" came in at 242 WPM for block captions, 227 WPM for scrolling captions, and 219 WPM for speech alone.

However, the researchers concluded that overall similarity between all of the results demonstrated that the WPM rate of the caption text was not an independent factor for the study subjects' perception of rates that are too slow or too fast. Generally speaking, it was found that when the rate of speech was perceived as too fast or too slow by hearing viewers, this same range of rates for caption text was likely to be similarly perceived as too fast or too slow by viewers who were deaf and hard of hearing. Indeed, if anything, this set of data seems to suggest that--at least among the sample of subjects in this study--hearing viewers are more critically attentive to word rates in spoken audio than deaf and hard-of-hearing users are for word rates in captioned text for the same content. Overall, the researchers concluded that, "We found no problems associated with the rate of subtitles when they matched natural speech, regardless of the rate in words per minute."

During the second phase of the BBC study, researchers collected a number of sample clips from eight different examples of television programming "in the wild" which were above a 200 WPM rate, and presented them only to the deaf and hard-of-hearing study subjects. The expectation based on phase one of the study was that these higher rate clips would be more likely to be perceived as faster than optimal. However, the results showed that this was not the case, and that the mean perceived rates for all clips were closer to "good" and well under the "fast" rate than would have been predicted based on the findings of phase one of the study. Nonetheless, there were some telling distinctions in perception ratings based upon the type of programming. One case in point can be made by comparing ratings for two television episode clips which were nearly identical in word rate, but had a relatively wide spread in perception of how close they were to an optimal "enjoyable" rate, based on study subjects' numeric scores on a Likert scale. In this comparison, a clip from the talk show Top Gear with a rate of 256 WPM received a Likert scale score of 3.40 for an enjoyable rate (where 5 would be considered at the top of the "enjoyable" word rate scale), while a clip from the cooking show Kitchen with an almost identical rate of 259 WPM received a lower Likert scale score of 2.34--more than a full point below the Top Gear clip, and lower than any other clip among the eight television episodes reviewed in the study.

While the BBC researchers in this study did not interview subjects to get additional qualitative details on subjects' ratings for individual clips, one likely conclusion is that the perception of a good or enjoyable rate for captions is tied to some extent to the type of content, and the way the viewer may plan to use the information gleaned from watching it. Whereas a typical viewer of Top Gear is likely to be more interested in the general entertainment value of watching the show, someone viewing a cooking show is more likely to be interested in actually using the information by cooking the dish being prepared on the screen. In the latter case, the viewer will often be very interested in specific details about ingredients, amounts and the cooking process. This consideration is likely very important in the context of educational video programming, in particular, where the desire is that students will comprehend and retain key facts from video programming. And although the researchers did not study these "in the wild" samples with hearing viewers, the findings from phase one of their study would logically point back to the underlying issue of speech rate in the audio stream to begin with, and suggest that media producers should be careful to regulate the speed of speakers in media materials, especially when the voice content has a high information density.

Another study by the BBC aimed to measure the perceived quality of television captions based on guidelines for measuring perceived audio quality. The objective was to estimate the relative impact of reduced delay in the appearance of live captions vs an increase in accuracy. Participants were regular users of captions, but were not asked to disclose their hearing ability. Reduced delay in the presentation of word-by-word captions was more strongly associated with a perception of improved quality for participants watching with sound on than for those with sound turned off. On the other hand, improvement in caption accuracy (namely, a lower word error rate) was significantly associated with a perception of improved quality only among participants who viewed the material without sound (Armstrong, 2013).

Captions in Live Media

There are three main transcription methods for live captions: ASR (Automatic Speech Recognition), ASR with revoicing, and STTR (Speech To Text Reporting). ASR is a fully automated transcription process where a computer converts the speech into text. ASR with revoicing utilizes a human intermediary who repeats everything that is spoken to an ASR system that is trained to their voice. STTR involves specially trained typists phonetically transcribing what is said using chord keyboards. The two main types of these keyboards are Palantype and Stenotype, of which Stenotype is the more common. Cost and accuracy both increase relative to how much of the work is done by a human.

Captions in live media and remote meetings will be inherently delayed due to the necessary time lag for speech to be transcribed into text captions. Even automated captions produced by ASR systems require some amount of time to process human speech into text which then must be integrated into the video stream. The use of human transcribers to create captions, while typically resulting in captions of much greater accuracy, will usually dictate an even greater latency between the sound of the speaker's voice and the displayed captions. The understanding of what is considered an "acceptable" amount of time delay will often hinge on the type of live media, and what is considered the proper level of transcription accuracy. For instance, in the case of live speeches and broadcast entertainment, media outlets have adopted caption latency standards ranging from a target as short as 3 seconds to as long as "less than 10 seconds" -- the latter case "reflecting a greater emphasis on ensuring that spelling and punctuation are correct" (Mikul, 2014). The conclusion of Mikul is that a target latency of 5 seconds is appropriate and achievable in most cases for live broadcast media, and that this target applies to the average time lag over the length of the program.
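
Because the 5-second target applies to the average lag over a whole program rather than to each caption, it can be checked as a mean. The Python sketch below is illustrative: the input format (pairs of spoken and displayed times in seconds) and the function names are assumptions, and only the 5-second default comes from the conclusion attributed to Mikul.

```python
from statistics import mean

# Target average latency for live broadcast media, per Mikul (2014).
TARGET_MEAN_LATENCY_S = 5.0


def mean_caption_latency(timings):
    """Return the average delay between speech and caption display.

    timings: pairs of (spoken_at_seconds, displayed_at_seconds);
    this input format is assumed for the example.
    """
    return mean([displayed - spoken for spoken, displayed in timings])


def meets_target(timings, target_s: float = TARGET_MEAN_LATENCY_S) -> bool:
    """True when the program-wide average latency is within the target."""
    return mean_caption_latency(timings) <= target_s


if __name__ == "__main__":
    program = [(0.0, 4.0), (10.0, 16.0), (20.0, 25.0)]
    print(mean_caption_latency(program), meets_target(program))
```

In this example one caption is 6 seconds late, yet the program still meets the target because the average is 5 seconds.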

However, caption time lag in remote meetings must be considered in a different light, as the participatory nature of meetings dictates that the immediacy of captioned text must take some degree of precedence over spelling and punctuation accuracy. While both accuracy and immediacy are vital criteria in any setting, having captions delayed for an inordinate amount of time during a remote meeting puts the deaf or hard of hearing meeting participant at a significant disadvantage during a fast-moving discussion. To better address the need for immediacy of captions in remote meetings, most popular online meeting platforms have integrated automatic captioning utilizing Automatic Speech Recognition (ASR).

The accuracy of ASR can vary greatly due to the influence of factors that tend to introduce recognition errors. These factors include speaker variability, reflecting for example illness, fatigue or emotional state, differences of dialect and accent, as well as discrepancies between the audio characteristics of the speech samples used to train the system and the speech which is to be recognized, such as the presence of background noise (Errattahi, El Hannani, & Ouahmane, 2018). Nevertheless, the accuracy of ASR systems has markedly improved in recent years. Indeed, very recent studies have demonstrated that, under favorable conditions, the best ASR systems can rival human accuracy on the average while also decreasing captioning latency to well below the typical human captioning ability. A 2020 study comparing ASR-based captioning systems revealed that the Google enhanced API had a stable-hypothesis latency (the time between the utterance of a word and the output of correct text) of only 0.761 seconds, while maintaining a Word Error Rate (WER) of only 0.06 (Jiline et al, 2020). The authors then compared this to an average latency of 4.2 seconds for human based captioning and a WER between 0.04 and 0.09, based on generalized results from multiple academic sources. While Google's enhanced ASR API was by far the best in this study comparison, its performance illustrates the growing capability of machine learning to enhance the ability of remote meeting platforms to provide accurate captions in a timely manner.
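
Word Error Rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference transcript, divided by the number of words in the reference. The Python sketch below computes it with a word-level edit distance. It is a minimal illustration of that conventional definition, not the scoring procedure used by Jiline et al., and it assumes that case and punctuation have already been normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length.

    Words are whitespace-separated tokens; no normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the quick brown fox jumps",
                          "the quick brown box jumps"))
```

On this scale, a WER of 0.06 corresponds to about six word errors per hundred words of reference transcript.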

The Task Force plans to provide here a more adequate characterization of the trade-off between caption latency and the accuracy of automatic speech recognition systems, compared with that of human transcription. Review and comments on this point are invited.

The Task Force is also aware of recent decisions by the U.S. Federal Communications Commission granting conditional certification for ASR generated captions in Internet telephony. See Federal Communications Commission (2018) for the legal background to these decisions. This and similar research or regulatory activities will be of interest for subsequent drafts of this document.

Sign Language Interpretation Synchronization

While the use of closed captions in both live and prerecorded video has become widespread, the use of a human signer to provide interpretation of spoken content in media is not nearly as prevalent. In some cases, broadcasters have argued that captioning is more cost effective and reaches a larger audience of users, such as hard-of-hearing and late-deafened individuals who are not literate in sign language. However, the Deaf community has long advocated for increased availability of sign language interpretation as better meeting their access needs (Bosch-Baliarda, Soler-Vilageliu & Orero, 2020). And while significant research and development work has been directed toward automated sign language translation using computer-generated signing avatars, this work is still behind the current state of automated speech recognition captioning technology (Bragg et al., 2019).

Because sign languages have their own grammars, which are not necessarily aligned to the written form of the associated spoken language, it is not possible to provide a word-by-word rendering as is done with captioning, and thus uniform synchronization with spoken audio will not be possible. Indeed, in practice a sign language interpreter will often need to wait for a few seconds to allow for an understanding of more complete spoken phrasing before starting to interpret in sign. The amount of onset time lag may vary widely depending upon the particular spoken source language and the particular target sign language.

In a 1983 study by Cokely, researchers found that an increased lag time actually enhanced the overall comprehension of the spoken dialogue and allowed the sign language interpreter to convey a more accurate rendering of what was spoken (Cokely, 1986). In this study, it was found that the number of translation errors (i.e., various types of translation miscues) decreases as the lag time of the interpreters increases. For example, the interpreters in their study with a 2-second lag time had more than twice the total number of miscues of the interpreters with a 4-second lag, who in turn had almost twice as many miscues as those with a 6-second lag. The researchers cautioned, however, that this does not mean there is no upper limit to lag time, and reasoned that there is likely a lag time threshold beyond which the number of translation omissions would significantly increase because the threshold is at the upper limits of the individual's short-term working memory. Nonetheless, the findings of this study point out that providing close synchronization of sign language interpretation to what is being spoken may be counterproductive. In this case, some users may prefer finding a happy medium between the user need for immediacy in remote meetings and the user need for accuracy, while others may prefer the greatest accuracy possible even at the expense of immediacy.

Video Description Synchronization

Video description (sometimes referred to as “audio description”) typically adds spoken narration of important visual elements in video streams such as TV programs and movies. Beginning in the early 1990s, the ability to transmit and receive audio descriptions in TV programming over a Second Audio Program (SAP) channel became available (Cronin & King, 1990). Video description is most commonly applied to prerecorded media, although its application to live events (especially the performing arts) is also a common use case (Di Giovanni, 2018).

Video description can be delivered by two alternative means:

As an audio track synchronized with the visual track, or
As text for presentation to the user via a text to speech system or a braille device. This is referred to in the Media Accessibility User Requirements [[media-accessibility-reqs]] as "video text description".

Whereas the first option requires synchronization of an audio track with the video, the second option is dependent for synchronization on the user's preferences (e.g., speech rate settings). Only the start time of the cue for each description is supplied by the media provider; the end time varies according to the user's local preferences, and may necessitate automatic pausing of the video and audio tracks of the media resource to accommodate the reading of a description. See Media Accessibility User Requirements [[media-accessibility-reqs]], section 2.2, for further details.
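
To illustrate why the end time of a video text description depends on the user, the Python sketch below estimates how long a text to speech system would take to read a description at the user's preferred rate, and how long the media would need to pause if that exceeds the gap available before the next dialogue. It is a sketch under stated assumptions: the speech rate is expressed in words per minute, speaking time is taken to be proportional to word count, and all names are hypothetical rather than drawn from any specification.

```python
def speaking_time_seconds(description: str, user_rate_wpm: float) -> float:
    """Estimate text to speech duration from word count and the user's rate.

    Assumption: speaking time is proportional to the number of
    whitespace-separated words.
    """
    if user_rate_wpm <= 0:
        raise ValueError("user_rate_wpm must be positive")
    return len(description.split()) * 60.0 / user_rate_wpm


def required_pause_seconds(
    description: str, user_rate_wpm: float, gap_seconds: float
) -> float:
    """Return how long the media must pause so the description can finish."""
    overrun = speaking_time_seconds(description, user_rate_wpm) - gap_seconds
    return max(0.0, overrun)


if __name__ == "__main__":
    text = "A woman in a red coat crosses the empty square at dusk"
    print(required_pause_seconds(text, user_rate_wpm=180, gap_seconds=2.5))
    print(required_pause_seconds(text, user_rate_wpm=360, gap_seconds=2.5))
```

The same 12-word description needs a pause for a user listening at 180 WPM but none for a user listening at 360 WPM, which is why only the start time of the cue can be fixed by the media provider.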

Since control of synchronization resides with the creator of the media resource if video description is provided as a supplemental audio track, only this case is considered further in the discussion that follows. The synchronization of video text description can be addressed by an appropriate implementation of client-side software that respects the user's preferences (including speech rate for text to speech systems, and scrolling behavior for braille devices).

One of the most difficult considerations of video description creation is the need to avoid conflicts with the primary speech dialogue. In prerecorded media, a complete transcript of the spoken dialogue with timings is commonly loaded into video description editing software, although in some cases a simple spreadsheet may be used for smaller projects. The next step is typically the identification and inclusion of onscreen events, music and sound effect cues in the media time stream. Editing software may apply an algorithm that calculates the ideal duration of the description entered into the available open time slot, as well as the minimum and maximum tolerated deviation from the ideal reading rate. To do so, the algorithm must be given reading rate values on which to base its calculations (Jankowska et al., 2017). In practice, the description of an onscreen event may need to begin several seconds before the event occurs to avoid audio conflicts.
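
The kind of calculation such editing software performs can be sketched as follows. This Python example is not the algorithm reported by Jankowska et al.; it is a minimal illustration that assumes reading rates are given in words per minute and that the narration time is proportional to word count. The specific rate values (140, 160, and 180 WPM) are placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ReadingRates:
    """Reading rates in words per minute (placeholder values)."""
    slowest_wpm: float = 140.0
    ideal_wpm: float = 160.0
    fastest_wpm: float = 180.0


def duration_bounds(word_count: int, rates: ReadingRates):
    """Return (shortest, ideal, longest) narration durations in seconds."""
    shortest = word_count * 60.0 / rates.fastest_wpm
    ideal = word_count * 60.0 / rates.ideal_wpm
    longest = word_count * 60.0 / rates.slowest_wpm
    return shortest, ideal, longest


def fits_gap(word_count: int, gap_seconds: float, rates: ReadingRates) -> bool:
    """True if the description fits the slot at the fastest tolerated rate."""
    shortest, _, _ = duration_bounds(word_count, rates)
    return shortest <= gap_seconds


if __name__ == "__main__":
    rates = ReadingRates()
    print(duration_bounds(24, rates))
    print(fits_gap(24, gap_seconds=8.0, rates=rates))
    print(fits_gap(24, gap_seconds=7.5, rates=rates))
```

A describer could use the ideal duration as the target and treat the shortest duration as the point at which the text must be shortened or moved to an earlier slot.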

Live events, however, are more difficult to manage than prerecorded media due to the spontaneity of an event happening in real time. While video description can often be scripted during rehearsals of performances and thus made available in real time during a performance, adding video description to a live event which is not rehearsed is much more difficult because there is no pre-event information as to the availability and duration of open slots in the live audio stream. One method used to address this scenario in broadcasting of live events is “near real-time” video description (Boyce et al., n.d.). In near real-time broadcasting, the live event is recorded and transmission is typically delayed within the range of 10 to 60 seconds. This allows time for the system to look ahead and analyze the upcoming portion of the video for silent periods and provides a brief span of time for the describer to insert the narration. While near real-time video description may work well in mainstream media broadcasts, it would be impractical for online participatory events, meetings and group discussions. In such cases, the generally accepted best practice is for participants to always describe visual aspects of content or on-camera actions they may be demonstrating as closely in sync as possible with the visual information.
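
The look-ahead step of near real-time description can be pictured as a scan of the delayed audio buffer for quiet stretches long enough to hold a description. The Python sketch below is a simplified illustration, not the method used by any particular broadcaster: it assumes the audio has already been reduced to one loudness value per fixed-length frame, and the threshold, frame length, and function name are all hypothetical.

```python
def find_silent_slots(levels, frame_seconds, threshold, min_slot_seconds):
    """Return (start, end) times in seconds of sufficiently long quiet runs.

    levels: one loudness value per frame of the buffered (delayed) audio.
    A frame counts as silent when its level is below the threshold.
    """
    slots = []
    run_start = None
    # A trailing sentinel at the threshold closes a run still open at the end.
    for index, level in enumerate(list(levels) + [threshold]):
        silent = level < threshold
        if silent and run_start is None:
            run_start = index
        elif not silent and run_start is not None:
            start_s = run_start * frame_seconds
            end_s = index * frame_seconds
            if end_s - start_s >= min_slot_seconds:
                slots.append((start_s, end_s))
            run_start = None
    return slots


if __name__ == "__main__":
    buffered = [0.8, 0.1, 0.1, 0.1, 0.1, 0.9, 0.05, 0.7, 0.0, 0.0, 0.0, 0.0]
    print(find_silent_slots(buffered, frame_seconds=0.5,
                            threshold=0.2, min_slot_seconds=2.0))
```

In this example the single quiet frame between two loud ones is ignored because it is shorter than the minimum slot, while the two longer quiet runs are reported as candidate slots for narration.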

XR Environment Synchronization

Do users' needs, and the acceptable synchronization tolerances applicable to immersive environments (e.g., virtual reality, augmented reality, and 360-degree video), differ from what is encountered in multimedia in general? What research, if any, has been undertaken into such differences between synchronization in immersive and non-immersive media? If there are specific synchronization issues relevant to immersive environments, they will be documented in this section. The Research Questions Task Force and the Accessible Platform Architectures Working Group invite comments regarding these issues to inform further development of the draft.

References

Alho, J., Lin, F. H., Sato, M., Tiitinen, H., Sams, M., & Jääskeläinen, I. P. (2014). Enhanced neural synchrony between left auditory and premotor cortex is associated with successful phonetic categorization. Frontiers in Psychology, 5, 394.
Armstrong, M. (2013, October). The Development of a Methodology to Evaluate the Perceived Quality of Live TV Subtitles. BBC Research & Development White Paper WHP 259.
Berke, L., Caulfield, C., & Huenerfauth, M. (2017, October). Deaf and hard-of-hearing perspectives on imperfect automatic speech recognition for captioning one-on-one meetings. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility (pp. 155-164).
Blakowski, G., & Steinmetz, R. (1996). A media synchronization survey: Reference model, specification, and case studies. IEEE journal on selected areas in communications, 14(1), 5-35.
Bosch-Baliarda, M., Soler-Vilageliu, O., & Orero, P. (2020). Sign language interpreting on TV: a reception study of visual screen exploration in deaf signing users. In Richart-Marset, M., & Calamita, F. (Eds.), Traducción y Accesibilidad en los medios de comunicación: de la teoría a la práctica / Translation and Media Accessibility: from Theory to Practice. MonTI 12, pp. 108-143.
Boyce, M., Diamond, S., Fels, D., Gadsby, E., Harvie, R., Porch, W., ... & Treviranus, J. (N.D.). Canadian Network for Inclusive Cultural Exchange (CNICE).
Bragg, D., Koller, O., Bellard, M., Berke, L., Boudreault, P., Braffort, A., ... & Ringel Morris, M. (2019, October). Sign language recognition, generation, and translation: An interdisciplinary perspective. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility (pp. 16-31).
Burnham, D., Robert-Ribes, J., & Ellison, R. (1998). Why captions have to be on time. In AVSP'98 International Conference on Auditory-Visual Speech Processing.
Chen, M. (2003, April). A low-latency lip-synchronized videoconferencing system. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 465-471).
Chen, T., & Rao, R. R. (1998). Audio-visual integration in multimodal communication. Proceedings of the IEEE, 86(5), 837-852.
Cokely, D. (1986). The effects of lag time on interpreter errors. Sign Language Studies, 341-375.
Cronin, B. J., & King, S. R. (1990). The Development of the Descriptive Video Services. Journal of Visual Impairment & Blindness, 84(10), 503-506.
Cuzco-Calle, I., Ingavélez-Guerra, P., Robles-Bykbaev, V., & Calle-López, D. (2018, August). An interactive system to automatically generate video summaries and perform subtitles synchronization for persons with hearing loss. In 2018 IEEE XXV International Conference on Electronics, Electrical Engineering and Computing (INTERCON) (pp. 1-4). IEEE.
De Araújo, T. M. U., Ferreira, F. L., Silva, D. A., Oliveira, L. D., Falcão, E. L., Domingues, L. A., ... & Duarte, A. N. (2014). An approach to generate and embed sign language video tracks into multimedia contents. Information Sciences, 281, 762-780.
Díaz-Cintas, J., Orero, P., & Remael, A. (Eds.). (2007). Media for all: subtitling for the deaf, video description, and sign language (Vol. 30). Rodopi.
Di Giovanni, E. (2018). Audio description for live performances and audience participation. Jostrans: the Journal of Specialised Translation, 29, 189-211.
Errattahi, R., El Hannani, A., & Ouahmane, H. (2018). Automatic speech recognition errors detection and correction: A review. Procedia Computer Science, 128, 32-37.
Federal Communications Commission. (2018). Declaratory Ruling on Automatic Speech Recognition. In Report and Order, Declaratory Ruling, Further Notice of Proposed Rulemaking, and Notice of Inquiry (33 FCC Rcd 5800 (9), 5827, para. 48).
Firestone, S. (2007). Lip Synchronization in Video Conferencing. Voice and Video Conferencing Fundamentals. Cisco Systems, Inc.
Garcia, J. E., Ortega, A., Lleida, E., Lozano, T., Bernues, E., & Sanchez, D. (2009, May). Audio and text synchronization for TV news subtitling based on automatic speech recognition. In 2009 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (pp. 1-6). IEEE.
Grant, K. W., & Greenberg, S. (2001). Speech intelligibility derived from asynchronous processing of auditory-visual information. In AVSP 2001-International Conference on Auditory-Visual Speech Processing.
Han, H. H., & Yu, H. N. (2020). An empirical study of temporal variables and their correlations in spoken and sign language relay interpreting. Babel, 66(4-5), 619-635.
Huang, C. W., Hsu, W., & Chang, S. F. (2003). Automatic closed caption alignment based on speech recognition transcripts. Technical report, Columbia.
ITU-T. (1999). Application profile: Sign language and lip-reading real-time conversation using low bit-rate video communication. ITU-T Series H, Supplement 1.
Ivanko, D., Karpov, A., Fedotov, D., Kipyatkova, I., Ryumin, D., Ivanko, D., ... & Zelezny, M. (2018). Multimodal speech recognition: increasing accuracy using high speed video data. Journal on Multimodal User Interfaces, 12(4), 319-328.
Jankowska, A., Ziółko, B., Igras-Cybulska, M., & Psiuk, A. (2017). Reading rate in filmic audio description. Rivista Internazionale di Tecnica della Traduzione = International Journal of Translation, 19.
Jiline, M., Kirk, D., Quirk, K., Sandler, M., & Monette, M. (2020). A review of state-of-the-art automatic speech recognition services for CART and CC applications. In Proceedings of the 2020 NAB Broadcast Engineering and Information Technology (BEIT) Conference. National Association of Broadcasters.
Jordan, T. R., & Sergeant, P. (2000). Effects of distance on visual and audiovisual speech recognition. Language and Speech, 43(1), 107-124.
Kafle, S., & Huenerfauth, M. (2017, October). Evaluating the usability of automatically generated captions for people who are deaf or hard of hearing. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility (pp. 165-174).
Kaganovich, N., Schumaker, J., & Rowland, C. (2016). Matching heard and seen speech: An ERP study of audiovisual word recognition. Brain and Language, 157-158, 14–24. https://doi.org/10.1016/j.bandl.2016.04.010
Keating, E., & Mirus, G. (2003). American Sign Language in virtual space: Interactions between deaf users of computer-mediated video communication and the impact of technology on language practices. Language in Society, 693-714.
Koller, O. (2020). Quantitative survey of the state of the art in sign language recognition. arXiv preprint arXiv:2008.09918.
Kumar, P. J., Hu, W., & Yung, Y. (2017). Virtual Reality Based 3D Video Games and Speech-Lip Synchronization Superseding Algebraic Code Excited Linear Prediction. International Journal of Computer and Information Sciences, 4(12), 303-321.
Maruyama, I., Abe, Y., Sawamura, E., Mitsuhashi, T., Ehara, T., & Shirai, K. (1999). Cognitive experiments on timing lag for superimposing closed captions. In Sixth European Conference on Speech Communication and Technology.
McCarthy, J. E., & Swierenga, S. J. (2010). What we know about dyslexia and web accessibility: a research review. Universal Access in the Information Society, 9(2), 147-152.
Mikul, C. (2014). Caption quality: Approaches to standards and measurement. Media Access Australia.
Montagud, M., Cesar, P., Boronat, F., & Jansen, J. (2018). Introduction to media synchronization (MediaSync). In MediaSync (pp. 3-31). Springer, Cham.
Peelle, J. E., & Sommers, M. S. (2015). Prediction and constraint in audiovisual speech perception. Cortex, 68, 169–181. https://doi.org/10.1016/j.cortex.2015.03.006
Petrie, H. L., Weber, G., & Fisher, W. (2005). Personalization, interaction, and navigation in rich multimedia documents for print-disabled users. IBM Systems Journal, 44(3), 629-635.
Piety, P. J. (2004). The language system of audio description: an investigation as a discursive process. Journal of Visual Impairment & Blindness, 98(8), 453-469.
Sandford, J. (2015). The impact of subtitle display rate on enjoyment under normal television viewing conditions. BBC Research & Development White Paper WHP 306.
Shroyer, E. H., & Birch, J. (1980). Captions and reading rates of hearing-impaired students. American Annals of the Deaf, 125(7), 916-922.
Staelens, N., De Meulenaere, J., Bleumers, L., Wallendael, G., De Cock, J., Geeraert, K., Vercammen, N., Van den Broeck, W., Vermeulen, B., Van de Walle, R., & Demeester, P. (2012). Assessing the importance of audio/video synchronization for simultaneous translation of video sequences. Multimedia Systems, 18. https://doi.org/10.1007/s00530-012-0262-4
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America, 26(2), 212-215.
Summerfield, Q. (1992). Lipreading and audio-visual speech perception. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 335(1273), 71-78.
Venezia, J. H., Thurman, S. M., Matchin, W., et al. (2016). Timing in audiovisual speech perception: A mini review and new psychophysical data. Attention, Perception, & Psychophysics, 78, 583–601. https://doi.org/10.3758/s13414-015-1026-y
Waters, K., & Levergood, T. (1994, October). An automatic lip-synchronization algorithm for synthetic faces. In Proceedings of the Second ACM International Conference on Multimedia (pp. 149-156).
Yi, A., Wong, W., & Eizenman, M. (2013). Gaze patterns and audiovisual speech enhancement. Journal of Speech, Language, and Hearing Research.
Ziegler, C., Keimel, C., Ramdhany, R., & Vinayagamoorthy, V. (2017, June). On time or not on time: A user study on delays in a synchronised companion-screen experience. In Proceedings of the 2017 ACM International Conference on Interactive Experiences for TV and Online Video (pp. 105-114).