Specification for Spoken Presentation in HTML

Accurate and consistent pronunciation and presentation of spoken content by text-to-speech (TTS) synthesis is important in many contexts and critical in education, publishing, communication, entertainment, and other domains. TTS has become an important technology for providing access to digital content on the web. A variety of approaches have been used to address this need, ranging from improper use of the WAI-ARIA standard to the creation of custom attributes and data attributes. These efforts have not led to broad adoption of any standard by user agents, authors, or assistive technologies: there is no consistent, interoperable way to author spoken presentation guidance in HTML that can then be consumed by assistive technologies and other applications that use text-to-speech synthesis to render content. Instead, a number of non-interoperable approaches meet the specific needs of individual applications. This remains a major challenge area, and many users are looking for a standards-based solution.

This proposal identifies two possible technical approaches for author-controlled pronunciation of HTML content. Just as authors use CSS to style the visual presentation of web content, the Pronunciation Task Force aims to develop normative specifications for aural presentation. The Pronunciation Task Force seeks implementors to convert the authoring techniques described here into aural presentation. Feedback from implementors and authors will help the task force decide which approach to submit as the final recommendation.

We base our candidate approaches on a subset of SSML, chosen to bring consistency and predictability to spoken presentation across a full range of assistive technologies and operating environments. Both technical approaches described in this publication carefully avoid the impasse that has prevented SSML from becoming a native HTML technology and should, therefore, be generally applicable. Either approach satisfies our requirements for assistive technologies and is hypothesized to be useful to voice assistants that consume and present HTML content in spoken form. We seek feedback on which approach would prove most implementable across all applications of spoken presentation of web content.

For an introduction to pronunciation issues and related W3C documents, see: Pronunciation Overview.

Multi-attribute Approach for Including SSML in HTML

The multi-attribute approach uses one or more attributes with string values to add speech presentation to an HTML element. Publishers in Japan use a similar technique from EPUB 3 for the SSML phoneme element.

Edgar Allan Poe's The Raven, authored using the multi-attribute approach:

EXAMPLE 1

<p data-ssml-prosody-rate="slow" data-ssml-prosody-pitch="low">
    Once upon a midnight 
    <span data-ssml-phoneme-alphabet="ipa" data-ssml-phoneme-ph="ˈdrɪəri">dreary</span>
    <span data-ssml-break-time="500ms"></span>,
    while I pondered, weak
    <span data-ssml-break-time="150ms"></span> and weary,<br data-ssml-break-time="500ms" />
    Over many a quaint and curious volume of forgotten
    <span data-ssml-prosody-rate="x-slow" data-ssml-prosody-pitch="low"> lore—</span><br />
    While I nodded, nearly napping, suddenly there came a tapping,
    <br data-ssml-audio-src="/soundlibrary/wood/hits/hits_11" />
    As of some one gently rapping,
    <span data-ssml-audio-src="/soundlibrary/wood/hits/hits_11"></span>
    rapping at my chamber door.
    <span data-ssml-audio-src="/soundlibrary/wood/hits/hits_11"></span>
    <br data-ssml-audio-src="/soundlibrary/wood/hits/hits_11" />
    <span data-ssml-prosody-volume="x-soft" data-ssml-prosody-rate="medium">
      "'Tis some visitor,"
    </span>
    I muttered, <span data-ssml-prosody-volume="x-soft" data-ssml-prosody-rate="x-slow">
    <span data-ssml-phoneme-alphabet="ipa" data-ssml-phoneme-ph="tæpɪŋ">"tapping</span>
    at my chamber door—</span><br data-ssml-break-time="750ms" />
    Only this <span data-ssml-break-strength="weak"></span> and nothing
    <span data-ssml-break-strength="none"></span>
    <span data-ssml-prosody-volume="soft" data-ssml-prosody-rate="75%"> more."</span>
</p>
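
The naming pattern in this example, `data-ssml-` followed by an SSML element name and one of its attributes, lends itself to mechanical grouping. The following Python sketch shows one way a consumer might collect an element's speech attributes by SSML element. The names `group_ssml_attributes` and `SSML_ELEMENTS` are hypothetical and not part of this proposal, and treating a bare `data-ssml-say-as` as SSML `interpret-as` is an assumption.

```python
# Hypothetical helper, not defined by this proposal.
SSML_ELEMENTS = ("say-as", "phoneme", "sub", "voice",
                 "emphasis", "break", "prosody", "audio")
PREFIX = "data-ssml-"


def group_ssml_attributes(attrs):
    """Map {'data-ssml-prosody-rate': 'slow'} to {'prosody': {'rate': 'slow'}}."""
    grouped = {}
    for name, value in attrs.items():
        if not name.startswith(PREFIX):
            continue  # not a speech attribute; ignore it
        rest = name[len(PREFIX):]
        if rest == "say-as":
            # Assumption: the bare attribute carries SSML's interpret-as value.
            grouped.setdefault("say-as", {})["interpret-as"] = value
            continue
        for element in SSML_ELEMENTS:
            if rest.startswith(element + "-"):
                prop = rest[len(element) + 1:]
                grouped.setdefault(element, {})[prop] = value
                break
    return grouped
```

Matching against a fixed list of element names, rather than splitting on hyphens, keeps hyphenated names such as `say-as` intact.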

The `data-ssml-*` Multi-Attribute Set

These attributes provide functional equivalence to the SSML counterparts. These attributes are valid on the following HTML elements:

`data-ssml-say-as-*`

Allows the author to classify the element's text content. The attributes are derived from the SSML say-as element and associated properties. Editor's note: interpret-as seems superfluous and should be implied.

`data-ssml-say-as`

`data-ssml-say-as-format` (optional)

Value: time/date format as defined in the W3C Note "SSML 1.0 say-as attribute values".

`data-ssml-say-as-detail` (optional)

Value: detail as defined in the W3C Note "SSML 1.0 say-as attribute values".

EXAMPLE 2


According to the 2010 US Census, the population of <span
data-ssml-say-as='characters'>90274</span>
increased to 25209 from 24976 over the past 10 years.

`data-ssml-phoneme-*`

Defines two required attributes for phonemic/phonetic pronunciation. The element with the phoneme attributes can only contain text (no elements). The attributes are derived from the SSML phoneme element and associated properties.

`data-ssml-phoneme-ph`

Value: The phoneme string

`data-ssml-phoneme-alphabet`

Value: The phonetic alphabet in use: ipa | x-sampa

EXAMPLE 3


Once upon a midnight <span data-ssml-phoneme-alphabet="ipa" data-ssml-phoneme-ph="ˈdrɪəri">dreary</span>
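
The text-only constraint on phoneme-bearing elements can be checked mechanically. The sketch below uses Python's standard `html.parser` module to report child elements found inside an element carrying `data-ssml-phoneme-ph`. The class `PhonemeContentChecker` and the function `phoneme_violations` are hypothetical names introduced for illustration, and reporting offending tag names is only one possible policy.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not tracked as open.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class PhonemeContentChecker(HTMLParser):
    """Hypothetical checker: records child elements inside a phoneme element."""

    def __init__(self):
        super().__init__()
        self._open = []       # open tags, starting at the phoneme element
        self.violations = []  # tag names of offending child elements

    def handle_starttag(self, tag, attrs):
        if self._open:
            # Any element nested in a phoneme element breaks the constraint.
            self.violations.append(tag)
            if tag not in VOID_ELEMENTS:
                self._open.append(tag)
        elif tag not in VOID_ELEMENTS and any(
                name == "data-ssml-phoneme-ph" for name, _ in attrs):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()


def phoneme_violations(markup):
    """Return the tag names of elements nested inside phoneme elements."""
    checker = PhonemeContentChecker()
    checker.feed(markup)
    checker.close()
    return checker.violations
```

An authoring tool could run such a check before publishing and ask the author to move the phoneme attributes onto a text-only span.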

`data-ssml-sub-alias`

A string value that replaces the text content for pronunciation. While similar to aria-label, the alias does not alter the spelling presented by other output modes (for example, on a Braille display). Additionally, the alias attribute can be used by TTS technologies that do not access the accessibility tree. The processor should apply text normalization to the alias value. The attribute is derived from the SSML sub element and associated properties.

Value: text string to be substituted and delivered to the TTS for presentation.

EXAMPLE 4

<span data-ssml-sub-alias="Sodium Chloride">NaCl</span>
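
To illustrate what a consumer might hand to a TTS engine for such an element, here is a minimal Python sketch that builds the corresponding SSML `sub` element with the standard `xml.etree.ElementTree` module. The function name `sub_to_ssml` is hypothetical. The written text is kept as the element content, so only the spoken form changes.

```python
import xml.etree.ElementTree as ET


def sub_to_ssml(text, alias):
    """Build an SSML <sub> element: `alias` is spoken, `text` stays as written."""
    sub = ET.Element("sub", {"alias": alias})
    sub.text = text
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(sub, encoding="unicode")


print(sub_to_ssml("NaCl", "Sodium Chloride"))
# prints: <sub alias="Sodium Chloride">NaCl</sub>
```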

`data-ssml-voice-*`

A set of attributes defining production values that request a change in speaking voice. There are two kinds of attributes for the voice element: those that indicate desired features of a voice and those that control behavior. The attributes are derived from the SSML voice element and associated properties.

`data-ssml-voice-gender` (optional)

Values: female | male | neutral

`data-ssml-voice-age` (optional)

Value: integer corresponding to age in years

`data-ssml-voice-variant` (optional)

Value: integer indicating a numeric voice variant

`data-ssml-voice-name` (optional)

Value: string defining a specific voice name requested from the current TTS engine, e.g., "David"

`data-ssml-voice-languages` (optional)

Value: string containing a space-delimited list of one or more languages to be spoken by this voice.

EXAMPLE 5

She said, "<span data-ssml-voice-gender="female">My name is Marie</span>".

`data-ssml-emphasis-level`

Requests that the text content be spoken with emphasis (also referred to as prominence or stress). This is a single attribute and is derived from the SSML emphasis element and associated properties.

Value: strong | moderate | none | reduced

EXAMPLE 6

Please use <span data-ssml-emphasis-level="strong">extreme caution.</span>

`data-ssml-break-*`

Describes the timing associated with an empty element to control the pausing or other prosodic boundaries between tokens. The use of the break attribute between any pair of tokens is optional. If the element is not present between tokens, the synthesis processor is expected to automatically determine a break based on the linguistic context. The attributes are derived from the SSML break element and associated properties.

`data-ssml-break-strength`

`data-ssml-break-time`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

EXAMPLE 7

Take a deep breath,<span data-ssml-break-time="1s"></span> and exhale.
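
Several attributes in this proposal take a time duration such as "250ms" or "1s". A minimal Python sketch of normalizing such values to milliseconds follows. `duration_to_ms` is a hypothetical helper, the accepted grammar (a non-negative decimal number followed immediately by `s` or `ms`) is an assumption, and returning `None` for anything else is only one possible policy.

```python
import re

# Assumed grammar: digits, optional fraction, then the unit "ms" or "s".
_DURATION = re.compile(r"^(\d+(?:\.\d+)?)(ms|s)$")


def duration_to_ms(value):
    """Convert '250ms' or '1s' to milliseconds; return None if not recognized."""
    match = _DURATION.match(value.strip())
    if match is None:
        return None
    number, unit = match.groups()
    return float(number) * (1000.0 if unit == "s" else 1.0)
```

A consumer could use the `None` result to fall back to the synthesis processor's default break.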

`data-ssml-prosody-*`

Permits control of the pitch, speaking rate and volume of the speech output. The attributes are derived from the SSML prosody element and associated properties.

`data-ssml-prosody-pitch` (optional)

`data-ssml-prosody-contour` (optional)

Value: string of contour change parameters as defined in the SSML 1.1 recommendation

`data-ssml-prosody-range` (optional)

Value: string range value as defined in the SSML 1.1 recommendation

`data-ssml-prosody-rate` (optional)

`data-ssml-prosody-duration` (optional)

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

`data-ssml-prosody-volume` (optional)

EXAMPLE 8

The tortoise said (slowly), "<span data-ssml-prosody-rate="x-slow">
I am almost at the finish line</span>."

`data-ssml-audio-*`

Supports the insertion of recorded audio files in conjunction with synthesized speech output. The element may be empty. If the element is not empty, then the contents should be the text to be spoken if the audio document is not available. The attributes are derived from the SSML audio element and associated properties.

`data-ssml-audio-src`

Value: The URI of a document with an appropriate media file.

`data-ssml-audio-fetchtimeout` (optional)

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

`data-ssml-audio-fetchint` (optional)

Value: safe | prefetch

`data-ssml-audio-maxage` (optional)

Value: string

`data-ssml-audio-maxstale` (optional)

Value: string

`data-ssml-audio-clipBegin` (optional)

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

`data-ssml-audio-clipEnd` (optional)

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

`data-ssml-audio-repeatCount` (optional)

Value: integer indicating the number of times to repeat the audio clip.

`data-ssml-audio-repeatDur` (optional)

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

EXAMPLE 9

You will hear a brief chime <span data-ssml-audio-src="/audio/chime.ogg"></span> when your time is up.

Single-attribute Approach for Including SSML in HTML

The single-attribute approach uses JSON to allow SSML compatible properties and values to be applied to textual content in HTML. This approach emerged as a means to transform content conforming to the IMS Question & Test Interoperability (QTI) Specification. The QTI standard supports inclusion of SSML in HTML for TTS tools used in educational assessment. This approach has seen preliminary implementation by TextHelp.

This approach requires authors to encode SSML functions, properties and values into JSON, for inclusion in HTML. Then TTS tools must transform the JSON attribute content into SSML. No existing W3C recommendation uses JSON strings as attribute values in HTML, but there is evidence of its use for custom applications using the data attribute. JSON has potential security concerns, and the impact of malformed JSON strings resulting from errors in authoring is among issues that need wider review. The browser, which normally attempts to address malformed HTML, can make no guarantees about the JSON strings. Implementers must decide how to handle malformed JSON.
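
One defensive policy for malformed values is to ignore the annotation and let the text be spoken without it. The Python sketch below illustrates that policy with the standard `json` module. The function name `load_data_ssml` and the choice to return `None` are assumptions for illustration. This proposal does not mandate any particular error handling.

```python
import json


def load_data_ssml(raw):
    """Decode a data-ssml value; return None when it cannot be used."""
    try:
        value = json.loads(raw)
    except (TypeError, ValueError):
        # TypeError: attribute absent (None). ValueError: malformed JSON
        # (json.JSONDecodeError is a subclass of ValueError).
        return None
    # Only a JSON object can describe an SSML function and its properties.
    return value if isinstance(value, dict) else None
```

For instance, a value with a missing closing brace decodes to `None` here, and the consumer would speak the element's text unmodified.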

While a JSON schema is proposed, there are potentially related standards, such as SpeakableSpecification, in development. Converting SSML to a proper JSON schema could cause confusion for implementers and authors. Often such conversions are "...not exactly 1:1 transformation, but very very close". Authoring tools are hypothesized to address these concerns, by eliminating the need for authors to directly write JSON. Techniques to facilitate authoring of speech attributes (whether multi or single-attribute approaches) are already demonstrated [ref].
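
As an illustration of how an authoring tool might spare authors from writing JSON by hand, the Python sketch below serializes one SSML function and its properties into a single-quoted `data-ssml` attribute. The function name `data_ssml_attribute` is hypothetical. Escaping the single quote as `&#39;` is an assumption about how a tool would keep the attribute well-formed when a value contains an apostrophe.

```python
import html
import json


def data_ssml_attribute(function, properties):
    """Return a data-ssml attribute string for one SSML function."""
    payload = json.dumps({function: properties},
                         ensure_ascii=False,     # keep IPA characters readable
                         separators=(",", ":"))  # compact form, no extra spaces
    # Escape &, < and >, then the single quote that delimits the attribute.
    escaped = html.escape(payload, quote=False).replace("'", "&#39;")
    return f"data-ssml='{escaped}'"


print(data_ssml_attribute("prosody", {"rate": "slow", "pitch": "low"}))
# prints: data-ssml='{"prosody":{"rate":"slow","pitch":"low"}}'
```

Generating the value with a JSON serializer, rather than by string concatenation, avoids the separator and quoting mistakes that hand authoring invites.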

Edgar Allan Poe's The Raven, authored using the single-attribute approach:

EXAMPLE 1

<p data-ssml='{"prosody":{"rate":"slow","pitch":"low"}}'>
    Once upon a midnight
    <span data-ssml='{"phoneme":{"alphabet":"ipa","ph":"ˈdrɪəri"}}'>dreary</span>
    <span data-ssml='{"break":{"time":"500ms"}}'></span>,
    while I pondered, weak
    <span data-ssml='{"break":{"time":"150ms"}}'></span> and weary,
    <br data-ssml='{"break":{"time":"500ms"}}' />
    Over many a quaint and curious volume of forgotten
    <span data-ssml='{"prosody":{"rate":"x-slow","pitch":"low"}}'>lore—</span><br />
    While I nodded, nearly napping, suddenly there came a tapping,
    <br data-ssml='{"audio":{"src":"/soundlibrary/wood/hits/hits_11"}}' />
    As of some one gently rapping,
    <span data-ssml='{"audio":{"src":"/soundlibrary/wood/hits/hits_11"}}'></span>
    rapping at my chamber door.
    <span data-ssml='{"audio":{"src":"/soundlibrary/wood/hits/hits_11"}}'></span>
    <br data-ssml='{"audio":{"src":"/soundlibrary/wood/hits/hits_11"}}' />
    <span data-ssml='{"prosody":{"volume":"x-soft","rate":"medium"}}'>
        "'Tis some visitor,"
    </span>
    I muttered, <span data-ssml='{"prosody":{"volume":"x-soft","rate":"x-slow"}}'>
    <span data-ssml='{"phoneme":{"alphabet":"ipa","ph":"tæpɪŋ"}}'>"tapping</span>
    at my chamber door—</span><br data-ssml='{"break":{"time":"750ms"}}' />
    Only this<span data-ssml='{"break":{"strength":"weak"}}'></span>
    and nothing<span data-ssml='{"break":{"strength":"none"}}'> </span>
    <span data-ssml='{"prosody":{"volume":"soft","rate":"75%"}}'>more."</span>
</p>

`data-ssml` Attribute, Properties and Values

The following properties are defined and provide functional equivalence to their SSML counterparts.

The data-ssml attribute provides functional equivalence to SSML. The attribute is valid on the following HTML elements:

The value of the data-ssml attribute is a JSON string, enclosed with single quotes ('), containing a single JSON object representing a specific SSML function with one or more property/value pairs. The valid objects, properties and associated values are defined in the following sections. The JSON schema is presented in Appendix A.
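
The structure described above, one object naming an SSML function with its property/value pairs, maps directly onto an SSML element and its attributes. The following Python sketch shows that transformation using the standard `json` and `xml.etree.ElementTree` modules. The function name `data_ssml_to_ssml` is hypothetical, the sketch does not check names against the schema, and raising `ValueError` for unexpected shapes is an assumed policy.

```python
import json
import xml.etree.ElementTree as ET


def data_ssml_to_ssml(raw, text):
    """Turn a data-ssml JSON string and the element's text into SSML markup."""
    obj = json.loads(raw)
    if not isinstance(obj, dict) or len(obj) != 1:
        raise ValueError("expected a single JSON object with one SSML function")
    (name, props), = obj.items()
    if not isinstance(props, dict):
        raise ValueError("expected property/value pairs for the SSML function")
    # Each JSON property becomes an attribute of the SSML element.
    element = ET.Element(name, {key: str(val) for key, val in props.items()})
    element.text = text or None  # empty text yields an empty element
    return ET.tostring(element, encoding="unicode")


print(data_ssml_to_ssml('{"emphasis":{"level":"strong"}}', "extreme caution."))
# prints: <emphasis level="strong">extreme caution.</emphasis>
```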

`say-as`

Allows the author to classify the element's text content. The JSON definition is derived from the SSML say-as element and associated properties.

`interpret-as`

`format` (optional)

Value: time/date format as defined in W3C Note SSML say-as attribute values.

`detail` (optional)

Value: detail as defined in W3C Note SSML say-as attribute values.

EXAMPLE 2

According to the 2010 US Census, the population of <span
data-ssml='{"say-as" :
{"interpret-as":"characters"}}'>90274</span>
increased to 25209 from 24976 over the past 10 years.

`phoneme`

Defines two required attributes for phonemic/phonetic pronunciation. The element with the phoneme attributes can only contain text (no elements). The JSON definition is derived from the SSML phoneme element and associated properties.

`ph`

Value: string containing the phonetic characters corresponding to the content to be spoken

`alphabet`

Value: ipa | x-sampa defining the phonetic alphabet used for the ph string

EXAMPLE 3


Once upon a midnight <span data-ssml='{"phoneme":{"alphabet":"ipa","ph":"ˈdrɪəri"}}'>dreary</span>

`sub`

Indicates that the text in the alias attribute value replaces the text content for pronunciation. The required alias property specifies the string to be spoken instead of the text content. The processor should apply text normalization to the alias value. The JSON definition is derived from the SSML sub element and associated properties.

`alias`

Value: string containing the text to be spoken as a substitution for the text content of the element to which sub is applied.

EXAMPLE 4

<span data-ssml='{"sub":{"alias":"Sodium Chloride"}}'>NaCl</span>

`voice`

Requests a change in speaking voice. There are two kinds of attributes for voice: those that indicate desired features of a voice and those that control behavior. The JSON definition is derived from the SSML voice element and associated properties.

`gender` (optional)

Values: female | male | neutral

`age` (optional)

Value: integer corresponding to age in years

`variant` (optional)

Value: integer indicating a numeric voice variant

`name` (optional)

Value: string defining a specific voice name requested from the current TTS engine, e.g., "Microsoft David (English)"

`languages` (optional)

Value: string containing a space-delimited list of one or more languages to be spoken by this voice.

EXAMPLE 5

She said, "<span data-ssml='{"voice":{"gender":"female"}}'>My name is Marie</span>".

`emphasis`

Requests that the text content of the element to which emphasis is applied be spoken with emphasis (also referred to as prominence or stress). The JSON definition is derived from the SSML emphasis element and associated properties.

`level`

Value: strong | moderate | none | reduced

EXAMPLE 6


Please use <span data-ssml='{"emphasis":{"level":"strong"}}'>extreme caution.</span>

`break`

Describes the timing associated with an empty element to control the pausing or other prosodic boundaries between tokens. The use of the break between any pair of tokens is optional. If the element is not present between tokens, the synthesis processor is expected to automatically determine a break based on the linguistic context. The JSON definition is derived from the SSML break element and associated properties.

`strength`

`time`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc. (s=second, ms=milliseconds)

EXAMPLE 7


Take a deep breath,<span data-ssml='{"break":{"time":"1s"}}'></span> and exhale.

`prosody`

Permits control of the pitch, speaking rate and volume of the speech output. The object has six properties. The JSON definition is derived from the SSML prosody element and associated properties.

`pitch`

`contour`

Value: string of contour change parameters as defined in the SSML 1.1 recommendation

`range`

Value: string range value as defined in the SSML 1.1 recommendation

`rate`

`duration`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

`volume`

EXAMPLE 8


The tortoise said (slowly), "<span data-ssml='{"prosody":{"rate":"x-slow"}}'>I am almost at the finish line</span>."

`audio`

Supports the insertion of recorded audio files in conjunction with synthesized speech output. The element may be empty. If the element is not empty, then the contents should be the text to be spoken if the audio document is not available. The JSON definition is derived from the SSML audio element and associated properties.

`src`

Value: The URI of a document with an appropriate media file.

`fetchtimeout`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

`fetchint`

Value: safe | prefetch

`maxage`

Value: string

`maxstale`

Value: string

`clipBegin`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

`clipEnd`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

`repeatCount`

Value: integer indicating the number of times to repeat the audio clip.

`repeatDur`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

EXAMPLE 9


You will hear a brief chime <span data-ssml='{"audio":{"src":"/audio/chime.ogg"}}'></span> when your time is up.

