Abstract

This specification is provided to promote interoperability among implementations and users of in-band text tracks sourced for [HTML5]/[HTML] from media resource containers. The specification provides guidelines for the creation of video, audio and text tracks and their attribute values as mapped from in-band tracks from media resource types typically supported by User Agents. It also explains how the UA should map in-band text track content into text track cues.

Mappings are defined for [MPEGDASH], [ISOBMFF], [MPEG2TS], [OGGSKELETON] and [WebM].

Status of This Document

This document is merely a public working draft of a potential specification. It has no official standing of any kind and does not represent the support or consensus of any standards organisation.

This is the first draft. Please send feedback to: public-inbandtracks@w3.org.

1. Introduction

The specification maintains mappings from in-band audio, video and other data tracks of media resources to HTML VideoTrack, AudioTrack, and TextTrack objects and their attribute values.

A generic rule to follow is that a track as exposed in HTML only ever represents a single semantic concept. When mapping from a media resource, sometimes an in-band track does not relate 1-to-1 to a HTML text, audio or video track.

Note

For example, a HTML TextTrack object is either a subtitle track or a caption track, never both. However, in-band text tracks may encapsulate caption and subtitle cues of the same language as a single in-band track. Since a caption track is essentially a subtitle track with additional cues of transcripts of audio-only information, such an encapsulation in a single in-band track can save space. In HTML, these tracks should be exposed as two TextTrack objects, since they represent different semantic concepts. The cues appear in their relevant tracks - subtitle cues would be present in both. This allows users to choose between the two tracks and activate the desired one in the same manner that they do when the two tracks are provided through two track elements.
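
The split described in this note can be sketched as follows. This is only an illustration, not a normative algorithm: the `InBandCue` type and its `caption_only` flag are hypothetical names, and the sketch assumes the container marks which cues carry transcripts of audio-only information.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InBandCue:
    """Hypothetical cue read from a mixed caption/subtitle in-band track."""
    start: float
    end: float
    text: str
    caption_only: bool  # assumed flag: True for transcripts of audio-only information


def split_mixed_track(cues: List[InBandCue]) -> Tuple[List[InBandCue], List[InBandCue]]:
    """Return (subtitle_cues, caption_cues) for two separate TextTrack objects."""
    # Subtitle cues appear in both tracks; caption-only cues appear only
    # in the caption track.
    subtitles = [cue for cue in cues if not cue.caption_only]
    captions = list(cues)
    return subtitles, captions
```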

Note

A similar logic applies to in-band text tracks that mix subtitle cues of different languages in one track. These cues, too, should be exposed in separate tracks, one per language.

Note

A further example is when a UA decides to implement rendering for a caption track but without exposing the caption track through the TextTrack API. To the Web developer and the Web page user, such a video appears as though it has burnt-in captions. Therefore, the UA could expose two video tracks on the HTMLMediaElement - one with captions and a kind="captions" and one without captions with a kind="main". In this way, the user and the Web developer still get the choice of whether to see the video with or without captions.

Another generic rule to follow for in-band data tracks is that in order to map them to TextTrack objects, the contents of the track need to be mapped to media-time aligned cues that relate to a non-zero interval of time.

For every MIME-type/subtype of an existing media container format, this specification defines the following information:

  1. Track order.
  2. How to identify the type of tracks.
  3. Setting track attributes 'id', 'kind', 'language' and 'label' for sourced Text Tracks.
  4. Setting track attributes 'id', 'kind', 'language' and 'label' for sourced Audio and Video tracks.
  5. Mapping Text Track content into text track cues.

2. MPEG DASH

MIME type/subtype: application/dash+xml
  1. Track Order

    The order of tracks specified in the MPD (Media Presentation Description) format [MPEGDASH] is maintained when sourcing multiple MPEG DASH tracks into HTML.

  2. Determining the type of track

    A user agent recognises and supports data from a MPEG DASH media resource as being equivalent to a HTML track based on the AdaptationSet or ContentComponent mimeType:

    • text track: the mimeType is of main type "application" or "text"
    • video track: the mimeType is of main type "video"
    • audio track: the mimeType is of main type "audio"
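
The classification above depends only on the main type of the mimeType, which could be sketched as follows. The function name is hypothetical, and returning the empty string for unrecognised main types is an assumption of this sketch, not something the mapping defines.

```python
def dash_track_type(mime_type: str) -> str:
    """Classify an AdaptationSet/ContentComponent mimeType as 'text', 'video' or 'audio'."""
    # The main type is the part before the first "/".
    main_type = mime_type.split("/", 1)[0].strip().lower()
    if main_type in ("application", "text"):
        return "text"
    if main_type in ("video", "audio"):
        return main_type
    return ""  # assumption: other main types are not exposed as tracks
```
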
  3. Track Attributes for sourced Text Tracks

    Data for sourcing text track attributes may exist in the media content or in the MPD. Text track attribute values are first sourced from track data in the media container, as described for text track attributes in MPEG-2 Transport Streams and text track attributes in MPEG-4 ISOBMFF. If a track attribute value cannot be determined from the media container, then the track attribute value is sourced from data in the MPD as follows:

    Attribute How to source its value
    id Content of the 'id' attribute in the AdaptationSet or ContentComponent element. Empty string if 'id' attribute is not present.
    kind

    Given URN="urn:mpeg:dash:role:2011":

    • "captions": if the role descriptor's value is "caption"
    • "subtitles": if the role descriptor's value is "subtitle"
    • "metadata": otherwise
    label The empty string.
    language Content of the 'lang' attribute in the AdaptationSet or ContentComponent element.
    inBandMetadataTrackDispatchType If @kind is "metadata" the concatenation of the AdaptationSet element and all child Role descriptors. The empty string otherwise.
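
As a rough sketch, the 'kind' row of the table above could be implemented as shown here. The function name and the `(schemeIdUri, value)` pair representation are hypothetical. The table does not say what happens when both "caption" and "subtitle" roles are present; this sketch assumes "caption" wins.

```python
from typing import Iterable, Tuple

ROLE_SCHEME = "urn:mpeg:dash:role:2011"


def dash_text_track_kind(roles: Iterable[Tuple[str, str]]) -> str:
    """Map Role descriptors, given as (schemeIdUri, value) pairs, to a text track kind."""
    # Only descriptors using the DASH role scheme are considered.
    values = {value for scheme, value in roles if scheme == ROLE_SCHEME}
    if "caption" in values:  # assumption: caption takes precedence over subtitle
        return "captions"
    if "subtitle" in values:
        return "subtitles"
    return "metadata"
```
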
  4. Track Attributes for sourced Audio and Video Tracks

    Data for sourcing audio and video track attributes may exist in the media content or in the MPD. Audio and video track attribute values are first sourced from track data in the media container, as described for audio and video track attributes in MPEG-2 Transport Streams and audio and video track attributes in MPEG-4 ISOBMFF. If a track attribute value cannot be determined from the media container, then the track attribute value is sourced from data in the MPD as follows:

    Attribute How to source its value
    id Content of the 'id' attribute in the AdaptationSet or ContentComponent element. Empty string if 'id' attribute is not present.
    kind

    Given a role scheme of "urn:mpeg:dash:role:2011", determine the 'kind' attribute from the value of the role descriptors in the AdaptationSet element.

    • "alternative": if the role is "alternate" but not also "main", "commentary" or "dub"
    • "captions": if the role is "caption" and also "main"
    • "descriptions": if the role is "description" and also "supplementary"
    • "main": if the role is "main" but not also "caption", "subtitle", or "dub"
    • "main-desc": if the role is "main" and also "description"
    • "sign": not used
    • "subtitles": if the role is "subtitle" and also "main"
    • "translation": if the role is "dub" and also "main"
    • "commentary": if the role is "commentary" but not also "main"
    • "": otherwise
    label The empty string.
    language Content of the 'lang' attribute in the AdaptationSet or ContentComponent element.
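
The 'kind' rules in the table above can be read as tests on the set of role values of an AdaptationSet. The sketch below is one possible reading, not a definitive one. It assumes the rules are evaluated in list order, except that "main-desc" is tested before "main", because a role set containing both "main" and "description" would otherwise also satisfy the "main" rule. The function name is hypothetical.

```python
from typing import Iterable


def dash_av_track_kind(roles: Iterable[str]) -> str:
    """Map the role values of an AdaptationSet to an audio/video track kind."""
    r = set(roles)
    if "alternate" in r and not r & {"main", "commentary", "dub"}:
        return "alternative"
    if {"caption", "main"} <= r:
        return "captions"
    if {"description", "supplementary"} <= r:
        return "descriptions"
    if {"main", "description"} <= r:  # assumption: tested before plain "main"
        return "main-desc"
    if "main" in r and not r & {"caption", "subtitle", "dub"}:
        return "main"
    if {"subtitle", "main"} <= r:
        return "subtitles"
    if {"dub", "main"} <= r:
        return "translation"
    if "commentary" in r and "main" not in r:
        return "commentary"
    return ""
```
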
  5. Mapping Text Track content into text track cues

    TextTrackCues may be sourced from DASH media content in the WebVTT, TTML, MPEG-2 TS or ISOBMFF format.

    Media content with the MIME type "text/vtt" is in the WebVTT format and should be exposed as a VTTCue as defined in [WEBVTT].

    Media content with the MIME type "application/ttml+xml" is in the TTML format and should be exposed as an as yet to be defined TTMLCue. Alternatively, browsers can also map the TTML features to VTTCue objects. Finally, browsers that cannot render TTML [ttaf1-dfxp] format data should expose it as DataCue objects [HTML5]. In this case, the TTML file must be parsed in its entirety and then converted into a sequence of TTML Intermediate Synchronic Documents (ISDs). Each ISD creates a DataCue object with attributes sourced as follows:

    Attribute How to source its value
    id Decimal representation of the 'id' attribute of the 'head' element in the XML document. Null if there is no 'id' attribute.
    startTime Value of the beginning media time of the active temporal interval of the ISD.
    endTime Value of the ending media time of the active temporal interval of the ISD.
    pauseOnExit "false"
    data The (UTF-16 encoded) ArrayBuffer composing the ISD resource.

    Media content with the MIME type "application/mp4" or "video/mp4" is in the ISOBMFF format and should be exposed following the same rules as for ISOBMFF text track.

    Media content with the MIME type "video/mp2t" is in the MPEG-2 TS format and should be exposed following the same rules as for MPEG-2 TS text track.

3. MPEG-2 Transport Streams

MIME type/subtype: audio/mp2t, video/mp2t
  1. Track Order

    Tracks are called "elementary streams" in a MPEG-2 Transport Stream (TS) [MPEG2TS]. The order in which elementary streams are listed in the "Program Map Table" (PMT) of a MPEG-2 TS is maintained when sourcing multiple MPEG-2 tracks into HTML.

    Note

    The order of elementary streams in the PMT may change between when the media resource was created and when it is received by the user agent. Scripts should not infer any information from the ordering, or rely on any particular ordering being present.

  2. Determining the type of track

    A user agent recognises and supports data from a MPEG-2 TS resource as being equivalent to a HTML track based on the value of the 'stream_id' field of an elementary stream as given in a Transport or Program Stream header and which maps to a "stream type":

    • text track:
      • The elementary stream with PID 0x02 or the 'stream_type' value is "0x02", "0x05" or between "0x80" and "0xFF".
      • The CEA 708 caption service [CEA708], as identified by:
        • A 'caption_service_descriptor' [ATSC65] in the 'Elementary Stream Descriptors' in the PMT entry for a video stream with stream type 0x02 or 0x1B.
        • For 'stream_type' 0x02, the presence of caption data in the 'user_data()' field [ATSC52].
        • For stream type 0x1B, the presence of caption data in the 'ATSC1_data()' field [SCTE128-1].
      • a DVB subtitle component [DVB-SUB] as identified by a 'subtitling_descriptor' [DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a 'stream_type' of "0x06"
      • an ITU-R System B Teletext component [DVB-TXT] as identified by a 'teletext_descriptor' [DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a 'stream_type' of "0x06"
      • a VBI data component [DVB-VBI] as identified by a 'VBI_data_descriptor' [DVB-SI] or a 'VBI_teletext_descriptor' [DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a 'stream_type' of "0x06"
    • video track: the stream type value is "0x01", "0x02", "0x10", "0x1B", between "0x1E" and "0x24" or "0xEA".
    • audio track:
      • the stream type value is "0x03", "0x04", "0x0F", "0x11", "0x1C", "0x81" or "0x87".
      • an AC-3 audio component as identified by an 'AC-3_descriptor' [DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a 'stream_type' of "0x06"
      • an Enhanced AC-3 audio component as identified by an 'enhanced_ac-3_descriptor' [DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a 'stream_type' of "0x06"
      • a DTS® audio component as identified by a 'DTS_audio_stream_descriptor' [DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a 'stream_type' of "0x06"
      • a DTS-HD® audio component as identified by a 'DTS-HD_audio_stream_descriptor' [DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with a 'stream_type' of "0x06"
  3. Track Attributes for sourced Text Tracks

    Attribute How to source its value
    id Decimal representation of the elementary stream's identifier ('elementary_PID' field) in the PMT.

    In the case of CEA 708 closed captions, decimal representation of the 'caption_service_number' in the 'Caption Service Descriptor' in the PMT.

    If program 0 (zero) is present in the transport stream, a string of the format "OOOO.TTTT.SSSS.CC" consisting of the following, lower-case hexadecimal encoded fields:

    • OOOO is the four character representation of the 16-bit 'original_network_id' [DVB-SI].
    • TTTT is the four character representation of the 16-bit 'transport_stream_id' [DVB-SI].
    • SSSS is the four character representation of the 16-bit 'service_id' [DVB-SI].
    • CC is:
      • If a 'stream_identifier_descriptor' [DVB-SI] is present in the PMT, a two character representation of the 8-bit 'component_tag' value.
      • Otherwise, a four character representation of the elementary stream's identifier (13-bit 'elementary_PID' field) in the PMT.

    kind
    • "captions":
      • For a CEA708 caption service.
      • for a DVB subtitle component [DVB-SUB] as identified by a 'subtitling_descriptor' [DVB-SI] in the PMT with a 'subtitling_type' in the range "0x20" to "0x25".
      • an ITU-R System B Teletext component [DVB-TXT] as identified by a 'teletext_descriptor' [DVB-SI] with a 'teletext_type' value of "0x05" in the PMT
      • a VBI data component [DVB-VBI] as identified by a 'VBI_teletext_descriptor' [DVB-SI] with a 'teletext_type' value of "0x05" in the PMT.
    • "subtitles":
      • If the stream type value is "0x82".
      • for a DVB subtitle component [DVB-SUB] as identified by a 'subtitling_descriptor' [DVB-SI] in the PMT with a 'subtitling_type' in the range "0x10" to "0x15".
      • an ITU-R System B Teletext component [DVB-TXT] as identified by a 'teletext_descriptor' [DVB-SI] with a 'teletext_type' value of "0x02" in the PMT
      • a VBI data component [DVB-VBI] as identified by a 'VBI_teletext_descriptor' [DVB-SI] with a 'teletext_type' value of "0x02" in the PMT.
    • "metadata": otherwise
    label
    • If a 'component_name_descriptor' [ATSC65] is found immediately after the 'ES_info_length' field in the Program Map Table [MPEG2TS], the DOMString representation of the 'component_name_string' in that 'component_name_descriptor'.
    • If a 'component_descriptor' [DVB-SI] for the component is present in the SDT or EIT, the DOMString representation of the content of the text field in that 'component_descriptor'
    • The empty string otherwise.
    language @kind is
    • "captions":
      • For a CEA708 caption service.
        • Content of the 'language' field for the caption service in the 'caption_service_descriptor', if present.
        • Otherwise, for the first caption service, as identified by the 'service_number' field in the 'service_block' [CEA708] with a value of 1, the value of '@language' of the audio track where '@kind' has the value "main".
        • The empty string for all other caption services, as identified by values greater than 1 in the 'service_number' field.
      • For a DVB subtitle component [DVB-SUB], the value of the 'ISO_639_language_code' field in the 'subtitling_descriptor' [DVB-SI] in the PMT
      • For an ITU-R System B Teletext component [DVB-TXT], the value of the 'ISO_639_language_code' field in the 'teletext_descriptor' [DVB-SI] in the PMT
      • For a VBI data component [DVB-VBI], the value of the 'ISO_639_language_code' field in the 'VBI_teletext_descriptor' [DVB-SI] in the PMT
    • "subtitles":
      • If 'stream_type' value is "0x82", the content of the 'ISO_639_language_code' field in the 'ISO_639_language_descriptor' in the elementary stream descriptor array in the PMT.
      • for a DVB subtitle component [DVB-SUB], the value of the 'ISO_639_language_code' field in the 'subtitling_descriptor' [DVB-SI] in the PMT
      • for an ITU-R System B Teletext component [DVB-TXT], the value of the 'ISO_639_language_code' field in the 'teletext_descriptor' [DVB-SI] in the PMT
      • for a VBI data component [DVB-VBI], the value of the 'ISO_639_language_code' field in the 'VBI_teletext_descriptor' [DVB-SI] in the PMT
    • "metadata": The empty string.
    inBandMetadataTrackDispatchType If @kind is "metadata", then the concatenation of the 'stream_type' byte field in the program map table and 'ES_info_length' bytes following the 'ES_info_length' field expressed in hexadecimal using uppercase ASCII hex digits. The empty string otherwise.
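
Two of the string formats in the table above, the "OOOO.TTTT.SSSS.CC" identifier and the hexadecimal 'inBandMetadataTrackDispatchType', could be produced along these lines. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the caller is assumed to have already parsed the relevant PSI/SI fields.

```python
from typing import Optional


def dvb_track_id(original_network_id: int, transport_stream_id: int,
                 service_id: int, elementary_pid: int,
                 component_tag: Optional[int] = None) -> str:
    """Build the lower-case hexadecimal "OOOO.TTTT.SSSS.CC" identifier."""
    if component_tag is not None:
        # A stream_identifier_descriptor is present: two characters for the 8-bit tag.
        cc = f"{component_tag & 0xFF:02x}"
    else:
        # Otherwise: four characters for the 13-bit elementary_PID.
        cc = f"{elementary_pid & 0x1FFF:04x}"
    return (f"{original_network_id & 0xFFFF:04x}."
            f"{transport_stream_id & 0xFFFF:04x}."
            f"{service_id & 0xFFFF:04x}.{cc}")


def ts_dispatch_type(stream_type: int, es_info: bytes) -> str:
    """Concatenate stream_type and the ES_info bytes as uppercase hex digits."""
    # es_info holds the 'ES_info_length' bytes that follow the ES_info_length field.
    return (bytes([stream_type & 0xFF]) + es_info).hex().upper()
```
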
  4. Track Attributes for sourced Audio and Video Tracks

    Attribute How to source its value
    id
    • Decimal representation of the elementary stream's identifier ('elementary_PID' field) in the PMT.
    • If a program 0 (zero) is present in the transport stream, a string of the format "OOOO.TTTT.SSSS.CC" or "OOOO.TTTT.SSSS.CC&CC", consisting of the following, lower-case hexadecimal encoded fields:
      • OOOO is the four character representation of the 16-bit 'original_network_id' [DVB-SI].
      • TTTT is the four character representation of the 16-bit 'transport_stream_id' [DVB-SI].
      • SSSS is the four character representation of the 16-bit 'service_id' [DVB-SI].
      • CC is:
        • If a 'stream_identifier_descriptor' [DVB-SI] is present in the PMT, a two character representation of the 8-bit 'component_tag' value.
        • Otherwise, a four character representation of the elementary stream's identifier (13-bit 'elementary_PID' field) in the PMT.

      Where a track is derived from two components, the second form ("CC&CC") identifies the independent and dependent streams, where the first 'CC' identifies the independent stream, and the second 'CC' identifies the dependent stream. Otherwise the first form is used.

    kind
    • If a 'supplementary_audio_descriptor' [DVB-SI] is present in the PMT for an audio component, the value is derived according to the audio purpose defined in table J.3 of [DVB-SI] using the following rules:
      • "main" if PSI signalling of audio purpose indicates "Main audio" for the audio track that the user agent would select by default, otherwise to "translation"
        Note

        Need to define how UA would select track by default.

      • components with an audio purpose of "Audio description (broadcast-mix)" map to "main-desc"
      • components with an audio purpose of "Audio description (receiver-mix)":
        • The user agent exposes an audio track of @kind "main-desc" for each permitted combination of this track with another audio track as defined in annex J.2 of [DVB-SI]. Enabling this track results in the combination being presented.
        • If the user agent can present the stream in isolation, it also exposes an audio track of @kind "descriptions" for this audio component.
      • components with an audio purpose of "Clean audio (broadcast-mix)", "Parametric data dependent stream", or "Unspecific audio for the general audience" map to "alternative"
      • components with other audio purposes map to the empty string
    • Otherwise:
      • "descriptions":
        • For AC-3 audio [ATSC52] if the 'bsmod' field is 2 and the 'full_svc' field is 0 in the 'AC-3_audio_stream_descriptor()' in the PMT
        • For E-AC-3 audio [ATSC52] if the 'audio_service_type' field is 2 and the 'full_service_flag' is 0 in the 'E-AC-3_audio_descriptor()' in the PMT
        • For AAC audio [SCTE193-2] if the 'AAC_service_type' field is 2 and the 'receiver_mix_rqd' is 1 in the 'MPEG_AAC_descriptor()' in the PMT
      • "main" if the first audio (video) elementary stream in the PMT and the 'audio_type' field in the 'ISO_639_language_descriptor', if present, is "0x00" or "0x01"
      • "main-desc":
        • For AC-3 audio [ATSC52] if the 'bsmod' field is 2 and the 'full_svc' field is 1 in the 'AC-3_audio_stream_descriptor()'
        • For E-AC-3 audio [ATSC52] if the 'audio_service_type' field is 2 and the 'full_service_flag' is 1 in the 'E-AC-3_audio_descriptor()'
        • For AAC audio [SCTE193-2] if the 'AAC_service_type' field is 2 and the 'receiver_mix_rqd' is 0 in the 'MPEG_AAC_descriptor()'
      • "sign" video components with a 'component_descriptor' [DVB-SI] in the SDT or EIT, where the 'stream_content' is "0x3" and the 'component_type' is "0x30" or "0x31"
      • "translation": not first audio elementary stream in the PMT and the 'audio_type' field in the 'ISO_639_language_descriptor' is "0x00" or "0x01" and bsmod=0
      • "": otherwise
    label
    • If a 'component_descriptor' [DVB-SI] is present in the SDT or EIT, the DOMString representation of the content of the text field in that 'component_descriptor'
    • If a 'component_name_descriptor' [ATSC65] is present for this elementary stream in the Program Map Table [MPEG2TS], the DOMString representation of the 'component_name_string' field in that descriptor.
    • The empty string otherwise.
    language @kind is:
    • "descriptions" or "main-desc": Content of the 'language' field in the 'AC-3_audio_stream_descriptor' or 'E-AC-3_audio_descriptor' [ATSC52] if present.
    • otherwise: Content of the 'ISO_639_language_code' field in the 'ISO_639_language_descriptor'.
  5. Mapping Text Track content into text track cues for MPEG-2 TS

    MPEG-2 transport streams may contain data that should be exposed as cues on 'captions', 'subtitles' or 'metadata' text tracks. No data is defined that equates to 'descriptions' or 'chapters' text track cues.

    1. Metadata cues

      Cues on an MPEG-2 metadata text track are created as DataCue objects [HTML5]. Each 'section' in an elementary stream identified as a text track creates a DataCue object with its TextTrackCue attributes sourced as follows:

      Attribute How to source its value
      id Decimal representation of the 'table_id' in the first 8 bits of the 'section' data.
      startTime 0
      endTime The time, in the media resource timeline, that corresponds to the presentation time of the video frame received immediately prior to the 'section' in the media resource.
      pauseOnExit "false"
      data The 'section_length' number of bytes immediately following the 'section_length' field in the 'section'.
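
The 'id' and 'data' rows of the table above can be illustrated with a small parser. The sketch assumes the standard MPEG-2 section header layout, with an 8-bit 'table_id' followed by a 12-bit 'section_length' in the low bits of the next two bytes. The `SectionCueFields` type and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SectionCueFields:
    """Hypothetical holder for the DataCue attributes derived from a section."""
    id: str
    data: bytes


def section_to_cue_fields(section: bytes) -> SectionCueFields:
    """Extract the cue id and data payload from one MPEG-2 TS section."""
    if len(section) < 3:
        raise ValueError("a section header is at least 3 bytes long")
    table_id = section[0]
    # section_length occupies the low 4 bits of byte 1 and all of byte 2.
    section_length = ((section[1] & 0x0F) << 8) | section[2]
    data = section[3:3 + section_length]
    if len(data) != section_length:
        raise ValueError("truncated section")
    return SectionCueFields(id=str(table_id), data=data)
```
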
    2. Captions cues

      • CEA 708

        MPEG-2 TS captions in the CEA 708 format [CEA708] are carried in the video stream in Picture User Data [ATSC53-4] for 'stream_type' 0x02 and in Supplemental Enhancement Information [ATSC72-1] for 'stream_type' 0x1B. Browsers that can render the CEA 708 format should expose them in as yet to be specified CEA708Cue objects. Alternatively, browsers can also map the CEA 708 features to VTTCue objects [VTT708]. Finally, browsers that cannot render CEA 708 captions should expose them as DataCue objects [HTML5]. In this case, each 'service block' in a digital TV closed caption (DTVCC) transport channel creates a DataCue object with TextTrackCue attributes sourced as follows:

        Attribute How to source its value
        id Decimal representation of the 'service_number' in the 'service_block'.
        startTime The time, in the HTML media resource timeline, that corresponds to the presentation time stamp for the video frame that contained the first 'Caption Channel Data Byte' of the 'service_block'.
        endTime The sum of the startTime and 4 seconds.
        Note

        CEA 708 captions do not have an explicit end time - a rendering device derives the end time for a caption based on subsequent caption data. Setting endTime equal to startTime might be more appropriate but this would require better support for zero-length cues, as proposed in HTML Bug 25693.

        pauseOnExit "false"
        data The 'service_block'.
      • DVB

        MPEG-2 TS captions in the DVB subtitle format [DVB-SUB], ITU-R System B Teletext [DVB-TXT] and VBI [DVB-VBI] formats are not exposed in a TextTrackCue.

    3. Subtitles cues

      • SCTE 27

        MPEG-2 TS subtitles in the SCTE 27 format [SCTE27] should be exposed in as yet to be specified SCTE27Cue objects. Alternatively, browsers can also map the SCTE 27 features to VTTCue objects via an as yet to be specified mapping process. Finally, browsers that cannot render SCTE 27 subtitles should expose them as DataCue objects [HTML5]. In this case, each 'section' in an elementary stream identified as a subtitles text track creates a DataCue object with TextTrackCue attributes sourced as follows:

        Attribute How to source its value
        id Decimal representation of the 'table_id' in the first 8 bits of the 'section' data.
        startTime The time, in the HTML media resource timeline, that corresponds to the 'display_in_PTS' field in the section data.
        endTime The sum of the startTime and the 'display_duration' field in the section data expressed in seconds.
        pauseOnExit "false"
        data The 'section_length' number of bytes immediately following the 'section_length' field in the 'section'.
      • DVB

        MPEG-2 TS subtitles in the DVB subtitle format [DVB-SUB], ITU-R System B Teletext [DVB-TXT] and VBI [DVB-VBI] formats are not exposed in a TextTrackCue.

4. MPEG-4 ISOBMFF

MIME type/subtype: audio/mp4, video/mp4
  1. Track Order

    The order of tracks specified by TrackBox ('trak') boxes in the MovieBox ('moov') container [ISOBMFF] is maintained when sourcing multiple MPEG-4 tracks into HTML.

  2. Determining the type of track

    A user agent recognises and supports data from a MPEG-4 TrackBox as being equivalent to a HTML track based on the value of the 'handler_type' field in the HandlerBox ('hdlr') of the MediaBox ('mdia') of the TrackBox:

    • text track: the 'handler_type' value is "meta", "subt" or "text"
    • video track: the 'handler_type' value is "vide"
    • audio track: the 'handler_type' value is "soun"
  3. Track Attributes for sourced Text Tracks

    Attribute How to source its value
    id Decimal representation of the 'track_ID' of a TrackHeaderBox ('tkhd') in a TrackBox ('trak').
    kind
    • "captions":
      • WebVTT caption: 'handler_type' is "text" and SampleEntry format is 'WVTTSampleEntry' [ISO14496-30] and the VTT metadata header 'Kind' is "captions"
      • SMPTE-TT caption: 'handler_type' is "subt" and SampleEntry format is 'XMLSubtitleSampleEntry' [ISO14496-30] and the 'namespace' is set to "http://www.smpte-ra.org/schemas/2052-1/2013/smpte-tt#cea708" [SMPTE2052-11].
      • 3GPP caption: 'handler_type' is "text" and the SampleEntry code ('format' field) is "tx3g".
        Note

        Are all sample entries of this type "captions"?

    • "subtitles":
      • WebVTT subtitle: 'handler_type' is "text" and SampleEntry format is 'WVTTSampleEntry' [ISO14496-30] and the VTT metadata header 'Kind' is "subtitles"
      • SMPTE-TT subtitle: 'handler_type' is "subt" and SampleEntry format is 'XMLSubtitleSampleEntry' [ISO14496-30] and the 'namespace' is set to a TTML namespace that does not indicate a SMPTE-TT caption.
    • "metadata": otherwise
    label Content of the 'name' field in the HandlerBox.
    language Content of the 'language' field in the MediaHeaderBox.
    inBandMetadataTrackDispatchType
    • @kind is "metadata":
      • if a 'XMLMetaDataSampleEntry' box is present the concatenation of the string "metx", a U+0020 SPACE character, and the value of the 'namespace' field
      • if a 'TextMetaDataSampleEntry' box is present the concatenation of the string "mett", a U+0020 SPACE character, and the value of the 'mime_format' field
      • otherwise the empty string
    • otherwise the empty string
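
The 'inBandMetadataTrackDispatchType' rules above amount to a short string concatenation, sketched here. The function and its parameters are hypothetical. The caller is assumed to pass the name of the sample entry box that is present, together with its 'namespace' or 'mime_format' value.

```python
def isobmff_dispatch_type(kind: str, sample_entry: str,
                          namespace: str = "", mime_format: str = "") -> str:
    """Build inBandMetadataTrackDispatchType for an ISOBMFF text track."""
    if kind != "metadata":
        return ""
    if sample_entry == "XMLMetaDataSampleEntry":
        return "metx" + " " + namespace  # "metx", U+0020 SPACE, namespace
    if sample_entry == "TextMetaDataSampleEntry":
        return "mett" + " " + mime_format  # "mett", U+0020 SPACE, mime_format
    return ""
```
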
  4. Track Attributes for sourced Audio and Video Tracks

    Attribute How to source its value
    id Decimal representation of the 'track_ID' of a TrackHeaderBox ('tkhd') in a TrackBox ('trak').
    kind
    • "alternative": not used
    • "captions": not used
    • "descriptions": not used
    • "main": first audio (video) track
    • "main-desc": not used
    • "sign": not used
    • "subtitles": not used
    • "translation": not first audio (video) track
    • "commentary": not used
    • "": otherwise
    label Content of the 'name' field in the HandlerBox.
    language Content of the 'language' field in the MediaHeaderBox.
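
The 'language' field in the MediaHeaderBox is stored as a packed ISO 639-2/T code: three 5-bit values, each being the lower-case letter's character code minus 0x60. A minimal decoding sketch follows; the function name is hypothetical, and any further conversion of the three-letter code is outside the scope of this sketch.

```python
def decode_mdhd_language(packed: int) -> str:
    """Decode the packed 15-bit language field of a MediaHeaderBox."""
    packed &= 0x7FFF  # drop the 1-bit padding above the three 5-bit values
    # Each 5-bit group is a letter offset by 0x60.
    chars = [((packed >> shift) & 0x1F) + 0x60 for shift in (10, 5, 0)]
    return "".join(chr(c) for c in chars)
```
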
  5. Mapping Text Track content into text track cues for MPEG-4 ISOBMFF

    ISOBMFF text tracks may be in the WebVTT or TTML format [ISO14496-30], 3GPP Timed Text format [3GPP-TT], or other format.

    ISOBMFF text tracks carry WebVTT data if the media handler type is "text" and a 'WVTTSampleEntry' format is used, as described in [ISO14496-30]. Browsers that can render text tracks in the WebVTT format should expose a VTTCue [WEBVTT] as follows:

    Attribute How to source its value
    id The 'cue_id' field in the 'CueIDBox'.
    startTime The sample presentation time.
    endTime The sum of the startTime and the sample duration.
    pauseOnExit "false"
    cue setting attributes The 'settings' field in the 'CueSettingsBox'.
    text The 'cue_text' field in the 'CuePayloadBox'.

    ISOBMFF text tracks carry TTML data if the media handler type is "subt" and an 'XMLSubtitleSampleEntry' format is used with a TTML-based 'namespace' field, as described in [ISO14496-30]. Browsers that can render text tracks in the TTML format should expose an as yet to be defined TTMLCue. Alternatively, browsers can also map the TTML features to VTTCue objects. Finally, browsers that cannot render TTML [ttaf1-dfxp] format data should expose it as DataCue objects [HTML5]. Each TTML subtitle sample consists of an XML document and creates a DataCue object with attributes sourced as follows:

    Attribute How to source its value
    id Decimal representation of the 'id' attribute of the 'head' element in the XML document. Null if there is no 'id' attribute.
    startTime Value of the beginning media time of the top-level temporal interval of the XML document.
    endTime Value of the ending media time of the top-level temporal interval of the XML document.
    pauseOnExit "false"
    data The (UTF-16 encoded) ArrayBuffer composing the XML document.

    TTML data may contain tunneled CEA708 captions [SMPTE2052-11]. Browsers that can render CEA708 data should expose it as defined for MPEG-2 TS CEA708 cues.

    3GPP timed text data is carried in ISOBMFF as described in [3GPP-TT]. Browsers that can render text tracks in the 3GPP Timed Text format should expose an as yet to be defined 3GPPCue. Alternatively, browsers can also map the 3GPP features to VTTCue objects.

5. WebM

MIME type/subtype: audio/webm, video/webm
  1. Track Order

    The order of tracks specified in the EBML initialisation segment [WebM] is maintained when sourcing multiple WebM tracks into HTML.

  2. Determining the type of track

    A user agent recognises and supports data from a WebM resource as being equivalent to a HTML track based on the value of the 'TrackType' field of the track in the Segment info:

    • text track: 'TrackType' field is "0x11" or "0x21"
    • video track: 'TrackType' field is "0x01"
    • audio track: 'TrackType' field is "0x02"
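
The 'TrackType' mapping above is a direct table lookup, sketched below. The names are hypothetical, and returning the empty string for other 'TrackType' values is an assumption of this sketch.

```python
WEBM_TRACK_TYPES = {
    0x01: "video",
    0x02: "audio",
    0x11: "text",
    0x21: "text",
}


def webm_track_type(track_type: int) -> str:
    """Map a WebM 'TrackType' value to 'video', 'audio' or 'text'."""
    # assumption: values not listed above are not exposed as tracks
    return WEBM_TRACK_TYPES.get(track_type, "")
```
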
  3. Track Attributes for sourced Text Tracks

    WebM has defined how to store WebVTT [WEBVTT] files in WebM [WebM][WEBVTT-WEBM]. Chapter tracks are sourced differently from text tracks of other kinds; this is explained below the table.

    Attribute How to source its value
    id Decimal representation of the 'TrackNumber' field of the track in the "Track" section of the WebM file Segment.
    kind

    Map the content of the 'TrackType' and 'CodecID' fields of the track as follows:

    • "captions": 'TrackType' is "0x11" and 'CodecID' is "D_WEBVTT/captions"
    • "subtitles": 'TrackType' is "0x11" and 'CodecID' is "D_WEBVTT/subtitles"
    • "descriptions": 'TrackType' is "0x11" and 'CodecID' is "D_WEBVTT/descriptions"
    • "metadata": otherwise
    label Content of the 'name' field of the track.
    language Content of the 'language' field of the track.
    inBandMetadataTrackDispatchType If @kind is "metadata", then the value of the 'CodecID' element. The empty string otherwise.
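
The 'kind' and 'inBandMetadataTrackDispatchType' rows of the table above could be sketched as follows. The function names are hypothetical, and chapter tracks are not covered because they are sourced separately.

```python
_WEBVTT_KINDS = {
    "D_WEBVTT/captions": "captions",
    "D_WEBVTT/subtitles": "subtitles",
    "D_WEBVTT/descriptions": "descriptions",
}


def webm_text_track_kind(track_type: int, codec_id: str) -> str:
    """Map 'TrackType' and 'CodecID' of a WebM text track to a kind."""
    if track_type == 0x11:
        return _WEBVTT_KINDS.get(codec_id, "metadata")
    return "metadata"


def webm_dispatch_type(track_type: int, codec_id: str) -> str:
    """Return the 'CodecID' for metadata tracks, the empty string otherwise."""
    if webm_text_track_kind(track_type, codec_id) == "metadata":
        return codec_id
    return ""
```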

    Tracks of kind "chapters" are found in the "Chapters" section of the WebM file Segment, which are all at the beginning of the WebM file, such that chapters can be used for navigation. The details of this mapping have not been specified yet and simply point to the more powerful Matroska chapter specification [Matroska]. Presumably, the 'id' attribute could be found in 'EditionUID', 'label' is empty, and 'language' can come from the first ChapterAtom's 'ChapLanguage' value.

    Note

    The Matroska container format, which is the basis for WebM, has specifications for other text tracks, in particular SRT, SSA/ASS, and VOBSUB. The described attribute mappings can be applied to these, too, except that the 'kind' field will always be "subtitles". The information of their 'CodecPrivate' field is exposed in the 'inBandMetadataTrackDispatchType' attribute.

  4. Track Attributes for sourced Audio and Video Tracks

    Attribute How to source its value
    id Decimal representation of the 'TrackNumber' field of the track in the Segment info.
    kind
    • "alternative": not used
    • "captions": not used
    • "descriptions": not used
    • "main": the 'FlagDefault' element is set on the track
    • "main-desc": not used
    • "sign": not used
    • "subtitles": not used
    • "translation": not first audio (video) track
    • "commentary": not used
    • "": otherwise
    label Content of the 'name' field of the track in the Segment info.
    language Content of the 'language' field of the track in the Segment info.
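
    One possible reading of the 'kind' rules above is sketched below. The table does not state which rule wins when several apply, so the order used here ("main" first, then "translation") is an assumption, as are the function and parameter names.

```python
def webm_av_track_kind(flag_default: bool, index_among_same_type: int) -> str:
    """Sketch of the 'kind' mapping for a WebM audio or video track.

    index_among_same_type is the zero-based position of the track among
    the tracks of the same type (audio or video) in the resource.
    """
    if flag_default:
        # 'FlagDefault' is set on the track.
        return "main"
    if index_among_same_type > 0:
        # Not the first audio (video) track.
        return "translation"
    return ""
```
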
  5. Mapping Text Track content into text track cues

    The only types of text tracks that WebM is defined for are in the WebVTT format [WEBVTT-WEBM]. Therefore, cues on a text track are created as VTTCue objects [WEBVTT]. Each 'Block' in the 'BlockGroup' of the WebM track that has the actual data of the text track creates a VTTCue object with its TextTrackCue attributes sourced as follows:

    Attribute How to source its value
    id First line of the Block's data.
    startTime Calculated from the 'BlockTimecode' field in the Block's header and the 'Timecode' field in the Cluster relative to which 'BlockTimecode' is specified.
    endTime Calculated from the 'BlockDuration' field in the Block's header and the startTime.
    pauseOnExit "false"
    cue setting attributes Parsed from the second line of the Block's data.
    text The third and all following lines of the Block's data.
    Note

    Text tracks in other formats carried in the Matroska container can also be mapped to TextTrackCue objects. These are created as DataCue objects [HTML5] with the 'id', 'startTime', 'endTime', and 'pauseOnExit' attributes filled identically to the VTTCue objects, and the 'data' attribute containing the Block's data.
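
    The VTTCue attribute table in this subsection can be illustrated with the following sketch. It rests on several assumptions that the text does not spell out: the Block's data has already been decoded to a string with lines separated by "\n", the Cluster 'Timecode', the 'BlockTimecode' and the 'BlockDuration' are all expressed in ticks of the Segment's timecode scale, and that scale defaults to 1,000,000 nanoseconds as in Matroska. The `CueFields` class and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CueFields:
    """Hypothetical container for the values used to build a VTTCue."""
    id: str
    start_time: float  # seconds
    end_time: float    # seconds
    pause_on_exit: bool
    settings: str
    text: str


def webm_block_to_cue(block_data: str, cluster_timecode: int,
                      block_timecode: int, block_duration: int,
                      timecode_scale_ns: int = 1_000_000) -> CueFields:
    lines = block_data.split("\n")
    cue_id = lines[0]                               # first line: cue id
    settings = lines[1] if len(lines) > 1 else ""   # second line: cue settings
    text = "\n".join(lines[2:])                     # remaining lines: cue text

    # BlockTimecode is relative to the Cluster's Timecode (assumed same unit).
    start_ticks = cluster_timecode + block_timecode
    end_ticks = start_ticks + block_duration
    start = start_ticks * timecode_scale_ns / 1_000_000_000
    end = end_ticks * timecode_scale_ns / 1_000_000_000
    return CueFields(cue_id, start, end, False, settings, text)
```

    For instance, a Block at 'BlockTimecode' 250 in a Cluster with 'Timecode' 5000 and the default scale would start at 5.25 seconds under these assumptions.
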

6. Ogg

MIME type/subtype: audio/ogg, video/ogg
  1. Track Order

    The order of tracks specified in the Skeleton fisbone headers [OGGSKELETON] is maintained when sourcing multiple Ogg tracks into HTML. If no Skeleton track is available, the order of the "beginning of stream" (BOS) pages determines track order [OGG].

  2. Determining the type of track

    A user agent recognises and supports data from an Ogg resource as being equivalent to a HTML track based on the value of the 'Role' field of the fisbone header in Ogg Skeleton:

    • text track: 'Role' starts with "text"
    • video track: 'Role' starts with "video"
    • audio track: 'Role' starts with "audio"

    If no Skeleton track is available, determine the type based on the codec used in the BOS pages, e.g. a Vorbis stream is an audio track and a Theora stream is a video track.
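
    A minimal sketch of this decision follows. It assumes the 'Role' value is passed as `None` when no Skeleton track is available, and that the codec has already been identified from the BOS page as a short name; the function name and the codec table (limited to the two examples given above) are hypothetical.

```python
from typing import Optional

# Codec fallback used only when no Skeleton 'Role' is available.
# Only the two codecs named in the text are listed here.
_CODEC_FALLBACK = {"vorbis": "audio", "theora": "video"}


def ogg_track_type(role: Optional[str],
                   codec: Optional[str] = None) -> Optional[str]:
    """Return "text", "video" or "audio", or None if undetermined."""
    if role is not None:
        for prefix in ("text", "video", "audio"):
            if role.startswith(prefix):
                return prefix
        return None
    if codec is not None:
        return _CODEC_FALLBACK.get(codec.lower())
    return None
```
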

  3. Track Attributes for sourced Text Tracks

    Attribute How to source its value
    id Content of the 'name' message header field of the fisbone header in Ogg Skeleton. If no Skeleton header is available, use a decimal representation of the stream's serialnumber as given in the BOS.
    kind

    Map the content of the 'Role' message header fields of Ogg Skeleton as follows:

    • "captions": 'Role' is "text/captions"
    • "subtitles": 'Role' is "text/subtitle" or "text/karaoke"
    • "descriptions": 'Role' is "text/textaudiodesc"
    • "chapters": 'Role' is "text/chapters"
    • "metadata": otherwise
    label Content of the 'title' message header field of the fisbone header. If no Skeleton header is available, the empty string.
    language Content of the 'language' message header field of the fisbone header. If no Skeleton header is available, the empty string.
    inBandMetadataTrackDispatchType If @kind is "metadata", then the value of the 'Role' header field. The empty string otherwise.
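
    The 'kind' and 'inBandMetadataTrackDispatchType' rows above could be computed as in the following sketch. The function names are hypothetical, and the 'Role' comparison is assumed to be an exact string match.

```python
# 'Role' values as listed in the table above (assumed exact match).
_OGG_TEXT_KINDS = {
    "text/captions": "captions",
    "text/subtitle": "subtitles",
    "text/karaoke": "subtitles",
    "text/textaudiodesc": "descriptions",
    "text/chapters": "chapters",
}


def ogg_text_track_kind(role: str) -> str:
    """Map a Skeleton 'Role' value to a TextTrack kind."""
    return _OGG_TEXT_KINDS.get(role, "metadata")


def ogg_text_track_dispatch_type(role: str) -> str:
    """Return the 'Role' value for metadata tracks, else the empty string."""
    return role if ogg_text_track_kind(role) == "metadata" else ""
```
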
  4. Track Attributes for sourced Audio and Video Tracks

    Attribute How to source its value
    id Content of the 'name' message header field of the fisbone header in Ogg Skeleton. If no Skeleton header is available, use a decimal representation of the stream's serialnumber as given in the BOS.
    kind

    Map the content of the 'Role' message header fields of Ogg Skeleton as follows:

    • "alternative": 'Role' is "audio/alternate" or "video/alternate"
    • "captions": 'Role' is "video/captioned"
    • "descriptions": 'Role' is "audio/audiodesc"
    • "main": 'Role' is "audio/main" or "video/main"
    • "main-desc": 'Role' is "audio/described"
    • "sign": 'Role' is "video/sign"
    • "subtitles": 'Role' is "video/subtitled"
    • "translation": 'Role' is "audio/dub"
    • "commentary": 'Role' is "audio/commentary"
    • "": otherwise
    label Content of the 'title' message header field of the fisbone header. If no Skeleton header is available, the empty string.
    language Content of the 'language' message header field of the fisbone header. If no Skeleton header is available, the empty string.
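
    The 'id' and 'kind' rows of the table above can be sketched as follows. The function names are hypothetical, the 'name' value is assumed to be passed as `None` when no Skeleton header is available, and the 'Role' comparison is assumed to be an exact string match.

```python
from typing import Optional

# 'Role' values as listed in the table above (assumed exact match).
_OGG_AV_KINDS = {
    "audio/alternate": "alternative",
    "video/alternate": "alternative",
    "video/captioned": "captions",
    "audio/audiodesc": "descriptions",
    "audio/main": "main",
    "video/main": "main",
    "audio/described": "main-desc",
    "video/sign": "sign",
    "video/subtitled": "subtitles",
    "audio/dub": "translation",
    "audio/commentary": "commentary",
}


def ogg_av_track_kind(role: Optional[str]) -> str:
    """Map a Skeleton 'Role' value to an AudioTrack/VideoTrack kind."""
    if role is None:
        return ""
    return _OGG_AV_KINDS.get(role, "")


def ogg_track_id(name: Optional[str], serial_number: int) -> str:
    """Use the fisbone 'name' field, else the decimal stream serial number."""
    return name if name is not None else str(serial_number)
```
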
  5. Mapping Text Track content into text track cues

    TBD

A. Acknowledgements

Thanks to all In-band Track Community Group members in helping to create this specification.

Thanks also to the WHATWG and W3C HTML WG where a part of this specification originated.


B. References

B.1 Informative references

[3GPP-TT]
Transparent end-to-end Packet switched Streaming Service (PSS) Timed text format (Release 12). URL: http://www.3gpp.org/ftp/Specs/archive/26_series/26.245/26245-c00.zip
[ATSC52]
Digital Audio Compression (AC-3, E-AC-3). 17 December 2012. URL: http://www.atsc.org/cms/standards/A52-2012(12-17).pdf
[ATSC53-4]
MPEG-2 Video System Characteristics. 7 August 2009. URL: http://www.atsc.org/cms/standards/a53/a_53-Part-4-2009.pdf
[ATSC65]
Program and System Information Protocol for Terrestrial Broadcast and Cable. 7 August 2013. URL: http://www.atsc.org/cms/standards/A65_2013.pdf
[ATSC72-1]
Video System Characteristics of AVC in the ATSC Digital Television System. 18 February 2014. URL: http://www.atsc.org/cms/standards/a72/A72-Part-1-2014.pdf
[CEA708]
Digital Television (DTV) Closed Captioning CEA-708-B. URL: http://www.ce.org/Standards/Standard-Listings/R4-3-Television-Data-Systems-Subcommittee/CEA-708-D.aspx
[DVB-SI]
ETSI EN 300 468: "Digital Video Broadcasting (DVB); Specification for Service Information (SI) in DVB systems". URL: http://www.etsi.org/deliver/etsi_en/300400_300499/300468/01.14.01_60/en_300468v011401p.pdf
[DVB-SUB]
ETSI EN 300 743: "Digital Video Broadcasting (DVB); Subtitling systems". URL: http://www.etsi.org/deliver/etsi_en/300700_300799/300743/01.05.01_60/en_300743v010501p.pdf
[DVB-TXT]
ETSI EN 300 472: "Digital Video Broadcasting (DVB); Specification for conveying ITU-R System B Teletext in DVB bitstreams". URL: http://www.etsi.org/deliver/etsi_en/300400_300499/300472/01.03.01_60/en_300472v010301p.pdf
[DVB-VBI]
ETSI EN 301 775: "Digital Video Broadcasting (DVB); Specification for the carriage of Vertical Blanking Information (VBI) data in DVB bitstreams". URL: http://www.etsi.org/deliver/etsi_en/301700_301799/301775/01.02.01_60/en_301775v010201p.pdf
[HTML]
Ian Hickson. HTML. Living Standard. URL: https://html.spec.whatwg.org/
[HTML5]
Robin Berjon; Steve Faulkner; Travis Leithead; Erika Doyle Navara; Edward O'Connor; Silvia Pfeiffer. HTML5. 16 September 2014. W3C Proposed Recommendation. URL: http://www.w3.org/TR/html5/
[ISO14496-30]
Information technology — Coding of audio-visual objects — Part 30: Timed text and other visual overlays in ISO base media file format. 11 March 2014. URL: http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=63107
[ISOBMFF]
Information technology -- Coding of audio-visual objects -- Part 12: ISO base media file format ISO/IEC 14496-12:2012. URL: http://standards.iso.org/ittf/PubliclyAvailableStandards/c061988_ISO_IEC_14496-12_2012.zip
[MPEG2TS]
Information technology -- Generic coding of moving pictures and associated audio information: Systems ITU-T Rec. H.222.0 / ISO/IEC 13818-1:2013. URL: http://www.itu.int/rec/T-REC-H.222.0-201206-I
[MPEGDASH]
ISO/IEC 23009-1:2014 Information technology -- Dynamic adaptive streaming over HTTP (DASH) -- Part 1: Media presentation description and segment formats. URL: http://standards.iso.org/ittf/PubliclyAvailableStandards/c065274_ISO_IEC_23009-1_2014.zip
[Matroska]
Matroska Specifications. 9 January 2014. URL: http://matroska.org/technical/specs/index.html
[OGG]
S. Pfeiffer. The Ogg Encapsulation Format Version 0 (RFC 3533). May 2003. RFC. URL: http://www.ietf.org/rfc/rfc3533.txt
[OGGSKELETON]
Ogg Skeleton 4 Message Headers. 17 March 2014. URL: http://wiki.xiph.org/SkeletonHeaders
[SCTE128-1]
ANSI/SCTE 128-1 2013 AVC Constraints for Cable Television Part 1- Coding. URL: http://www.scte.org/documents/pdf/Standards/ANSI_SCTE%20128-1%202013.pdf
[SCTE193-2]
SCTE 193-2 2014 MPEG-4 AAC Family Audio System – Part 2 Constraints for Carriage over MPEG-2 Transport. URL: http://www.scte.org/documents/pdf/standards/SCTE%20193-2%202014.pdf
[SCTE27]
Subtitling Methods For Broadcast Cable. URL: http://www.scte.org/documents/pdf/Standards/ANSI_SCTE_27_2011.pdf
[SMPTE2052-11]
Conversion from CEA-708 Caption Data to SMPTE-TT. URL: https://www.smpte.org/sites/default/files/RP2052-11-2013.pdf
[VTT708]
Silvia Pfeiffer. Conversion of 608/708 captions to WebVTT. Draft Community Group Report. URL: https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html
[WEBVTT]
Silvia Pfeiffer; Philip Jägenstedt; Ian Hickson. WebVTT: The Web Video Text Tracks Format. 16 May 2014. W3C Editor's Draft. URL: http://dev.w3.org/html5/webvtt/
[WEBVTT-WEBM]
Matthew Heaney; Frank Galligan. Embedding WebVTT in WebM. 1 February 2012. URL: http://wiki.webmproject.org/webm-metadata/temporal-metadata/webvtt-in-webm
[WebM]
WebM Container Guidelines. 28 April 2014. URL: http://www.webmproject.org/code/specs/container/
[ttaf1-dfxp]
Glenn Adams. Timed Text Markup Language (TTML) 1.0 (Second Edition). 9 July 2013. W3C Proposed Edited Recommendation. URL: http://www.w3.org/TR/ttaf1-dfxp/