Network Working Group                                        D. Connolly
Internet-Draft                           World Wide Web Consortium (W3C)
Category: Informational                                         Aug 2007
<draft-connolly-html5-type-sniffing-00.txt>

HTML 5 rules for determining content types

Status: Internet-draft-to-be

I'm looking for a co-author to help route feedback from the IETF to the W3C HTML WG. @@

Please send comments to public-html-comments@w3.org

$Revision: 1.1 $ of $Date: 2007-08-17 20:35:38 $

Introduction

The HTTP specification [HTTP], in section 14.17 (Content-Type), says: "The Content-Type entity-header field indicates the media type of the entity-body sent to the recipient".

The HTML 5 specification[HTML5] specifies an algorithm for determining content types based on widely deployed practices and software.

These specifications conflict in some cases. (@@ extract test cases from Step 10 of Feed/HTML sniffing, part of the detailed review of "Determining the type of a new resource in a browsing context")

According to a straightforward architecture for content types in the Web [META], the HTTP specification should suffice and the HTML 5 specification need not specify another algorithm. But that architecture assumes that Web publishers (server administrators and content developers) reliably label content. Given that labelling by Web publishers is widely unreliable, and that software working around these problems is widespread, the choices seem to be:

While the second option is unappealing, the first option seems infeasible.

The IETF community is invited to review the HTML 5 algorithm in detail.

4.7. Determining the type of a new resource in a browsing context

It is imperative that the rules in this section be followed exactly. When two user agents use different heuristics for content type detection, security problems can occur. For example, if a server believes a contributed file to be an image (and thus benign), but a Web browser believes the content to be HTML (and thus capable of executing script), the end user can be exposed to malicious content, making the user vulnerable to cookie theft attacks and other cross-site scripting attacks.

The sniffed type of a resource must be found as follows:

  1. If the resource was fetched over an HTTP protocol, and there is no HTTP Content-Encoding header, but there is an HTTP Content-Type header and it has a value whose bytes exactly match one of the following three lines:

    Bytes in Hexadecimal                                                                       Textual representation
    74 65 78 74 2f 70 6c 61 69 6e                                                              text/plain
    74 65 78 74 2f 70 6c 61 69 6e 3b 20 63 68 61 72 73 65 74 3d 49 53 4f 2d 38 38 35 39 2d 31  text/plain; charset=ISO-8859-1
    74 65 78 74 2f 70 6c 61 69 6e 3b 20 63 68 61 72 73 65 74 3d 69 73 6f 2d 38 38 35 39 2d 31  text/plain; charset=iso-8859-1

    ...then jump to the text or binary section below.

  2. Let official type be the type given by the Content-Type metadata for the resource (in lowercase, ignoring any parameters). If there is no such type, jump to the unknown type step below.

    ...or if the type has no slash or is */*? Probably we should also sniff in that case.

  3. If official type ends in "+xml", or if it is either "text/xml" or "application/xml", then the sniffed type of the resource is official type; return that and abort these steps.

  4. If official type is an image type supported by the user agent (e.g. "image/png", "image/gif", "image/jpeg"), then jump to the images section below.

  5. If official type is "text/html", then jump to the feed or HTML section below.

  6. Otherwise, the sniffed type of the resource is official type.
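
The dispatch in steps 1 to 6 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name choose_branch, the branch labels it returns, and the set SUPPORTED_IMAGE_TYPES are hypothetical names introduced for illustration, and the open question about types with no slash or "*/*" is not addressed.

```python
# Exact header values (as bytes) that trigger the text-or-binary check.
TEXT_PLAIN_TRIGGERS = {
    b"text/plain",
    b"text/plain; charset=ISO-8859-1",
    b"text/plain; charset=iso-8859-1",
}

XML_TYPES = {"text/xml", "application/xml"}

# Assumption: the image types this hypothetical user agent supports.
SUPPORTED_IMAGE_TYPES = {"image/png", "image/gif", "image/jpeg", "image/bmp"}


def choose_branch(content_type, content_encoding=None, over_http=True):
    """Return (branch, official_type) for a raw Content-Type value.

    content_type and content_encoding are header values as bytes,
    or None when the header is absent.
    """
    # Step 1: exact byte match, only for HTTP without Content-Encoding.
    if (over_http and content_encoding is None
            and content_type in TEXT_PLAIN_TRIGGERS):
        return ("text-or-binary", None)

    # Step 2: official type is lowercased, with parameters ignored.
    if content_type is None:
        return ("unknown", None)
    official = content_type.decode("latin-1").split(";", 1)[0].strip().lower()
    if not official:
        return ("unknown", None)

    # Step 3: XML types are taken at face value.
    if official.endswith("+xml") or official in XML_TYPES:
        return ("done", official)
    # Step 4: supported image types go to the image check.
    if official in SUPPORTED_IMAGE_TYPES:
        return ("image", official)
    # Step 5: text/html goes to the feed-or-HTML check.
    if official == "text/html":
        return ("feed-or-html", official)
    # Step 6: everything else keeps its official type.
    return ("done", official)
```

Note that a "text/plain" value with any other parameter, or one sent together with a Content-Encoding header, falls through to step 6 and is not sniffed.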

4.7.1. Content-Type sniffing: text or binary

  1. The user agent may wait for 512 or more bytes of the resource to be available.

  2. Let n be the smaller of either 512 or the number of bytes already available.

  3. If n is 4 or more, and the first bytes of the file match one of the following byte sets:

    Bytes in Hexadecimal   Description
    FE FF                  UTF-16BE BOM
    FF FE                  UTF-16LE BOM or UTF-32LE BOM
    00 00 FE FF            UTF-32BE BOM
    EF BB BF               UTF-8 BOM

    ...then the sniffed type of the resource is "text/plain".

  4. Otherwise, if any of the first n bytes of the resource are in one of the following byte ranges:

    • 0x00 - 0x08
    • 0x0E - 0x1A
    • 0x1C - 0x1F

    ...then the sniffed type of the resource is "application/octet-stream".

  5. Otherwise, the sniffed type of the resource is "text/plain".
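
The text-or-binary check above can be sketched as follows. This is an illustrative sketch only; the function name sniff_text_or_binary is hypothetical, and the caller is assumed to pass whatever bytes are already available.

```python
# Byte order marks listed in step 3.
BOMS = (b"\xfe\xff", b"\xff\xfe", b"\x00\x00\xfe\xff", b"\xef\xbb\xbf")


def _is_binary_byte(b):
    """True for bytes in 0x00-0x08, 0x0E-0x1A or 0x1C-0x1F (step 4)."""
    return b <= 0x08 or 0x0E <= b <= 0x1A or 0x1C <= b <= 0x1F


def sniff_text_or_binary(available):
    """Classify the bytes available so far as text or binary."""
    head = available[:512]      # step 2: n is at most 512
    n = len(head)
    # Step 3: a leading BOM means text, but only when n is 4 or more.
    if n >= 4 and head.startswith(BOMS):
        return "text/plain"
    # Step 4: any binary control byte in the first n bytes means binary.
    if any(_is_binary_byte(b) for b in head):
        return "application/octet-stream"
    # Step 5: otherwise treat the resource as text.
    return "text/plain"
```

Tab, LF, VT, FF, CR and ESC (0x09 to 0x0D and 0x1B) fall outside the listed ranges, so ordinary text files that contain them are still sniffed as "text/plain".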

4.7.2. Content-Type sniffing: unknown type

  1. The user agent may wait for 512 or more bytes of the resource to be available.

  2. Let stream length be the smaller of either 512 or the number of bytes already available.

  3. For each row in the table below:

    1. Let pattern length be the length of the pattern (number of bytes described by the cell in the second column of the row).
    2. If stream length is smaller than pattern length then skip this row.
    3. Apply the bitwise "and" operator to the first pattern length bytes of the resource and the given mask (the bytes in the cell in the first column of that row), and let the result be the data.
    4. If the bytes of the data match the given pattern bytes exactly, then the sniffed type of the resource is the type given in the cell of the third column in that row; abort these steps.
  4. As a last-ditch effort, jump to the text or binary section.

Mask (hex)                                 Pattern (hex)                              Sniffed type            Comment
FF FF DF DF DF DF DF DF DF FF DF DF DF DF  3C 21 44 4F 43 54 59 50 45 20 48 54 4D 4C  text/html               The string "<!DOCTYPE HTML" in US-ASCII or compatible encodings, case-insensitively.
FF DF DF DF DF                             3C 48 54 4D 4C                             text/html               The string "<HTML" in US-ASCII or compatible encodings, case-insensitively.
FF DF DF DF DF DF DF                       3C 53 43 52 49 50 54                       text/html               The string "<SCRIPT" in US-ASCII or compatible encodings, case-insensitively.
FF FF FF FF FF                             25 50 44 46 2D                             application/pdf         The string "%PDF-", the PDF signature.
FF FF FF FF FF FF FF FF FF FF FF           25 21 50 53 2D 41 64 6F 62 65 2D           application/postscript  The string "%!PS-Adobe-", the PostScript signature.
FF FF FF FF FF FF                          47 49 46 38 37 61                          image/gif               The string "GIF87a", a GIF signature.
FF FF FF FF FF FF                          47 49 46 38 39 61                          image/gif               The string "GIF89a", a GIF signature.
FF FF FF FF FF FF FF FF                    89 50 4E 47 0D 0A 1A 0A                    image/png               The PNG signature.
FF FF FF                                   FF D8 FF                                   image/jpeg              A JPEG SOI marker followed by the first byte of another marker.
FF FF                                      42 4D                                      image/bmp               The string "BM", a BMP signature.

User agents may support further types if desired, by implicitly adding to the above table. However, user agents should not use any other patterns for types already mentioned in the table above, as this could then be used for privilege escalation (where, e.g., a server uses the above table to determine that content is not HTML and thus safe from XSS attacks, but then a user agent detects it as HTML anyway and allows script to execute).
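
The mask-and-pattern matching for unknown types can be sketched as follows. This is a sketch under stated assumptions rather than a definitive implementation: the helper row and the function sniff_unknown are hypothetical names, a row is skipped when fewer bytes are available than its pattern needs, and None is returned to tell the caller to fall back to the text or binary section.

```python
def row(pattern_hex, sniffed_type, mask_hex=None):
    """Build one table row; the mask defaults to all FF bytes."""
    pattern = bytes.fromhex(pattern_hex)
    mask = bytes.fromhex(mask_hex) if mask_hex else b"\xff" * len(pattern)
    return mask, pattern, sniffed_type


# The rows of the table above. A DF mask byte clears bit 0x20, which
# folds ASCII lowercase letters to uppercase before comparison.
TABLE = [
    row("3C 21 44 4F 43 54 59 50 45 20 48 54 4D 4C", "text/html",
        "FF FF DF DF DF DF DF DF DF FF DF DF DF DF"),
    row("3C 48 54 4D 4C", "text/html", "FF DF DF DF DF"),
    row("3C 53 43 52 49 50 54", "text/html", "FF DF DF DF DF DF DF"),
    row("25 50 44 46 2D", "application/pdf"),
    row("25 21 50 53 2D 41 64 6F 62 65 2D", "application/postscript"),
    row("47 49 46 38 37 61", "image/gif"),
    row("47 49 46 38 39 61", "image/gif"),
    row("89 50 4E 47 0D 0A 1A 0A", "image/png"),
    row("FF D8 FF", "image/jpeg"),
    row("42 4D", "image/bmp"),
]


def sniff_unknown(available):
    """Return the sniffed type, or None if no row matches."""
    head = available[:512]
    for mask, pattern, sniffed_type in TABLE:
        # Assumption: skip rows needing more bytes than are available.
        if len(head) < len(pattern):
            continue
        # zip stops at the end of the mask, so only the first
        # len(pattern) bytes of the resource are masked and compared.
        data = bytes(b & m for b, m in zip(head, mask))
        if data == pattern:
            return sniffed_type
    return None  # caller jumps to the text or binary section
```

A server that wants to predict what a conforming user agent will do with an unlabelled upload could run the same table over the first 512 bytes.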

4.7.3. Content-Type sniffing: image

If the first bytes of the file match one of the byte sequences in the first column of the following table, then the sniffed type of the resource is the type given in the corresponding cell in the second column on the same row:

Bytes in Hexadecimal     Sniffed type  Comment
47 49 46 38 37 61        image/gif     The string "GIF87a", a GIF signature.
47 49 46 38 39 61        image/gif     The string "GIF89a", a GIF signature.
89 50 4E 47 0D 0A 1A 0A  image/png     The PNG signature.
FF D8 FF                 image/jpeg    A JPEG SOI marker followed by the first byte of another marker.
42 4D                    image/bmp     The string "BM", a BMP signature.

User agents must ignore any rows for image types that they do not support.

Otherwise, the sniffed type of the resource is the same as its official type.
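
A minimal sketch of the image check, assuming the caller passes the set of image types the user agent supports (the function name sniff_image and its parameters are hypothetical):

```python
IMAGE_SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"BM", "image/bmp"),
]


def sniff_image(head, official_type, supported):
    """Return the signature's type if it matches, else official_type."""
    for signature, sniffed_type in IMAGE_SIGNATURES:
        # Rows for unsupported image types are ignored.
        if sniffed_type in supported and head.startswith(signature):
            return sniffed_type
    return official_type
```

The effect is that a mislabelled image of a supported type is corrected, while anything unrecognized keeps the label the server gave it.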

4.7.4. Content-Type sniffing: feed or HTML

  1. The user agent may wait for 512 or more bytes of the resource to be available.

  2. Let s be the stream of bytes, and let s[i] represent the byte in s with position i, treating s as zero-indexed (so the first byte is at i=0).

  3. If at any point this algorithm requires the user agent to determine the value of a byte in s which is not yet available, or which is past the first 512 bytes of the resource, or which is beyond the end of the resource, the user agent must stop this algorithm, and assume that the sniffed type of the resource is "text/html".

    User agents are allowed, by the first step of this algorithm, to wait until the first 512 bytes of the resource are available.

  4. Initialise pos to 0.

  5. Examine s[pos].

    If it is 0x09 (ASCII tab), 0x20 (ASCII space), 0x0A (ASCII LF), or 0x0D (ASCII CR)
      Increase pos by 1 and repeat this step.
    If it is 0x3C (ASCII "<")
      Increase pos by 1 and go to the next step.
    If it is anything else
      The sniffed type of the resource is "text/html". Abort these steps.
  6. If the bytes with positions pos to pos+2 in s are exactly equal to 0x21, 0x2D, 0x2D respectively (ASCII for "!--"), then:

    1. Increase pos by 3.
    2. If the bytes with positions pos to pos+2 in s are exactly equal to 0x2D, 0x2D, 0x3E respectively (ASCII for "-->"), then increase pos by 3 and jump back to the previous step (step 5) in the overall algorithm in this section.
    3. Otherwise, increase pos by 1.
    4. Return to step 2 in these substeps.
  7. If s[pos] is 0x21 (ASCII "!"):

    1. Increase pos by 1.
    2. If s[pos] equals 0x3E, then increase pos by 1 and jump back to step 5 in the overall algorithm in this section.
    3. Otherwise, return to step 1 in these substeps.
  8. If s[pos] is 0x3F (ASCII "?"):

    1. Increase pos by 1.
    2. If s[pos] and s[pos+1] equal 0x3F and 0x3E respectively, then increase pos by 2 and jump back to step 5 in the overall algorithm in this section.
    3. Otherwise, return to step 1 in these substeps.
  9. Otherwise, if the bytes in s starting at pos match any of the sequences of bytes in the first column of the following table, then the user agent must follow the steps given in the corresponding cell in the second column of the same row.

    Bytes in Hexadecimal  Requirement                                                                    Comment
    72 73 73              The sniffed type of the resource is "application/rss+xml"; abort these steps   The three ASCII characters "rss"
    66 65 65 64           The sniffed type of the resource is "application/atom+xml"; abort these steps  The four ASCII characters "feed"
    72 64 66 3A 52 44 46  Continue to the next step in this algorithm                                    The ASCII characters "rdf:RDF"

    If none of the byte sequences above match the bytes in s starting at pos, then the sniffed type of the resource is "text/html". Abort these steps.

  10. If, before the next ">", you find two xmlns* attributes with http://www.w3.org/1999/02/22-rdf-syntax-ns# and http://purl.org/rss/1.0/ as the namespaces, then the sniffed type of the resource is "application/rss+xml"; abort these steps. (maybe we only need to check for http://purl.org/rss/1.0/ actually)

  11. Otherwise, the sniffed type of the resource is "text/html".
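
The feed-or-HTML scan can be sketched as follows. This is a sketch under stated assumptions, not a definitive implementation. The names sniff_feed_or_html, byte and starts are hypothetical. After "?>" the sketch advances past both bytes, since advancing by only one would leave pos on ">" and make step 5 return "text/html". Step 10 is approximated by searching for the two namespace URIs anywhere before the next ">", without parsing attributes.

```python
RDF_NS = b"http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = b"http://purl.org/rss/1.0/"


class _OutOfData(Exception):
    """Raised when the scan needs a byte that is not available."""


def sniff_feed_or_html(available):
    s = available[:512]

    def byte(i):
        # Step 3: needing an unavailable byte ends the algorithm.
        if i >= len(s):
            raise _OutOfData
        return s[i]

    def starts(seq, i):
        # Compares byte by byte, stopping at the first mismatch.
        return all(byte(i + k) == want for k, want in enumerate(seq))

    try:
        pos = 0
        while True:
            # Step 5: skip whitespace, then require "<".
            while byte(pos) in b"\t \n\r":
                pos += 1
            if byte(pos) != 0x3C:
                return "text/html"
            pos += 1
            if starts(b"!--", pos):        # step 6: comment
                pos += 3
                while not starts(b"-->", pos):
                    pos += 1
                pos += 3
            elif starts(b"!", pos):        # step 7: e.g. a DOCTYPE
                pos += 1
                while not starts(b">", pos):
                    pos += 1
                pos += 1
            elif starts(b"?", pos):        # step 8: processing instruction
                pos += 1
                while not starts(b"?>", pos):
                    pos += 1
                pos += 2                   # assumption: skip "?" and ">"
            else:
                break
        # Step 9: look at the name of the first element.
        if starts(b"rss", pos):
            return "application/rss+xml"
        if starts(b"feed", pos):
            return "application/atom+xml"
        if not starts(b"rdf:RDF", pos):
            return "text/html"
        # Step 10 (approximation): both namespaces before the next ">".
        end = s.find(b">", pos)
        if end == -1:
            return "text/html"
        tag = s[pos:end]
        if RDF_NS in tag and RSS_NS in tag:
            return "application/rss+xml"
        return "text/html"                 # step 11
    except _OutOfData:
        return "text/html"
```

Because every out-of-range read falls back to "text/html", a feed whose root element starts after the first 512 bytes is treated as HTML.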

For efficiency reasons, implementations may wish to implement this algorithm and the algorithm for detecting the character encoding of HTML documents in parallel.

4.7.5. Content-Type metadata

What explicit Content-Type metadata is associated with the resource (the resource's type information) depends on the protocol that was used to fetch the resource.

For HTTP resources, only the Content-Type HTTP header contributes any data; the explicit type of the resource is then the value of that header, interpreted as described by the HTTP specifications. [HTTP]

For resources fetched from the filesystem, user agents should use platform-specific conventions, e.g. operating system extension/type mappings.

Extensions must not be used for determining resource types for resources fetched over HTTP.

For resources fetched over most other protocols, e.g. FTP, there is no type information.

The algorithm for extracting an encoding from a Content-Type, given a string s, is as follows. It either returns an encoding or nothing.

  1. Skip characters in s up to and including the first U+003B SEMICOLON (;) character.

  2. Skip any U+0009, U+000A, U+000B, U+000C, U+000D, or U+0020 characters (i.e. spaces) that immediately follow the semicolon.

  3. If the next seven characters are not 'charset', return nothing.

  4. Skip any U+0009, U+000A, U+000B, U+000C, U+000D, or U+0020 characters that immediately follow the word 'charset' (there might not be any).

  5. If the next character is not a U+003D EQUALS SIGN ('='), return nothing.

  6. Skip any U+0009, U+000A, U+000B, U+000C, U+000D, or U+0020 characters that immediately follow the equals sign (there might not be any).

  7. Process the next character as follows:

    If it is a U+0022 QUOTATION MARK ('"') and there is a later U+0022 QUOTATION MARK ('"') in s
      Return the string between the two quotation marks.

    If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027 APOSTROPHE ("'") in s
      Return the string between the two apostrophes.

    If it is an unmatched U+0022 QUOTATION MARK ('"')
    If it is an unmatched U+0027 APOSTROPHE ("'")
      Return nothing.

    Otherwise
      Return the string from this character to the first U+0009, U+000A, U+000B, U+000C, U+000D, or U+0020 character or the end of s, whichever comes first.
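
The encoding extraction steps can be sketched as follows. This is a minimal sketch: the names extract_encoding and _skip_ws are hypothetical, "nothing" is modelled as None, the word 'charset' is matched case-sensitively as written in the steps, and a string with no semicolon is assumed to yield nothing.

```python
WHITESPACE = "\t\n\x0b\x0c\r "


def _skip_ws(s, i):
    """Return the index of the first non-whitespace character at or after i."""
    while i < len(s) and s[i] in WHITESPACE:
        i += 1
    return i


def extract_encoding(s):
    """Return the charset named in a Content-Type string, or None."""
    # Step 1: skip up to and including the first semicolon.
    i = s.find(";")
    if i == -1:
        return None
    # Step 2: skip whitespace after the semicolon.
    i = _skip_ws(s, i + 1)
    # Step 3: the word 'charset' must come next.
    if not s.startswith("charset", i):
        return None
    # Steps 4-5: optional whitespace, then an equals sign.
    i = _skip_ws(s, i + len("charset"))
    if not s.startswith("=", i):
        return None
    # Step 6: optional whitespace after the equals sign.
    i = _skip_ws(s, i + 1)
    # Step 7: quoted or unquoted value.
    quote = s[i:i + 1]
    if quote in ('"', "'"):
        end = s.find(quote, i + 1)
        # An unmatched quote yields nothing.
        return s[i + 1:end] if end != -1 else None
    j = i
    while j < len(s) and s[j] not in WHITESPACE:
        j += 1
    return s[i:j]
```

Only the first parameter after the type is examined, so a value such as "text/html; format=flowed; charset=utf-8" yields nothing under these steps.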

Acknowledgements/@@Fodder

@@more context; meanwhile, see: Step 10 of Feed/HTML sniffing (part of detailed review of "Determining the type of a new resource in a browsing context") 17 Aug 2007.

Thanks to Jim Davis for his HTML->internet-draft tool (Makefile). Also keeping an eye on Transforming RFC2629-formatted XML through XSLT, but still grumpy that that format is so arbitrarily different from HTML.

ietf-xml-mime mailing list

Author's Address

Daniel W. Connolly
World Wide Web Consortium (W3C)
32 Vassar Street
Cambridge, MA 02139, U.S.A.
mailto:connolly@w3.org
http://www.w3.org/People/Connolly/

References

[HTML5]
HTML 5, work in progress, 10 August 2007, Hickson and Hyatt, eds.
[HTTP]
Hypertext Transfer Protocol -- HTTP/1.1, RFC 2616, June 1999.
[META]
Authoritative Metadata, W3C TAG, April 2006.