Network Working Group                                        D. Connolly
Internet-Draft                           World Wide Web Consortium (W3C)
Category: Informational                                         Aug 2007


              HTML 5 rules for determining content types

Status: Internet-draft-to-be

   I'm looking for a co-author to help route feedback from the IETF to
   the W3C HTML WG.

   @@ Please send comments to public-html-comments@w3.org

   $Revision: 1.1 $ of $Date: 2007-08-17 20:35:38 $

Introduction

   The HTTP specification [HTTP], in section 14.17 Content-Type, says
   "The Content-Type entity-header field indicates the media type of
   the entity-body sent to the recipient". The HTML 5 specification
   [HTML5] specifies an algorithm for determining content types based
   on widely deployed practices and software. These specifications
   conflict in some cases.

   (@@ extract test cases from Step 10 of Feed/HTML sniffing (part of
   detailed review of "Determining the type of a new resource in a
   browsing context"))

   According to a straightforward architecture for content types in
   the Web [META], the HTTP specification should suffice and the HTML 5
   specification need not specify another algorithm. But that
   architecture assumes that Web publishers (server administrators and
   content developers) reliably label content. Observing that labelling
   by Web publishers is widely unreliable, and that software that works
   around these problems is widespread, the choices seem to be:

   *  Convince Web publishers to fix incorrectly labelled Web content
      and label it correctly in the future.

   *  Update the HTTP specification to match widely deployed
      conventions captured in the HTML 5 draft.

   While the second option is unappealing, the first option seems
   infeasible.

   The IETF community is invited to review the HTML 5 algorithm in
   detail.

4.7.  Determining the type of a new resource in a browsing context

   It is imperative that the rules in this section be followed exactly.
   When two user agents use different heuristics for content type
   detection, security problems can occur. For example, if a server
   believes a contributed file to be an image (and thus benign), but a
   Web browser believes the content to be HTML (and thus capable of
   executing script), the end user can be exposed to malicious content,
   making the user vulnerable to cookie theft attacks and other
   cross-site scripting attacks.

   The sniffed type of a resource must be found as follows:

   1  If the resource was fetched over an HTTP protocol, and there is
      no HTTP Content-Encoding header, but there is an HTTP
      Content-Type header and it has a value whose bytes exactly match
      one of the following three lines:

      Bytes in Hexadecimal             Textual representation
      -----------------------------    ------------------------------
      74 65 78 74 2f 70 6c 61 69 6e    text/plain

      74 65 78 74 2f 70 6c 61 69 6e    text/plain; charset=ISO-8859-1
      3b 20 63 68 61 72 73 65 74 3d
      49 53 4f 2d 38 38 35 39 2d 31

      74 65 78 74 2f 70 6c 61 69 6e    text/plain; charset=iso-8859-1
      3b 20 63 68 61 72 73 65 74 3d
      69 73 6f 2d 38 38 35 39 2d 31

      ...then jump to the text or binary section below.

   2  Let official type be the type given by the Content-Type metadata
      for the resource (in lowercase, ignoring any parameters). If
      there is no such type, jump to the unknown type step below.

      ...or if the type has no slash or is */*? Probably we should also
      sniff in that case.

   3  If official type ends in "+xml", or if it is either "text/xml" or
      "application/xml", then the sniffed type of the resource is
      official type; return that and abort these steps.

   4  If official type is an image type supported by the user agent
      (e.g. "image/png", "image/gif", "image/jpeg", etc), then jump to
      the images section below.

   5  If official type is "text/html", then jump to the feed or HTML
      section below.

   6  Otherwise, the sniffed type of the resource is official type.

4.7.1.  Content-Type sniffing: text or binary

   1  The user agent may wait for 512 or more bytes of the resource to
      be available.
   2  Let n be the smaller of either 512 or the number of bytes already
      available.

   3  If n is 4 or more, and the first bytes of the file match one of
      the following byte sets:

      Bytes in Hexadecimal   Description
      --------------------   ------------------------------
      FE FF                  UTF-16BE BOM
      FF FE                  UTF-16LE BOM or UTF-32LE BOM
      00 00 FE FF            UTF-32BE BOM
      EF BB BF               UTF-8 BOM

      ...then the sniffed type of the resource is "text/plain".

      Otherwise, if any of the first n bytes of the resource are in one
      of the following byte ranges:

      *  0x00 - 0x08

      *  0x0E - 0x1A

      *  0x1C - 0x1F

      ...then the sniffed type of the resource is
      "application/octet-stream".

      Otherwise, the sniffed type of the resource is "text/plain".

4.7.2.  Content-Type sniffing: unknown type

   1  The user agent may wait for 512 or more bytes of the resource to
      be available.

   2  Let stream length be the smaller of either 512 or the number of
      bytes already available.

   3  For each row in the table below:

      1  Let pattern length be the length of the pattern (number of
         bytes described by the cell in the second column of the row).

      2  If stream length is smaller than pattern length, then skip
         this row.

      3  Apply the "and" operator to the first pattern length bytes of
         the resource and the given mask (the bytes in the cell of the
         first column of that row), and let the result be the data.

      4  If the bytes of the data match the given pattern bytes
         exactly, then the sniffed type of the resource is the type
         given in the cell of the third column in that row; abort
         these steps.

   4  As a last-ditch effort, jump to the text or binary section.

      Mask (hex):     FF FF DF DF DF DF DF DF DF FF DF DF DF DF
      Pattern (hex):  3C 21 44 4F 43 54 59 50 45 20 48 54 4D 4C
      Sniffed type:   text/html
      Comment:        The string "

      "), then increase pos by 3 and jump back to the previous step
      (step 5) in the overall algorithm in this section.

      3  Otherwise, increase pos by 1.

      4  Otherwise, return to step 2 in these substeps.

   7  If s[pos] is 0x21 (ASCII "!"):

      1  Increase pos by 1.

      2  If s[pos] equals 0x3E, then increase pos by 1 and jump back to
         step 5 in the overall algorithm in this section.

      3  Otherwise, return to step 1 in these substeps.

   8  If s[pos] is 0x3F (ASCII "?"):

      1  Increase pos by 1.

      2  If s[pos] and s[pos+1] equal 0x3F and 0x3E respectively, then
         increase pos by 1 and jump back to step 5 in the overall
         algorithm in this section.

      3  Otherwise, return to step 1 in these substeps.

   9  Otherwise, if the bytes in s starting at pos match any of the
      sequences of bytes in the first column of the following table,
      then the user agent must follow the steps given in the
      corresponding cell in the second column of the same row.

      Bytes in Hexadecimal  Requirement                Comment
      --------------------  -------------------------  ----------------
      72 73 73              The sniffed type of the    The three ASCII
                            resource is                characters "rss"
                            "application/rss+xml";
                            abort these steps

      66 65 65 64           The sniffed type of the    The four ASCII
                            resource is                characters
                            "application/atom+xml";    "feed"
                            abort these steps

      72 64 66 3A 52 44 46  Continue to the next step  The ASCII
                            in this algorithm          characters
                                                       "rdf:RDF"

      If none of the byte sequences above match the bytes in s starting
      at pos, then the sniffed type of the resource is "text/html".
      Abort these steps.

      If, before the next ">", you find two xmlns* attributes with
      http://www.w3.org/1999/02/22-rdf-syntax-ns# and
      http://purl.org/rss/1.0/ as the namespaces, then the sniffed type
      of the resource is "application/rss+xml"; abort these steps.
      (maybe we only need to check for http://purl.org/rss/1.0/
      actually)

      Otherwise, the sniffed type of the resource is "text/html".

   For efficiency reasons, implementations may wish to implement this
   algorithm and the algorithm for detecting the character encoding of
   HTML documents in parallel.

4.7.5.  Content-Type metadata

   What explicit Content-Type metadata is associated with the resource
   (the resource's type information) depends on the protocol that was
   used to fetch the resource.

   For HTTP resources, only the Content-Type HTTP header contributes
   any data; the explicit type of the resource is then the value of
   that header, interpreted as described by the HTTP specifications.
   [HTTP]

   For resources fetched from the filesystem, user agents should use
   platform-specific conventions, e.g. operating system extension/type
   mappings.

   Extensions must not be used for determining resource types for
   resources fetched over HTTP.

   For resources fetched over most other protocols, e.g. FTP, there is
   no type information.

   The algorithm for extracting an encoding from a Content-Type, given
   a string s, is as follows. It either returns an encoding or nothing.

   1  Skip characters in s up to and including the first U+003B
      SEMICOLON (;) character.

   2  Skip any U+0009, U+000A, U+000B, U+000C, U+000D, or U+0020
      characters (i.e. spaces) that immediately follow the semicolon.

   3  If the next seven characters are not 'charset', return nothing.

   4  Skip any U+0009, U+000A, U+000B, U+000C, U+000D, or U+0020
      characters that immediately follow the word 'charset' (there
      might not be any).

   5  If the next character is not a U+003D EQUALS SIGN ('='), return
      nothing.

   6  Skip any U+0009, U+000A, U+000B, U+000C, U+000D, or U+0020
      characters that immediately follow the equals sign (there might
      not be any).

   7  Process the next character as follows:

      If it is a U+0022 QUOTATION MARK ('"') and there is a later
      U+0022 QUOTATION MARK ('"') in s
         Return the string between the two quotation marks.

      If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027
      APOSTROPHE ("'") in s
         Return the string between the two apostrophes.
      If it is an unmatched U+0022 QUOTATION MARK ('"')
      If it is an unmatched U+0027 APOSTROPHE ("'")
         Return nothing.

      Otherwise
         Return the string from this character to the first U+0009,
         U+000A, U+000B, U+000C, U+000D, or U+0020 character or the
         end of s, whichever comes first.

Acknowledgements/@@Fodder

   @@more context; meanwhile, see: Step 10 of Feed/HTML sniffing (part
   of detailed review of "Determining the type of a new resource in a
   browsing context") 17 Aug 2007.

   Jim Davis for his HTML->internet-draft tool (Makefile). Also keeping
   an eye on Transforming RFC2629-formatted XML through XSLT, but still
   grumpy that that format is so arbitrarily different from HTML.

   ietf-xml-mime mailing list

Author's Address

   Daniel W. Connolly
   World Wide Web Consortium (W3C)
   32 Vassar Street
   Cambridge, MA 02139, U.S.A.

   mailto:connolly@w3.org
   http://www.w3.org/People/Connolly/

References

   [HTML5]  HTML 5, work in progress 10 August 2007, Hickson and Hyatt,
            eds.

   [HTTP]   Hypertext Transfer Protocol -- HTTP/1.1, RFC2616, June 1999

   [META]   Authoritative Metadata, W3C TAG, April 2006
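
   As an illustration only, the encoding-extraction steps of section
   4.7.5 might be sketched in Python as follows. The function name
   extract_encoding, the SPACE_CHARS constant, the use of None for
   "nothing", the case-sensitive comparison against 'charset' (seven
   characters), and the handling of a missing semicolon or of a string
   that ends right after the equals sign are all assumptions of this
   sketch; the draft does not settle them.

```python
from typing import Optional

# The characters the steps skip: U+0009..U+000D and U+0020.
SPACE_CHARS = "\t\n\x0b\x0c\r "


def extract_encoding(s: str) -> Optional[str]:
    """Hypothetical sketch of the seven steps; None stands for "nothing"."""

    def skip_spaces(i: int) -> int:
        # Advance past any run of the space characters listed above.
        while i < len(s) and s[i] in SPACE_CHARS:
            i += 1
        return i

    # Step 1: skip up to and including the first semicolon.
    semi = s.find(";")
    if semi == -1:
        return None  # assumption: no semicolon means there is no encoding
    # Step 2: skip spaces after the semicolon.
    pos = skip_spaces(semi + 1)
    # Step 3: the next seven characters must be 'charset'
    # (compared case-sensitively here, which is an assumption).
    if s[pos:pos + 7] != "charset":
        return None
    # Step 4: skip spaces after the word 'charset'.
    pos = skip_spaces(pos + 7)
    # Step 5: the next character must be an equals sign.
    if s[pos:pos + 1] != "=":
        return None
    # Step 6: skip spaces after the equals sign.
    pos = skip_spaces(pos + 1)
    # Step 7: process the next character.
    if pos >= len(s):
        return None  # assumption: nothing follows the equals sign
    first = s[pos]
    if first in ('"', "'"):
        end = s.find(first, pos + 1)
        if end == -1:
            return None  # unmatched quotation mark or apostrophe
        return s[pos + 1:end]
    # Otherwise: read up to the first space character or the end of s.
    end = pos
    while end < len(s) and s[end] not in SPACE_CHARS:
        end += 1
    return s[pos:end]
```

   Note that, as written, the steps only examine the first parameter
   after the media type, so a value such as
   "text/html; boundary=x; charset=utf-8" yields nothing.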