HTML 5: The Markup Language

4. HTML syntax

This section describes the HTML syntax in detail. In places, it also notes differences between the HTML syntax and the XML syntax, but it does not describe the XML syntax in detail (the XML syntax is instead defined by rules in the XML specification [XML] and in the Namespaces in XML 1.0 specification [XMLNS]).

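This section is divided into the following parts:

  1. The doctype
  2. Character encoding declaration
  3. Elements
  4. Attributes
  5. Text and character data
  6. Character references
  7. Comments
  8. Escaping text spans
  9. SVG and MathML elements in HTML documents
  10. CDATA sections in SVG and MathML contents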

4.01. The doctype

A doctype (sometimes capitalized as “DOCTYPE”) is a special instruction which, for legacy reasons that have to do with processing modes in browsers, is a required part of any document in the HTML syntax; it must either be a deprecated doctype, or must consist of the following parts, in exactly the following order:

  1. A "<" character.
  2. A "!" character.
  3. Any case-insensitive match for the string "DOCTYPE".
  4. One or more space characters.
  5. Any case-insensitive match for the string "HTML".
  6. Optionally, a doctype legacy string.
  7. Optionally, one or more space characters.
  8. A ">" character.

A doctype legacy string consists of the following parts, in exactly the following order:

  1. One or more space characters.
  2. Any case-insensitive match for the string "SYSTEM".
  3. One or more space characters.
  4. A quote mark, consisting of either a """ character or a "'" character.
  5. The literal string "about:legacy-compat".
  6. A matching quote mark, identical to the quote mark used earlier (either a """ character or a "'" character).

The following are examples of some conformant doctypes.

<!DOCTYPE html>
<!doctype HTML system "about:legacy-compat">
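The ordered parts listed above map fairly directly onto a regular expression. The following Python sketch is illustrative only: the name `is_conformant_doctype` is hypothetical, the set of space characters is an assumption (space, tab, LF, FF, CR), and deprecated doctypes are deliberately not covered.

```python
import re

# Assumption: "space characters" means space, tab, LF, FF and CR.
SPACE = r"[ \t\n\f\r]"

# Optional doctype legacy string: spaces, SYSTEM, spaces, then the literal
# about:legacy-compat inside a matching pair of quote marks (\1 enforces the match).
LEGACY = rf"""(?:{SPACE}+(?i:SYSTEM){SPACE}+(["'])about:legacy-compat\1)?"""

DOCTYPE = re.compile(rf"<!(?i:DOCTYPE){SPACE}+(?i:HTML){LEGACY}{SPACE}*>")


def is_conformant_doctype(text: str) -> bool:
    """Return True if text is a doctype built from the parts listed above."""
    return DOCTYPE.fullmatch(text) is not None
```

Both example doctypes above are accepted by this pattern; a doctype whose two quote marks differ is rejected because of the backreference.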

A deprecated doctype is a document type declaration as defined in the XML specification [XML], with the further restriction that it must meet one of the following sets of constraints:

The following are examples of some deprecated doctypes.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">

4.02. Character encoding declaration

A character encoding declaration is a mechanism for specifying the character encoding used to store or transmit a document.

The following restrictions apply to character encoding declarations:

If the document does not start with a U+FEFF BYTE ORDER MARK (BOM) character, and if its encoding is not explicitly given by a Content-Type HTTP header, then the character encoding used must be an ASCII-compatible character encoding, and, in addition, if that encoding isn't US-ASCII itself, then the encoding must be specified using a meta element with a charset attribute or a meta element in the encoding declaration state.

If the document contains a meta element with a charset attribute or a meta element in the encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding.

An ASCII-compatible character encoding is one that is a superset of US-ASCII (specifically, ANSI_X3.4-1968) for bytes in the set 0x09, 0x0A, 0x0C, 0x0D, 0x20 - 0x22, 0x26, 0x27, 0x2C - 0x3F, 0x41 - 0x5A, and 0x61 - 0x7A.
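As a rough illustration of this definition, the Python sketch below decodes each listed byte on its own and compares the result with the US-ASCII character for that byte. The function name `looks_ascii_compatible` is hypothetical, and decoding bytes one at a time is only an approximation of "superset of US-ASCII": it can misjudge stateful or multi-byte encodings.

```python
# Byte values named in the definition above.
CHECKED_BYTES = (
    [0x09, 0x0A, 0x0C, 0x0D, 0x26, 0x27]
    + list(range(0x20, 0x23))   # 0x20 - 0x22
    + list(range(0x2C, 0x40))   # 0x2C - 0x3F
    + list(range(0x41, 0x5B))   # 0x41 - 0x5A
    + list(range(0x61, 0x7B))   # 0x61 - 0x7A
)


def looks_ascii_compatible(encoding: str) -> bool:
    """Approximate check: every listed byte must decode to its US-ASCII character."""
    for value in CHECKED_BYTES:
        try:
            decoded = bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            # The byte is not a complete character in this encoding.
            return False
        if decoded != chr(value):
            return False
    return True
```

Under this approximation, `utf-8` and `latin-1` pass, while `utf-16` and the EBCDIC code page `cp037` do not.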

Documents must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings. [CESU8] [UTF7] [BOCU1] [SCSU]

In a document in the XML syntax, the XML declaration, as defined in the XML specification [XML], should be used to provide character-encoding information, if necessary.

4.03. Elements

An element’s content model defines the element’s structure: what contents (if any) the element can contain, as well as what attributes (if any) the element can have. The HTML elements section of this specification defines the content models for all of the elements that are part of the HTML language. An element must not contain contents or attributes that are not part of its content model.

The contents of an element are any elements, character data, and comments that it contains. Attributes and their values are not considered to be the “contents” of an element.

A void element is an element whose content model does not allow it to have contents. Void elements can have attributes.

The following is a complete list of the void elements in HTML:

The following list describes syntax rules for the HTML syntax. Rules for the XML syntax are defined in the XML specification [XML].

4.3.01. Misnested tags

If an element has both a start tag and an end tag, its end tag must be contained within the contents of the same element in which its start tag is contained. An end tag that is not contained within the same contents as its start tag is said to be a misnested tag.

In the following example, the "</i>" end tag is a misnested tag, because it is not contained within the contents of the b element that contains its corresponding "<i>" start tag.

<b>foo <i>bar</b> baz</i>
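One way to detect misnested tags mechanically is to track which elements are open and which were cut off when an ancestor closed first. The Python sketch below is a simplification under stated assumptions: the name `find_misnested_end_tags` is hypothetical, every element is assumed to have both a start tag and an end tag (no void elements, no omitted tags), and stray end tags with no start tag are ignored.

```python
import re

# Matches a start tag or end tag and captures the "/" and the tag name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def find_misnested_end_tags(markup: str) -> list:
    """Return the names of end tags that fall outside the element holding their start tag."""
    open_elements = []   # start tags seen so far, innermost last
    cut_off = []         # elements whose containing element closed before they did
    misnested = []
    for match in TAG.finditer(markup):
        is_end_tag = match.group(1) == "/"
        name = match.group(2).lower()
        if not is_end_tag:
            open_elements.append(name)
        elif name in open_elements:
            # Close the innermost open element with this name; anything opened
            # inside it that is still open has been cut off.
            index = len(open_elements) - 1 - open_elements[::-1].index(name)
            cut_off.extend(open_elements[index + 1:])
            del open_elements[index:]
        elif name in cut_off:
            # This end tag lies outside the element that contains its start tag.
            cut_off.remove(name)
            misnested.append(name)
    return misnested
```

For the example above, the sketch reports the i end tag, matching the description in the text.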

4.04. Attributes

Attributes for an element are expressed inside the element’s start tag. Attributes have a name and a value.

There must never be two or more attributes on the same start tag whose names are a case-insensitive match for each other.

The following list describes syntax rules for attributes in documents in the HTML syntax. Syntax rules for attributes in documents in the XML syntax are defined in the XML specification [XML].

In the HTML syntax, attributes can be specified in four different ways:

  1. empty attribute syntax
  2. unquoted attribute-value syntax
  3. single-quoted attribute-value syntax
  4. double-quoted attribute-value syntax

Empty attribute syntax

Certain attributes may be specified by providing just the attribute name.

In the following example, the disabled attribute is given with the empty attribute syntax:

<input disabled>

Unquoted attribute-value syntax

An unquoted attribute value is specified by providing the following parts in exactly the following order:

  1. an attribute name
  2. zero or more space characters
  3. a single "=" character
  4. zero or more space characters
  5. an attribute value

In addition to the general requirements given above for attribute values, an unquoted attribute value has the following restrictions:

  • must not contain any literal space characters
  • must not contain any """, "'", ">", or "=" characters
  • must not be the empty string

In the following example, the value attribute is given with the unquoted attribute value syntax:

<input value=yes>

If the value of an attribute using the unquoted attribute-value syntax is followed by a "/" character, then there must be at least one space character after the value and before the "/" character.

Single-quoted attribute-value syntax

A single-quoted attribute value is specified by providing the following parts in exactly the following order:

  1. an attribute name
  2. zero or more space characters
  3. a "=" character
  4. zero or more space characters
  5. a single "'" character
  6. an attribute value
  7. a "'" character

In addition to the general requirements given above for attribute values, a single-quoted attribute value has the following restriction:

  • must not contain any literal "'" characters

In the following example, the type attribute is given with the single-quoted attribute value syntax:

<input type='checkbox'>

Double-quoted attribute-value syntax

A double-quoted attribute value is specified by providing the following parts in exactly the following order:

  1. an attribute name
  2. zero or more space characters
  3. a single "=" character
  4. zero or more space characters
  5. a single """ character
  6. an attribute value
  7. a """ character

In addition to the general requirements given above for attribute values, a double-quoted attribute value has the following restriction:

  • must not contain any literal """ characters

In the following example, the title attribute is given with the double-quoted attribute value syntax:

<code title="U+003C LESS-THAN SIGN">&lt;</code>
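The four attribute syntaxes can be told apart with one pattern each. The Python sketch below is a rough classifier, not a full validator. The name `classify_attribute` is hypothetical, the attribute-name pattern is a permissive stand-in (the naming rules are not given in this section), the set of space characters is assumed, and ambiguous ampersands inside values are not checked.

```python
import re

# Assumption: "space characters" means space, tab, LF, FF and CR.
SPACE = r"[ \t\n\f\r]"
# Assumption: a permissive stand-in for the attribute-name rules.
NAME = r"""[^ \t\n\f\r"'>/=]+"""
EQUALS = rf"{SPACE}*={SPACE}*"

PATTERNS = {
    "double-quoted": re.compile(rf'''{NAME}{EQUALS}"[^"]*"'''),
    "single-quoted": re.compile(rf"""{NAME}{EQUALS}'[^']*'"""),
    # Unquoted values: non-empty, with no space, quote, ">" or "=" characters.
    "unquoted": re.compile(rf"""{NAME}{EQUALS}[^ \t\n\f\r"'>=]+"""),
    "empty": re.compile(NAME),
}


def classify_attribute(text: str):
    """Return the name of the syntax that text uses, or None if it fits none."""
    for label, pattern in PATTERNS.items():
        if pattern.fullmatch(text):
            return label
    return None
```

Applied to the attributes in the four examples above, the sketch yields "empty", "unquoted", "single-quoted" and "double-quoted" respectively; `value=` with nothing after the "=" character fits none of the four.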

4.05. Text and character data

Text in element contents (including in comments) and attribute values must consist of Unicode characters, with the following restrictions:

There are two special types of text, known as escaping text span starts and escaping text span ends, that can occur within certain elements.

Character data contains text, in some cases in combination with character references, along with certain additional restrictions. There are three types of character data that can occur in documents:

  1. normal character data
  2. replaceable character data
  3. non-replaceable character data

Normal character data

Certain elements and strings in the values of particular attributes contain normal character data. Normal character data can contain the following:

Normal character data has the following restrictions:

Replaceable character data

In documents in the HTML syntax, the title and textarea elements can contain replaceable character data. Replaceable character data can contain the following:

Replaceable character data has the following restrictions:

  • must not contain any ambiguous ampersands
  • must not contain any occurrences of the string "</" followed by characters that are a case-insensitive match for the tag name of the element containing the replaceable character data (for example, "</title" or "</textarea"), followed by one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+0020 SPACE, ">", or "/", unless that string is part of an escaping text span.

Replaceable character data, as defined in this specification, is a feature of the HTML syntax that is not available in the XML syntax. Documents in the XML syntax must not contain replaceable character data as defined in this specification; instead they must conform to all syntax constraints defined in the XML specification [XML].

Non-replaceable character data

In documents in the HTML syntax, the script and style elements can contain non-replaceable character data. Non-replaceable character data can contain the following:

Non-replaceable character data has the following restrictions:

  • must not contain character references
  • must not contain any occurrences of the string "</" followed by characters that are a case-insensitive match for the tag name of the element containing the non-replaceable character data (for example, "</script" or "</style"), followed by one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+0020 SPACE, ">", or "/", unless that string is part of an escaping text span.

Non-replaceable character data, as defined in this specification, is a feature of the HTML syntax that is not available in the XML syntax. Documents in the XML syntax must not contain non-replaceable character data as defined in this specification; instead they must conform to all syntax constraints defined in the XML specification [XML].
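The "</" restriction shared by replaceable and non-replaceable character data can be checked with a short scan. The Python sketch below is illustrative: the name `premature_end_tags` is hypothetical, and escaping text spans are approximated as running from "<!--" to the next "-->", where the two delimiters may share "-" characters.

```python
import re

# An escaping text span: "<!--" up to the next "-->"; the two delimiters
# may share their "-" characters, as in "<!-->" or "<!--->".
ESCAPING_SPAN = re.compile(r"<!--(?:-?>|.*?-->)", re.DOTALL)


def premature_end_tags(text: str, tag_name: str) -> list:
    """Return offsets of forbidden "</tag_name" sequences outside escaping text spans."""
    spans = [match.span() for match in ESCAPING_SPAN.finditer(text)]
    # "</", the tag name (any case), then tab, LF, FF, space, "/" or ">".
    forbidden = re.compile(r"</" + re.escape(tag_name) + r"[\t\n\f />]", re.IGNORECASE)
    offsets = []
    for match in forbidden.finditer(text):
        inside_span = any(
            start <= match.start() and match.end() <= end for start, end in spans
        )
        if not inside_span:
            offsets.append(match.start())
    return offsets
```

For instance, script contents such as `var s = "</script>";` would be flagged at the offset of the "<" character, whereas the same string inside an escaping text span would not.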

4.06. Character references

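Character references are a form of markup for representing individual characters. There are three types of character references:

  1. named character references
  2. decimal numeric character references
  3. hexadecimal numeric character references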

Named character reference

Named character references consist of the following parts in exactly the following order:

  1. An "&" character.
  2. One of the entity names defined in XML Entity definitions for Characters [Entities], using the same case.
  3. A ";" character.

The following is an example of a named character reference for the character "†" (U+2020 DAGGER).

&dagger;

Decimal numeric character reference

Decimal numeric character references consist of the following parts, in exactly the following order:

  1. An "&" character.
  2. A "#" character.
  3. One or more digits in the range 0–9, representing a base-ten integer that itself is a Unicode code point that is not U+0000, U+000D, in the range U+0080–U+009F, or in the range 0xD800–0xDFFF (surrogates).
  4. A ";" character.

The following is an example of a decimal numeric character reference for the character "†" (U+2020 DAGGER).

&#8224;

Hexadecimal numeric character reference

Hexadecimal numeric character references consist of the following parts, in exactly the following order:

  1. An "&" character.
  2. A "#" character.
  3. Either an "x" character or an "X" character.
  4. One or more digits in the range 0–9, a–f, and A–F, representing a base-sixteen integer that itself is a Unicode code point that is not U+0000, U+000D, in the range U+0080–U+009F, or in the range 0xD800–0xDFFF (surrogates).
  5. A ";" character.

The following is an example of a hexadecimal numeric character reference for the character "†" (U+2020 DAGGER).

&#x2020;
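The decimal and hexadecimal forms differ only in how the digits are read, so a single helper can parse both and apply the code-point restrictions. The Python sketch below is illustrative: the name `numeric_reference_code_point` is hypothetical, and the upper bound of 0x10FFFF is this sketch's reading of "a Unicode code point".

```python
import re

# "&#" + decimal digits + ";"   or   "&#" + "x"/"X" + hexadecimal digits + ";"
NUMERIC_REFERENCE = re.compile(r"&#(?:([0-9]+)|[xX]([0-9A-Fa-f]+));")


def numeric_reference_code_point(reference: str):
    """Return the code point a numeric character reference names, or None if disallowed."""
    match = NUMERIC_REFERENCE.fullmatch(reference)
    if match is None:
        return None
    decimal, hexadecimal = match.groups()
    code_point = int(decimal) if decimal is not None else int(hexadecimal, 16)
    forbidden = (
        code_point in (0x0000, 0x000D)
        or 0x0080 <= code_point <= 0x009F
        or 0xD800 <= code_point <= 0xDFFF   # surrogates
        or code_point > 0x10FFFF            # beyond the Unicode code space
    )
    return None if forbidden else code_point
```

Both `&#8224;` and `&#x2020;` yield the code point 0x2020 (DAGGER), while references such as `&#0;` or `&#xD800;` yield None.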

Character references are not themselves text, and no part of a character reference is text.

An ambiguous ampersand is an "&" character that is followed by some text other than a space character, a "<" character, or another "&" character.
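The definition of an ambiguous ampersand can be sketched as a scan over "&" characters that skips anything shaped like a character reference. In the Python sketch below, the name `find_ambiguous_ampersands` is hypothetical, the set of space characters is assumed, and, as a loud simplification, any `&name;` sequence is treated as a defined named character reference without consulting the list of entity names.

```python
import re

# Assumption: any "&name;" counts as a defined named character reference.
REFERENCE = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")
# Characters that may follow "&" without making it ambiguous.
HARMLESS_FOLLOWERS = set(" \t\n\f\r<&")


def find_ambiguous_ampersands(text: str) -> list:
    """Return the offsets of "&" characters that look ambiguous."""
    offsets = []
    index = text.find("&")
    while index != -1:
        reference = REFERENCE.match(text, index)
        if reference:
            # Part of a character reference, so not text at all; skip past it.
            index = text.find("&", reference.end())
            continue
        following = text[index + 1:index + 2]
        if following and following not in HARMLESS_FOLLOWERS:
            offsets.append(index)
        index = text.find("&", index + 1)
    return offsets
```

With this sketch, the "&" in `AT&T` is reported, while the "&" characters in `AT & T` and `a &amp; b` are not.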

4.07. Comments

Comments consist of the following parts, in exactly the following order:

  1. the comment start delimiter "<!--"
  2. text
  3. the comment end delimiter "-->"

The text part of comments has the following restrictions:

The following is an example of a comment.

<!-- main content starts here -->

4.08. Escaping text spans

An escaping text span is a span of text that starts with an escaping text span start that is not itself in an escaping text span, and ends at the next escaping text span end. Escaping text spans have the following restriction:

An escaping text span start is the text string "<!--".

An escaping text span end is the text string "-->".

An escaping text span start may share its "-" characters with its corresponding escaping text span end.

The text in style, script, title, and textarea elements must not have an escaping text span start that is not followed by an escaping text span end.

The following is an example of an escaping text span within a style element.

<style>
<!--
dfn { font-weight: bold; color: brown; }
-->
</style>
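The requirement that every escaping text span start be followed by an escaping text span end can be checked with a simple forward scan. The Python sketch below is illustrative; the name `has_unclosed_escaping_span` is hypothetical, and the input is assumed to be the text of a single style, script, title, or textarea element.

```python
def has_unclosed_escaping_span(text: str) -> bool:
    """Return True if an escaping text span start has no following span end."""
    position = 0
    while True:
        start = text.find("<!--", position)
        if start == -1:
            return False
        # Searching from start + 2 lets the end delimiter share the "-"
        # characters of the start delimiter, as in "<!-->".
        end = text.find("-->", start + 2)
        if end == -1:
            return True
        position = end + 3
```

The contents of the style element in the example above pass this check; text such as `<!-- never closed` does not.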

4.09. SVG and MathML elements in HTML documents

SVG and MathML elements are elements from the SVG and MathML namespaces. SVG and MathML elements can be used both in documents in the HTML syntax and in documents in the XML syntax. Syntax rules for SVG and MathML elements in documents in the XML syntax are defined in the XML specification [XML]. The following list describes additional syntax rules that specifically apply to SVG and MathML elements in documents in the HTML syntax.

4.10. CDATA sections in SVG and MathML contents

CDATA sections in SVG and MathML contents in documents in the HTML syntax consist of the following parts, in exactly the following order:

  1. the CDATA start delimiter "<![CDATA["
  2. text, with the additional restriction that the text must not contain the string "]]>"
  3. the CDATA end delimiter "]]>"

CDATA sections are allowed only in the contents of elements from the SVG and MathML namespaces.

The following shows an example of a CDATA section.

<annotation encoding="text/latex">
  <![CDATA[\documentclass{article}
  \begin{document}
  \title{E}
  \maketitle
  The base of the natural logarithms, approximately 2.71828.
  \end{document}]]>
</annotation>
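Because the text of a CDATA section cannot contain "]]>", the first "]]>" after the start delimiter always ends the section, which a non-greedy pattern captures directly. The Python sketch below is illustrative: the name `cdata_texts` is hypothetical, and it does not verify that a section appears inside an element from the SVG or MathML namespace.

```python
import re

# Start delimiter, text up to the first "]]>", end delimiter.
CDATA_SECTION = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def cdata_texts(markup: str) -> list:
    """Return the text part of each CDATA section found in markup."""
    return [match.group(1) for match in CDATA_SECTION.finditer(markup)]
```

Applied to the annotation example above, this would return a single string holding the LaTeX source between the two delimiters.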