3. Documents
This section defines the term
document,
and provides additional details related to the definition of
that term. It is divided into the following parts:
3.01. The HTML language and HTML and XML syntaxes
The term
document is used in this specification
to mean an instance of the
HTML language.
The
HTML language is the language
described in this specification; it is an abstract language that
applications can potentially represent in memory in any number
of possible ways, and that can be transmitted using any number
of possible concrete syntaxes.
This specification makes
reference to two particular concrete syntaxes for the
HTML language:
one syntax, which
is referred to throughout this specification as
the HTML syntax,
and another syntax, which is referred to throughout this
specification as
the XML syntax.
Web browsers typically implement two separate parsers for
processing documents: an
HTML parser
which is invoked when processing documents in the
HTML syntax, and an
XML parser
which is invoked when processing documents in the
XML syntax.
The HTML syntax
is the syntax described in the
HTML syntax
section of this specification.
The
XML syntax
is defined by rules in the XML specification
[XML]
and in the Namespaces in XML 1.0 specification
[XMLNS].
Beyond the requirements defined in those specifications,
this specification does not define any additional syntax-level
requirements for
documents in the XML syntax.
4. HTML syntax
This section describes
the HTML syntax
in detail. In places, it also notes differences between
the HTML syntax
and
the XML syntax,
but it does not describe the XML syntax in detail (the XML
syntax is instead defined by rules in the XML specification
[XML]
and in the Namespaces in XML 1.0 specification
[XMLNS]).
This section is divided into the following parts:
4.01. The doctype
A
doctype
(sometimes capitalized as “DOCTYPE”) is a special instruction
which, for legacy reasons that have to do with processing
modes in browsers, is a required part of any
document in the HTML syntax;
it must either be a
deprecated doctype,
or must consist of the following parts, in exactly the
following order:
- A
"
<"
character.
- A
"
!"
character.
- Any
case-insensitive match
for the string
"
DOCTYPE".
- One or more
space characters.
- Any
case-insensitive match
for the string
"
HTML".
- Optionally, a
doctype legacy string.
- Optionally, one or more
space characters.
- A
"
>"
character.
A
doctype legacy string
consists of the following parts, in exactly the following
order:
- One or more
space characters.
- Any
case-insensitive match
for the string
"
SYSTEM".
- One or more
space characters.
- A quote mark, consisting of either
a
"
""
character or a
"'"
character.
- The literal string
"
about:legacy-compat".
- A matching quote mark, identical to the
quote mark used earlier (either a
"
""
character or a
"'"
character).
The following are examples of some conformant
doctypes.
<!DOCTYPE html>
<!doctype HTML system "about:legacy-compat">
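The ordered parts listed above map naturally onto a regular expression. The following is a minimal sketch in Python, not part of this specification; it covers only the non-deprecated form, and it assumes that "space characters" means U+0020, U+0009, U+000A, U+000C and U+000D. The names SPACE, DOCTYPE_RE and is_conformant_doctype are hypothetical.

```python
import re

# Assumption: the space characters are space, tab, LF, FF and CR.
SPACE = r"[ \t\n\f\r]"

# "<", "!", DOCTYPE, spaces, HTML, optional legacy string, optional spaces, ">".
DOCTYPE_RE = re.compile(
    rf"""<!(?i:DOCTYPE){SPACE}+(?i:HTML)"""
    rf"""(?:{SPACE}+(?i:SYSTEM){SPACE}+(?P<q>["'])about:legacy-compat(?P=q))?"""
    rf"""{SPACE}*>"""
)


def is_conformant_doctype(text: str) -> bool:
    """Return True if text is a doctype of the non-deprecated form."""
    return DOCTYPE_RE.fullmatch(text) is not None


print(is_conformant_doctype("<!DOCTYPE html>"))
print(is_conformant_doctype('<!doctype HTML system "about:legacy-compat">'))
```

The backreference (?P=q) enforces the requirement that the closing quote mark is identical to the opening one.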
A
deprecated doctype
is a
document type declaration
as defined in the XML specification
[XML],
with the further restriction that it must meet one of the
following sets of constraints:
- The
document type declaration’s
name part is a
case-insensitive match
for the string
"
HTML",
its public identifier is an exact match for the literal string
"-//W3C//DTD HTML 4.0//EN",
and its system identifier is either missing or is an exact
match for the literal string
"http://www.w3.org/TR/REC-html40/strict.dtd".
- The
document type declaration’s
name part is a
case-insensitive match
for the string
"
HTML",
its public identifier is an exact match for the literal string
"-//W3C//DTD HTML 4.01//EN",
and its system identifier is either missing or is an exact
match for the literal string
"http://www.w3.org/TR/html4/strict.dtd".
- The
document type declaration’s
name part is a
case-insensitive match
for the string
"
HTML",
its public identifier is an exact match for the literal string
"-//W3C//DTD XHTML 1.0 Strict//EN",
and its system identifier is either missing or is an exact
match for the literal string
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd".
- The
document type declaration’s
name part is a
case-insensitive match
for the string
"
HTML",
its public identifier is an exact match for the literal string
"-//W3C//DTD XHTML 1.1//EN",
and its system identifier is either missing or is an exact
match for the literal string
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd".
The following are examples of some
deprecated doctypes.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
4.02. Character encoding declaration
A
character encoding declaration
is a mechanism for specifying the character encoding used to store
or transmit a document.
The following restrictions apply to character encoding
declarations:
- The character encoding name given must be the name of
the character encoding used to serialize the file.
- The value must be a valid character encoding name, and
must be the preferred name for that encoding.
[IANACHARSET]
- The character encoding declaration must be serialized
without the use of
character references
or character escapes of any kind.
- The element containing the character encoding
declaration must be serialized completely within the first
512 bytes of the document.
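The 512-byte restriction can be checked mechanically on the raw bytes of a document. The sketch below is illustrative only: it assumes the declaration is carried by a meta start tag that mentions "charset", and it uses a simple byte-level pattern rather than a real HTML parser. The names META_CHARSET_RE and declaration_within_first_512_bytes are hypothetical.

```python
import re

# Hypothetical pattern: any meta start tag that mentions "charset". This is
# an approximation, not a substitute for parsing the document.
META_CHARSET_RE = re.compile(rb"<meta\b[^>]*\bcharset\b[^>]*>", re.IGNORECASE)


def declaration_within_first_512_bytes(document: bytes) -> bool:
    """Return True if a meta declaration is found and ends by byte 512."""
    match = META_CHARSET_RE.search(document)
    return match is not None and match.end() <= 512


print(declaration_within_first_512_bytes(b'<!DOCTYPE html><meta charset="utf-8">'))
```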
If the document does not start with a
U+FEFF BYTE ORDER MARK (BOM) character, and if its
encoding is not explicitly given by a
Content-Type HTTP header, then the character
encoding used must be an
ASCII-compatible character encoding,
and, in addition, if that encoding isn't US-ASCII itself, then
the encoding must be specified using a
meta element with a
charset
attribute or a meta element
in the
encoding declaration
state.
If the document contains a meta
element with a
charset
attribute or a meta element in the
encoding declaration state,
then the character encoding used must be an
ASCII-compatible character encoding.
An
ASCII-compatible character encoding
is one that is a superset of US-ASCII (specifically,
ANSI_X3.4-1968) for bytes in the set 0x09, 0x0A, 0x0C, 0x0D,
0x20 - 0x22, 0x26, 0x27, 0x2C - 0x3F, 0x41 - 0x5A, and 0x61 -
0x7A.
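As a rough illustration of this definition, the following Python sketch tests whether each byte in the listed set, decoded on its own, yields the same character as it would in US-ASCII. This is only a heuristic under stated assumptions: it relies on the codecs that ship with Python, and decoding single bytes in isolation does not prove anything about stateful or multi-byte encodings. The names ASCII_COMPATIBLE_BYTES and looks_ascii_compatible are hypothetical.

```python
# The byte set from the definition above: single values plus inclusive ranges.
SINGLES = [0x09, 0x0A, 0x0C, 0x0D, 0x26, 0x27]
RANGES = [(0x20, 0x22), (0x2C, 0x3F), (0x41, 0x5A), (0x61, 0x7A)]
ASCII_COMPATIBLE_BYTES = sorted(
    set(SINGLES) | {b for lo, hi in RANGES for b in range(lo, hi + 1)}
)


def looks_ascii_compatible(encoding: str) -> bool:
    """Heuristic: every listed byte decodes alone to its US-ASCII character."""
    for byte in ASCII_COMPATIBLE_BYTES:
        try:
            if bytes([byte]).decode(encoding) != chr(byte):
                return False
        except UnicodeDecodeError:
            return False
    return True


print(looks_ascii_compatible("utf-8"))
print(looks_ascii_compatible("utf-16"))
```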
Documents must not use the CESU-8, UTF-7, BOCU-1 and SCSU
encodings.
[CESU8]
[UTF7]
[BOCU1]
[SCSU]
In a
document in the XML syntax,
the
XML declaration,
as defined in the XML specification
[XML]
should be used to provide character-encoding information, if
necessary.
4.03. Elements
An element’s
content model
defines the element’s structure: what
contents (if any) the element can
contain, as well as what attributes (if any) the element can
have. The
HTML elements
section of this specification defines the content models for
all of the elements that are part of the
HTML language.
An element must not contain
contents
or attributes that are not part of its content model.
The
contents
of an element are any
elements,
character data,
and
comments
that it contains.
Attributes and their values are not considered to be the
“contents” of an element.
A
void element
is an element whose content model
does not allow it to have
contents.
Void elements can have attributes.
The following is a complete list of the
void elements in HTML:
-
area,
base,
br,
col,
command,
embed,
hr,
img,
input,
link,
meta,
param,
source
The following list describes syntax
rules for
the HTML syntax.
Rules for
the XML syntax
are defined in the XML specification
[XML].
-
Tags are used to
delimit the start and end of elements in markup. Elements
have a
start tag
to indicate where they begin. Non-void elements have an
end tag
to
indicate where they end.
-
Tag names
are used within element start tags and end tags to
give the element’s name.
HTML elements
all have names that only use characters in the range
0–9,
a–z,
and
A–Z.
-
Start tags
consist of the following parts, in exactly the following
order:
- A
"
<"
character.
- The element’s
tag name.
- Optionally, one or more
attributes,
each of which must be preceded by one or more
space characters.
- Optionally, one or more
space characters.
- Optionally, a
"
/"
character, which may be present only if the element is a
void element.
- A
"
>"
character.
-
End tags
consist of the following parts, in exactly the following
order:
- A
"
<"
character.
- A
"
/"
character.
- The element’s
tag name.
- Optionally, one or more
space characters.
- A
"
>"
character.
-
Void elements only have a
start tag; end tags must not be specified for void
elements.
- The start and end tags of certain elements can be omitted.
The subsection for each element in the HTML elements section of this
specification provides information about which tags (if any)
can be omitted for that particular element.
- A non-void element must have
an end tag, unless the subsection for that element in the HTML elements section of this
specification indicates that its end tag can be omitted.
- The contents of an element must be
placed after its start tag (which
might be
implied, in certain cases) and before its end tag
(which might be
implied, in certain cases).
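The start tag, end tag and void element rules above can be illustrated with a small serializer. This is a sketch under stated assumptions, not a conformance tool: it always emits double-quoted attribute values, never omits optional tags, does not validate tag or attribute names, and treats the contents argument as already serialized markup. The names VOID_ELEMENTS and serialize_element are hypothetical.

```python
from html import escape

# The void elements listed in this section.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "command", "embed", "hr",
    "img", "input", "link", "meta", "param", "source",
})


def serialize_element(name, attributes=None, contents=""):
    """Build '<name attrs>contents</name>', or a lone start tag if void."""
    parts = [f"<{name}"]
    for attr_name, value in (attributes or {}).items():
        # Each attribute is preceded by a space character.
        parts.append(f' {attr_name}="{escape(value, quote=True)}"')
    parts.append(">")
    start_tag = "".join(parts)
    if name.lower() in VOID_ELEMENTS:
        if contents:
            raise ValueError(f"void element {name!r} cannot have contents")
        return start_tag  # no end tag for void elements
    return f"{start_tag}{contents}</{name}>"


print(serialize_element("input", {"type": "checkbox"}))
print(serialize_element("b", contents="foo"))
```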
4.3.01. Misnested tags
If an
element
has both a
start tag
and an
end tag,
its end tag must be contained within the
contents
of the same element in which its start tag is contained.
An
end tag
that is not contained within the same
contents
as its
start tag
is said to be a
misnested tag.
In the following example, the
"</i>"
end tag
is a
misnested tag,
because it is not contained
within the
contents
of the
b
element that contains its corresponding
"<i>"
start tag.
<b>foo <i>bar</b> baz</i>
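Misnesting of this kind can be detected with a stack of open elements. The Python sketch below only reports whether some end tag fails to match the most recently opened element; it does not identify which tag this specification would label as the misnested one, it also flags end tags that have no start tag at all, and it assumes markup in which no tags are omitted. The names TAG_RE, VOID_ELEMENTS and has_misnested_tags are hypothetical.

```python
import re

VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "command", "embed", "hr",
    "img", "input", "link", "meta", "param", "source",
})

# Simplified tag matcher: optional "/", a tag name, then anything up to ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z0-9]+)[^>]*>")


def has_misnested_tags(markup: str) -> bool:
    """Return True if an end tag does not close the innermost open element."""
    open_elements = []
    for match in TAG_RE.finditer(markup):
        is_end_tag = match.group(1) == "/"
        name = match.group(2).lower()
        if not is_end_tag:
            if name not in VOID_ELEMENTS:
                open_elements.append(name)
        elif not open_elements or open_elements[-1] != name:
            return True
        else:
            open_elements.pop()
    return False


print(has_misnested_tags("<b>foo <i>bar</b> baz</i>"))
print(has_misnested_tags("<b>foo <i>bar</i> baz</b>"))
```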
4.04. Attributes
Attributes
for an element are expressed inside the element’s start
tag. Attributes have a
name
and a
value.
There must never be two or more attributes on the same
start tag whose names are a
case-insensitive match
for each other.
The following list describes syntax
rules for attributes in
documents in the HTML syntax.
Syntax rules for attributes in
documents in the XML syntax
are defined in
the XML specification [XML].
-
Attribute names
must consist of one or more characters other than the
space characters,
U+0000 NULL,
"
"",
"'",
">",
"/",
"=",
the control characters,
and any characters that are not defined by Unicode.
-
XML-compatible
attribute names are those that match the
Name production defined in
the XML specification [XML]
and that contain no
":"
characters, and whose first three characters are not a
case-insensitive match
for the string "xml".
-
Attribute values, in
general, are
normal character data;
however, the HTML elements section
of this specification defines further restrictions on the
allowed values of all attributes that are part of the
HTML language.
An attribute must not have a value that is not allowed by
the
content model
of the element that contains it.
In the HTML syntax,
attributes can be specified in four different ways:
- empty attribute syntax
- unquoted attribute-value syntax
- single-quoted attribute-value syntax
- double-quoted attribute-value syntax
- Empty attribute syntax
-
Certain attributes may be specified by providing just the
attribute name.
In the following example, the
disabled
attribute is given with the empty attribute
syntax:
<input disabled>
- Unquoted attribute-value syntax
-
An
unquoted attribute value
is specified by providing the following parts in exactly
the following order:
- an
attribute name
- zero or more
space characters
- a single
"
="
character
- zero or more
space characters
- an
attribute value
In addition to the general requirements given above for
attribute values, an unquoted attribute value has the
following restrictions:
- must not contain any literal
space characters
- must not contain any
"
"",
"'",
">",
or
"="
characters
- must not be the empty string
In the following example, the
value
attribute is given with the unquoted attribute value
syntax:
<input value=yes>
If the value of an attribute using the unquoted
attribute syntax is followed by a
"/"
character, then there must be at least one
space character
after the value and before the
"/"
character.
- Single-quoted attribute-value syntax
-
A
single-quoted attribute value
is specified by providing the following parts in exactly
the following order:
- an
attribute name
- zero or more
space characters
- a
"
="
character
- zero or more
space characters
- a single
"
'"
character
- an
attribute value
- a
"
'"
character.
In addition to the general requirements given above
for attribute values, a single-quoted attribute value
has the following restriction:
- must not contain any literal
"
'"
characters
In the following example, the
type attribute
is given with the single-quoted attribute value
syntax:
<input type='checkbox'>
- Double-quoted attribute-value syntax
-
A
double-quoted attribute value
is specified by providing the following parts in exactly
the following order:
- an
attribute name
- zero or more
space characters
- a single
"
="
character
- zero or more
space characters
- a single
"
""
character
- an
attribute value
- a
"
""
character
In addition to the general requirements given above for
attribute values, a double-quoted attribute value has
the following restriction:
- must not contain any literal
"
""
characters
In the following example, the
title attribute is
given with the double-quoted attribute value syntax:
<code title="U+003C LESS-THAN SIGN">&lt;</code>
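The four attribute syntaxes can be told apart with one pattern each. The following Python sketch is an approximation under stated assumptions: it uses the regular-expression class \s as a stand-in for "space characters", it excludes only ASCII control characters from attribute names, and it checks none of the general requirements on attribute values (such as ambiguous ampersands). The names NAME, PATTERNS and classify_attribute are hypothetical.

```python
import re

# Approximation of an attribute name: no spaces, quotes, ">", "/", "=",
# NULL or ASCII control characters.
NAME = r"""[^\s"'>/=\x00-\x1f\x7f]+"""

PATTERNS = [
    ("empty", re.compile(NAME)),
    ("double-quoted", re.compile(rf'{NAME}\s*=\s*"[^"]*"')),
    ("single-quoted", re.compile(rf"{NAME}\s*=\s*'[^']*'")),
    # Unquoted values: non-empty, with no spaces, quotes, ">" or "=".
    ("unquoted", re.compile(rf"""{NAME}\s*=\s*[^\s"'>=]+""")),
]


def classify_attribute(attribute: str):
    """Return the name of the syntax the attribute uses, or None."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(attribute):
            return label
    return None


for sample in ["disabled", "value=yes", "type='checkbox'",
               'title="U+003C LESS-THAN SIGN"']:
    print(sample, "->", classify_attribute(sample))
```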
4.05. Text and character data
Text
in
element contents
(including in
comments)
and
attribute values
must consist of Unicode characters, with the following
restrictions:
- must not contain U+0000 characters
- must not contain permanently undefined Unicode characters
- must not contain control characters other than
space characters
There are two special types of
text,
known as
escaping text span starts
and
escaping text span ends,
that can occur within certain elements.
Character data contains
text, in some cases in combination with
character references,
along with certain additional restrictions. There are three
types of character data that can occur in documents:
- normal character data
- replaceable character data
- non-replaceable character data
- Normal character data
-
Certain elements and strings in the values of
particular attributes contain normal character data.
Normal character data can contain the following:
Normal character data has the following restrictions:
- Replaceable character data
-
In
documents in the HTML syntax,
the
title
and
textarea
elements can contain replaceable character data.
Replaceable character data can contain the following:
Replaceable character data has the following restrictions:
- must not contain any
ambiguous ampersands
- must not contain any occurrences of the string
"
</"
followed by characters that are a
case-insensitive match
for the tag name of the element containing the
replaceable character data (for example,
"</title" or
"</textarea"),
followed by one of
U+0009 CHARACTER TABULATION,
U+000A LINE FEED (LF),
U+000C FORM FEED (FF),
U+0020 SPACE,
">",
or
"/",
unless that string is part of an
escaping text span.
Replaceable character data,
as defined in this specification, is a feature of
the HTML syntax
that is not available in
the XML syntax.
Documents in the XML
syntax must not contain replaceable character data
as defined in this specification; instead they must
conform to all syntax constraints defined in the XML
specification [XML].
- Non-replaceable character data
-
In
documents in the HTML syntax,
the
script
and
style
elements can contain non-replaceable character data.
Non-replaceable character data can contain the
following:
Non-replaceable character data has the following restrictions:
- must not contain character references
- must not contain any occurrences of the string
"
</",
followed by characters that are a
case-insensitive match
for the tag name of the element containing the
non-replaceable character data (for example,
"</script"
or
"</style"),
followed by one of
U+0009 CHARACTER TABULATION,
U+000A LINE FEED (LF),
U+000C FORM FEED (FF),
U+0020 SPACE,
">",
or
"/",
unless that string is part of an
escaping text span.
Non-replaceable character data,
as defined in this specification, is a feature of
the HTML syntax
that is not available in
the XML syntax.
Documents in the XML
syntax must not contain non-replaceable character
data as defined in this specification; instead they must
conform to all syntax constraints defined in the XML
specification [XML].
4.06. Character references
Character references are a form
of markup for representing individual characters. There
are three types of character references:
- Named character reference
-
Named character references consist of the following
parts in exactly the following order:
- An
"
&"
character.
- One of the entity names defined in
XML Entity definitions for Characters
[Entities],
using the same case.
- A
"
;"
character.
The following is an example of a named character
reference for the character
"†"
(U+2020 DAGGER).
&dagger;
- Decimal numeric character reference
-
Decimal numeric character references consist of the
following parts, in exactly the following order:
- An
"
&"
character.
- A
"
#"
character.
- One or more digits in the range
0–9,
representing a base-ten integer that itself is a Unicode
code point that is not
U+0000,
U+000D,
in the range U+0080–U+009F,
or in the range 0xD800–0xDFFF (surrogates).
- A
"
;"
character.
The following is an example of a decimal numeric
character reference for the character
"†"
(U+2020 DAGGER).
&#8224;
- Hexadecimal numeric character reference
-
Hexadecimal numeric character references consist of
the following parts, in exactly the following order.
- An
"
&"
character.
- A
"
#"
character.
- Either a
"
x"
character
or a
"X"
character.
- One or more digits in the range
0–9,
a–f,
and
A–F,
representing a base-sixteen integer that itself is a
Unicode code point that is not
U+0000,
U+000D,
in the range U+0080–U+009F,
or in the range 0xD800–0xDFFF (surrogates).
- A
"
;"
character.
The following is an example of a hexadecimal numeric
character reference for the character
"†"
(U+2020 DAGGER).
&#x2020;
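The code point restrictions on decimal and hexadecimal numeric character references can be checked as follows. This Python sketch is illustrative only: it takes the surrogate range to be U+D800 to U+DFFF, and it additionally rejects values above U+10FFFF on the assumption that such values are not Unicode code points. The names NUMERIC_REF_RE and is_valid_numeric_reference are hypothetical.

```python
import re

# "&", "#", then either [xX] plus hex digits, or decimal digits, then ";".
NUMERIC_REF_RE = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def is_valid_numeric_reference(reference: str) -> bool:
    """Return True if reference is a permitted numeric character reference."""
    match = NUMERIC_REF_RE.fullmatch(reference)
    if match is None:
        return False
    if match.group(1) is not None:
        code_point = int(match.group(1), 16)
    else:
        code_point = int(match.group(2), 10)
    if code_point > 0x10FFFF:  # assumption: beyond the Unicode range
        return False
    if code_point in (0x0000, 0x000D):
        return False
    if 0x0080 <= code_point <= 0x009F:
        return False
    if 0xD800 <= code_point <= 0xDFFF:  # surrogates
        return False
    return True


# Both forms of U+2020 DAGGER are accepted; U+0000 is not.
print(is_valid_numeric_reference("&#8224;"))
print(is_valid_numeric_reference("&#x2020;"))
print(is_valid_numeric_reference("&#0;"))
```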
Character references
are not themselves
text,
and no part of a character reference is
text.
An
ambiguous ampersand
is an
"&"
character that is followed by some
text
other than a
space character,
a
"<"
character, or another
"&"
character.
4.08. Escaping text spans
An
escaping text span
is a span of
text
that starts with an
escaping text span start
that is not itself in an
escaping text span,
and ends at the next
escaping text span end.
Escaping text spans have the following restriction:
An
escaping text span start
is the
text
string
"<!--".
An
escaping text span end
is the
text
string
"-->".
An
escaping text span start
may share its
"-"
characters with its corresponding
escaping text span end.
The text in
style,
script,
title,
and
textarea
elements must not have an
escaping text span start
that is not followed by an
escaping text span end.
The following is an example of an
escaping text span
within a
style element.
<style>
<!--
dfn { font-weight: bold; color: brown; }
-->
</style>
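The requirement that every escaping text span start is followed by an escaping text span end can be sketched as a simple scan. The Python below is a minimal illustration that operates on the text of a single element; searching for the end from two characters past the start is how it models the rule that a start may share its "-" characters with its end. The name has_unclosed_escaping_span is hypothetical.

```python
def has_unclosed_escaping_span(text: str) -> bool:
    """Return True if some "<!--" is not followed by a closing "-->"."""
    position = 0
    while True:
        start = text.find("<!--", position)
        if start == -1:
            return False  # no further span starts
        # Begin at start + 2 so that "<!-->" counts as a complete span.
        end = text.find("-->", start + 2)
        if end == -1:
            return True
        position = end + 3


style_text = "\n<!--\ndfn { font-weight: bold; color: brown; }\n-->\n"
print(has_unclosed_escaping_span(style_text))
print(has_unclosed_escaping_span("<!-- dfn { color: brown; }"))
```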
4.10. CDATA sections in SVG and MathML contents
CDATA sections in SVG and MathML contents
in
documents in the HTML syntax
consist of the following parts, in exactly the following
order:
- the
CDATA start delimiter
"
<![CDATA["
-
text, with the
additional restriction that the text must not contain the
string
"
]]>"
- the
CDATA end delimiter
"
]]>"
CDATA sections are allowed only in the contents of elements
from the SVG and MathML namespaces.
The following shows an example of a CDATA section.
<annotation encoding="text/latex">
<![CDATA[\documentclass{article}
\begin{document}
\title{E}
\maketitle
The base of the natural logarithms, approximately 2.71828.
\end{document}]]>
</annotation>
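The three-part structure of a CDATA section, together with the restriction on its text, can be expressed as a short helper. This is a sketch only: it does not check that the section appears inside an element from the SVG or MathML namespaces, and the name cdata_section is hypothetical.

```python
def cdata_section(text: str) -> str:
    """Wrap text in the CDATA start and end delimiters."""
    if "]]>" in text:
        # The text part must not contain the end delimiter.
        raise ValueError('CDATA text must not contain "]]>"')
    return "<![CDATA[" + text + "]]>"


print(cdata_section(r"\title{E}"))
```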
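4.07. Comments
Comments consist of the following parts, in exactly the following order:
- the comment start delimiter
"<!--"
- text
- the comment end delimiter
"-->"
The text part of comments has the following restrictions:
- must not start with a
">"
character
- must not start with the string
"->"
- must not contain the string
"--"
- must not end with a
"-"
character
The following is an example of a comment.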