WHATWG
HTML 5
Draft Recommendation, 7 February 2009

8.2.4 Tokenization

Implementations must act as if they used the following state machine to
tokenise HTML. The state machine must start in the data state. Most
states consume a single character, which may have various side-effects,
and then either switch the state machine to a new state to reconsume the
same character, switch it to a new state (to consume the next
character), or repeat the same state (to consume the next character).
Some states have more complicated behavior and can consume several
characters before switching to another state.

The exact behavior of certain states depends on a content model flag
that is set after certain tokens are emitted. The flag has several
states: PCDATA, RCDATA, CDATA, and PLAINTEXT. Initially it must be in
the PCDATA state. In the RCDATA and CDATA states, a further escape flag
is used to control the behavior of the tokeniser. It is either true or
false, and initially must be set to the false state. The insertion mode
and the stack of open elements also affect tokenization.

The output of the tokenization step is a series of zero or more of the
following tokens: DOCTYPE, start tag, end tag, comment, character,
end-of-file. DOCTYPE tokens have a name, a public identifier, a system
identifier, and a force-quirks flag. When a DOCTYPE token is created,
its name, public identifier, and system identifier must be marked as
missing (which is a distinct state from the empty string), and the
force-quirks flag must be set to off (its other state is on). Start and
end tag tokens have a tag name, a self-closing flag, and a list of
attributes, each of which has a name and a value. When a start or end
tag token is created, its self-closing flag must be unset (its other
state is that it be set), and its attributes list must be empty.
Comment and character tokens have data.

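Note (non-normative): the token shapes described above could be modelled
roughly as follows. This is a minimal Python sketch under stated
assumptions, not part of the specification: every class and field name
is hypothetical, "missing" is represented by None so that it stays
distinct from the empty string, and the attribute list is held in an
insertion-ordered dict.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DoctypeToken:
    # None stands for "missing", which is distinct from the empty string.
    name: Optional[str] = None
    public_id: Optional[str] = None
    system_id: Optional[str] = None
    force_quirks: bool = False  # False = "off", True = "on"


@dataclass
class TagToken:
    kind: str  # "start" or "end" (hypothetical discriminator)
    name: str = ""
    self_closing: bool = False  # unset when the token is created
    # Attribute name -> value; empty when the token is created.
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class CommentToken:
    data: str = ""


@dataclass
class CharacterToken:
    data: str


@dataclass
class EOFToken:
    pass
```

Using default_factory gives every tag token its own attribute
container, so attributes added to one token cannot leak into another.
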
When a token is emitted, it must immediately be handled by the tree
construction stage. The tree construction stage can affect the state of
the content model flag, and can insert additional characters into the
stream. (For example, the script element can result in scripts
executing and using the dynamic markup insertion APIs to insert
characters into the stream being tokenised.)

When a start tag token is emitted with its self-closing flag set, if
the flag is not acknowledged when it is processed by the tree
construction stage, that is a parse error.

When an end tag token is emitted, the content model flag must be
switched to the PCDATA state.

When an end tag token is emitted with attributes, that is a parse
error.

When an end tag token is emitted with its self-closing flag set, that
is a parse error.

Before each step of the tokeniser, the user agent must first check the
parser pause flag. If it is true, then the tokeniser must abort the
processing of any nested invocations of the tokeniser, yielding control
back to the caller. If it is false, then the user agent may check
whether either one of the scripts in the list of scripts that will
execute as soon as possible, or the first script in the list of scripts
that will execute asynchronously, has completed loading. If one has,
then it must be executed and removed from its list.

The tokeniser state machine consists of the states defined in the
following subsections.

8.2.4.1 Data state

Consume the next input character:

U+0026 AMPERSAND (&)
    When the content model flag is set to one of the PCDATA or RCDATA
    states and the escape flag is false: switch to the character
    reference data state.
    Otherwise: treat it as per the "anything else" entry below.

U+002D HYPHEN-MINUS (-)
    If the content model flag is set to either the RCDATA state or the
    CDATA state, and the escape flag is false, and there are at least
    three characters before this one in the input stream, and the last
    four characters in the input stream, including this one, are U+003C
    LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, and
    U+002D HYPHEN-MINUS ("<!--"), then set the escape flag to true.

    In any case, emit the input character as a character token. Stay in
    the data state.

U+003C LESS-THAN SIGN (<)
    When the content model flag is set to the PCDATA state: switch to
    the tag open state.
    When the content model flag is set to either the RCDATA state or
    the CDATA state, and the escape flag is false: switch to the tag
    open state.
    Otherwise: treat it as per the "anything else" entry below.

U+003E GREATER-THAN SIGN (>)
    If the content model flag is set to either the RCDATA state or the
    CDATA state, and the escape flag is true, and the last three
    characters in the input stream including this one are U+002D
    HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN
    ("-->"), set the escape flag to false.

    In any case, emit the input character as a character token. Stay in
    the data state.

EOF
    Emit an end-of-file token.

Anything else
    Emit the input character as a character token. Stay in the data
    state.

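Note (non-normative): the escape flag bookkeeping in the U+002D and
U+003E entries above can be sketched as follows. This is an
illustration only, assuming that every character is handled by the data
state and that the input consumed so far is available as a string; the
function names and the string values used for the content model flag
are hypothetical.

```python
def update_escape_flag(consumed: str, content_model: str, escape: bool) -> bool:
    """Return the escape flag after the last character of `consumed`.

    `consumed` is the input stream up to and including the current
    input character.
    """
    if content_model not in ("RCDATA", "CDATA") or not consumed:
        return escape
    # U+002D entry: the last four characters are "<!--".
    if not escape and consumed.endswith("<!--"):
        return True
    # U+003E entry: the last three characters are "-->".
    if escape and consumed.endswith("-->"):
        return False
    return escape


def trace_escape_flag(text: str, content_model: str) -> list:
    """Return the escape flag value after each character of `text`."""
    escape = False
    trace = []
    for end in range(1, len(text) + 1):
        escape = update_escape_flag(text[:end], content_model, escape)
        trace.append(escape)
    return trace
```

For the input a<!--b-->c in the CDATA state, the flag becomes true at
the second hyphen of "<!--" and false again at the ">" of "-->".
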
8.2.4.2 Character reference data state

(This cannot happen if the content model flag is set to the CDATA
state.)

Attempt to consume a character reference, with no additional allowed
character.

If nothing is returned, emit a U+0026 AMPERSAND character token.

Otherwise, emit the character token that was returned.

Finally, switch to the data state.

8.2.4.3 Tag open state

The behavior of this state depends on the content model flag.

If the content model flag is set to the RCDATA or CDATA states
    Consume the next input character. If it is a U+002F SOLIDUS (/)
    character, switch to the close tag open state. Otherwise, emit a
    U+003C LESS-THAN SIGN character token and reconsume the current
    input character in the data state.

If the content model flag is set to the PCDATA state
    Consume the next input character:

    U+0021 EXCLAMATION MARK (!)
        Switch to the markup declaration open state.

    U+002F SOLIDUS (/)
        Switch to the close tag open state.

    U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
    LETTER Z
        Create a new start tag token, set its tag name to the
        lowercase version of the input character (add 0x0020 to the
        character's code point), then switch to the tag name state.
        (Don't emit the token yet; further details will be filled in
        before it is emitted.)

    U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
        Create a new start tag token, set its tag name to the input
        character, then switch to the tag name state. (Don't emit the
        token yet; further details will be filled in before it is
        emitted.)

    U+003E GREATER-THAN SIGN (>)
        Parse error. Emit a U+003C LESS-THAN SIGN character token and
        a U+003E GREATER-THAN SIGN character token. Switch to the data
        state.

    U+003F QUESTION MARK (?)
        Parse error. Switch to the bogus comment state.

    Anything else
        Parse error. Emit a U+003C LESS-THAN SIGN character token and
        reconsume the current input character in the data state.

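Note (non-normative): the "add 0x0020 to the character's code point"
rule used here and in later states amounts to ASCII-only lowercasing. A
minimal Python sketch, with a hypothetical helper name:

```python
def ascii_lower(ch: str) -> str:
    """Lowercase a single character, but only within U+0041..U+005A."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 0x0020)
    # Everything else, including non-ASCII letters, is left unchanged.
    return ch
```

This differs from a general-purpose lowercasing routine such as
Python's str.lower(), which also maps non-ASCII characters; the rule in
this section touches only the 26 ASCII capital letters.
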
8.2.4.4 Close tag open state

If the content model flag is set to the RCDATA or CDATA states but no
start tag token has ever been emitted by this instance of the tokeniser
(fragment case), or, if the content model flag is set to the RCDATA or
CDATA states and the next few characters do not match the tag name of
the last start tag token emitted (compared in an ASCII case-insensitive
manner), or if they do but they are not immediately followed by one of
the following characters:

  * U+0009 CHARACTER TABULATION
  * U+000A LINE FEED (LF)
  * U+000C FORM FEED (FF)
  * U+0020 SPACE
  * U+003E GREATER-THAN SIGN (>)
  * U+002F SOLIDUS (/)
  * EOF

...then emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS
character token, and switch to the data state to process the next input
character.

Otherwise, if the content model flag is set to the PCDATA state, or if
the next few characters do match that tag name, consume the next input
character:

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Create a new end tag token, set its tag name to the lowercase
    version of the input character (add 0x0020 to the character's code
    point), then switch to the tag name state. (Don't emit the token
    yet; further details will be filled in before it is emitted.)

U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
    Create a new end tag token, set its tag name to the input
    character, then switch to the tag name state. (Don't emit the token
    yet; further details will be filled in before it is emitted.)

U+003E GREATER-THAN SIGN (>)
    Parse error. Switch to the data state.

EOF
    Parse error. Emit a U+003C LESS-THAN SIGN character token and a
    U+002F SOLIDUS character token. Reconsume the EOF character in the
    data state.

Anything else
    Parse error. Switch to the bogus comment state.

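Note (non-normative): the RCDATA/CDATA check at the start of this state
can be sketched as a lookahead over the characters that follow "</".
This is an illustration only: the function names are hypothetical, EOF
is modelled as the end of the string, and None stands for "no start tag
token has ever been emitted" (the fragment case).

```python
from typing import Optional

# Characters that may immediately follow a matching tag name.
TERMINATORS = {"\t", "\n", "\f", " ", ">", "/"}


def ascii_fold(s: str) -> str:
    """ASCII-only lowercasing (adds 0x20 to A..Z, leaves the rest)."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def closes_last_start_tag(rest: str, last_start_tag: Optional[str]) -> bool:
    """Return True if `rest` (the input after "</") starts an end tag
    that matches the last emitted start tag name."""
    if last_start_tag is None:  # fragment case
        return False
    n = len(last_start_tag)
    candidate = rest[:n]
    if len(candidate) < n or ascii_fold(candidate) != ascii_fold(last_start_tag):
        return False
    following = rest[n:n + 1]
    # An empty slice means EOF, which is also an accepted terminator.
    return following == "" or following in TERMINATORS
```

When the function returns False, the text above applies: "<" and "/"
are emitted as character tokens and the tokeniser returns to the data
state.
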
8.2.4.5 Tag name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the before attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Append the lowercase version of the current input character (add
    0x0020 to the character's code point) to the current tag token's
    tag name. Stay in the tag name state.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Append the current input character to the current tag token's tag
    name. Stay in the tag name state.

8.2.4.6 Before attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Start a new attribute in the current tag token. Set that
    attribute's name to the lowercase version of the current input
    character (add 0x0020 to the character's code point), and its
    value to the empty string. Switch to the attribute name state.

U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
U+003D EQUALS SIGN (=)
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Start a new attribute in the current tag token. Set that
    attribute's name to the current input character, and its value to
    the empty string. Switch to the attribute name state.

8.2.4.7 Attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the after attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003D EQUALS SIGN (=)
    Switch to the before attribute value state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Append the lowercase version of the current input character (add
    0x0020 to the character's code point) to the current attribute's
    name. Stay in the attribute name state.

U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Append the current input character to the current attribute's
    name. Stay in the attribute name state.

When the user agent leaves the attribute name state (and before
emitting the tag token, if appropriate), the complete attribute's name
must be compared to the other attributes on the same token; if there is
already an attribute on the token with the exact same name, then this
is a parse error and the new attribute must be dropped, along with the
value that gets associated with it (if any).

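Note (non-normative): the duplicate-attribute rule above keeps the
first attribute with a given name and drops later ones. A minimal
Python sketch, assuming the attributes of one token are available as
(name, value) pairs in source order; the function name and the
parse-error counter are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def drop_duplicate_attributes(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, str], int]:
    """Return the kept attributes and the number of parse errors."""
    kept: Dict[str, str] = {}
    parse_errors = 0
    for name, value in pairs:
        if name in kept:
            # Exact same name already present: parse error, and the
            # new attribute is dropped together with its value.
            parse_errors += 1
            continue
        kept[name] = value
    return kept, parse_errors
```

Because names are lowercased while they are being built, a tag written
with both CLASS and class attributes ends up with only the first one.
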
8.2.4.8 After attribute name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the after attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003D EQUALS SIGN (=)
    Switch to the before attribute value state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Start a new attribute in the current tag token. Set that
    attribute's name to the lowercase version of the current input
    character (add 0x0020 to the character's code point), and its
    value to the empty string. Switch to the attribute name state.

U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Start a new attribute in the current tag token. Set that
    attribute's name to the current input character, and its value to
    the empty string. Switch to the attribute name state.

8.2.4.9 Before attribute value state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before attribute value state.

U+0022 QUOTATION MARK (")
    Switch to the attribute value (double-quoted) state.

U+0026 AMPERSAND (&)
    Switch to the attribute value (unquoted) state and reconsume this
    input character.

U+0027 APOSTROPHE (')
    Switch to the attribute value (single-quoted) state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Emit the current tag token. Switch to the data state.

U+003D EQUALS SIGN (=)
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Append the current input character to the current attribute's
    value. Switch to the attribute value (unquoted) state.

8.2.4.10 Attribute value (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK (")
    Switch to the after attribute value (quoted) state.

U+0026 AMPERSAND (&)
    Switch to the character reference in attribute value state, with
    the additional allowed character being U+0022 QUOTATION MARK (").

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Append the current input character to the current attribute's
    value. Stay in the attribute value (double-quoted) state.

8.2.4.11 Attribute value (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (')
    Switch to the after attribute value (quoted) state.

U+0026 AMPERSAND (&)
    Switch to the character reference in attribute value state, with
    the additional allowed character being U+0027 APOSTROPHE (').

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Append the current input character to the current attribute's
    value. Stay in the attribute value (single-quoted) state.

8.2.4.12 Attribute value (unquoted) state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the before attribute name state.

U+0026 AMPERSAND (&)
    Switch to the character reference in attribute value state, with
    no additional allowed character.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

U+0022 QUOTATION MARK (")
U+0027 APOSTROPHE (')
U+003D EQUALS SIGN (=)
    Parse error. Treat it as per the "anything else" entry below.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Append the current input character to the current attribute's
    value. Stay in the attribute value (unquoted) state.

8.2.4.13 Character reference in attribute value state

Attempt to consume a character reference.

If nothing is returned, append a U+0026 AMPERSAND character to the
current attribute's value.

Otherwise, append the returned character token to the current
attribute's value.

Finally, switch back to the attribute value state that you were in when
you were switched into this state.

8.2.4.14 After attribute value (quoted) state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the before attribute name state.

U+002F SOLIDUS (/)
    Switch to the self-closing start tag state.

U+003E GREATER-THAN SIGN (>)
    Emit the current tag token. Switch to the data state.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Parse error. Reconsume the character in the before attribute name
    state.

8.2.4.15 Self-closing start tag state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
    Set the self-closing flag of the current tag token. Emit the
    current tag token. Switch to the data state.

EOF
    Parse error. Emit the current tag token. Reconsume the EOF
    character in the data state.

Anything else
    Parse error. Reconsume the character in the before attribute name
    state.

8.2.4.16 Bogus comment state

(This can only happen if the content model flag is set to the PCDATA
state.)

Consume every character up to and including the first U+003E
GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever
comes first. Emit a comment token whose data is the concatenation of
all the characters starting from and including the character that
caused the state machine to switch into the bogus comment state, up to
and including the character immediately before the last consumed
character (i.e. up to the character just before the U+003E or EOF
character). (If the comment was started by the end of the file (EOF),
the token is empty.)

Switch to the data state.

If the end of the file was reached, reconsume the EOF character.

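Note (non-normative): the bogus comment state is a simple scan for the
first ">". A minimal Python sketch, assuming the whole input is
available as a string and that `start` is the index of the first
character that belongs in the comment; the function name and the
returned tuple layout are hypothetical.

```python
from typing import Tuple


def bogus_comment(text: str, start: int) -> Tuple[str, int, bool]:
    """Return (comment data, index of the next character, reached_eof)."""
    end = text.find(">", start)
    if end == -1:
        # No ">" before the end of the file: everything left is comment
        # data, and the caller must still process EOF in the data state.
        return text[start:], len(text), True
    # The ">" itself is consumed but is not part of the comment data.
    return text[start:end], end + 1, False
```

For the input <?php x ?>rest with start at the "?", the comment data is
"?php x ?" and tokenization resumes at "rest".
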
8.2.4.17 Markup declaration open state

(This can only happen if the content model flag is set to the PCDATA
state.)

If the next two characters are both U+002D HYPHEN-MINUS (-) characters,
consume those two characters, create a comment token whose data is the
empty string, and switch to the comment start state.

Otherwise, if the next seven characters are an ASCII case-insensitive
match for the word "DOCTYPE", then consume those characters and switch
to the DOCTYPE state.

Otherwise, if the insertion mode is "in foreign content" and the
current node is not an element in the HTML namespace and the next seven
characters are an ASCII case-sensitive match for the string "[CDATA["
(the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET
character before and after), then consume those characters and switch
to the CDATA section state (which is unrelated to the content model
flag's CDATA state).

Otherwise, this is a parse error. Switch to the bogus comment state.
The next character that is consumed, if any, is the first character
that will be in the comment.

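Note (non-normative): the lookahead performed by this state can be
sketched as below. This is an illustration under stated assumptions:
`pos` is the index just after "<!", the tree-construction condition
(insertion mode "in foreign content" with a current node outside the
HTML namespace) is passed in as a hypothetical boolean, and the
function returns the name of the next state together with the new input
position.

```python
from typing import Tuple


def ascii_fold(s: str) -> str:
    """ASCII-only lowercasing (adds 0x20 to A..Z, leaves the rest)."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def markup_declaration_open(
    text: str, pos: int, foreign_non_html: bool = False
) -> Tuple[str, int]:
    if text.startswith("--", pos):
        return "comment start", pos + 2
    # ASCII case-insensitive match for "DOCTYPE".
    if ascii_fold(text[pos:pos + 7]) == "doctype":
        return "DOCTYPE", pos + 7
    # Case-sensitive match, and only in foreign content.
    if foreign_non_html and text.startswith("[CDATA[", pos):
        return "CDATA section", pos + 7
    # Parse error; nothing is consumed, so the next character becomes
    # the first character of the bogus comment.
    return "bogus comment", pos
```
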
8.2.4.18 Comment start state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
    Switch to the comment start dash state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Emit the comment token. Switch to the data state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character
    in the data state.

Anything else
    Append the input character to the comment token's data. Switch to
    the comment state.

8.2.4.19 Comment start dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
    Switch to the comment end state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Emit the comment token. Switch to the data state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character
    in the data state.

Anything else
    Append a U+002D HYPHEN-MINUS (-) character and the input character
    to the comment token's data. Switch to the comment state.

8.2.4.20 Comment state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
    Switch to the comment end dash state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character
    in the data state.

Anything else
    Append the input character to the comment token's data. Stay in
    the comment state.

8.2.4.21 Comment end dash state

Consume the next input character:

U+002D HYPHEN-MINUS (-)
    Switch to the comment end state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character
    in the data state.

Anything else
    Append a U+002D HYPHEN-MINUS (-) character and the input character
    to the comment token's data. Switch to the comment state.

8.2.4.22 Comment end state

Consume the next input character:

U+003E GREATER-THAN SIGN (>)
    Emit the comment token. Switch to the data state.

U+002D HYPHEN-MINUS (-)
    Parse error. Append a U+002D HYPHEN-MINUS (-) character to the
    comment token's data. Stay in the comment end state.

EOF
    Parse error. Emit the comment token. Reconsume the EOF character
    in the data state.

Anything else
    Parse error. Append two U+002D HYPHEN-MINUS (-) characters and the
    input character to the comment token's data. Switch to the comment
    state.

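Note (non-normative): the five comment states (comment start, comment
start dash, comment, comment end dash, comment end) can be exercised
together with a small loop. This is a Python sketch for illustration
only, assuming the whole input is a string and `pos` is the index just
after "<!--"; the function name and the returned tuple (comment data,
next input position, number of parse errors) are hypothetical.

```python
from typing import Tuple


def tokenize_comment(text: str, pos: int) -> Tuple[str, int, int]:
    data = []
    errors = 0
    state = "comment start"
    while True:
        if pos >= len(text):
            # EOF in any comment state: parse error, emit the token,
            # and leave EOF to be reconsumed in the data state.
            return "".join(data), pos, errors + 1
        ch = text[pos]
        pos += 1
        if state == "comment start":
            if ch == "-":
                state = "comment start dash"
            elif ch == ">":
                return "".join(data), pos, errors + 1
            else:
                data.append(ch)
                state = "comment"
        elif state == "comment start dash":
            if ch == "-":
                state = "comment end"
            elif ch == ">":
                return "".join(data), pos, errors + 1
            else:
                data.append("-" + ch)
                state = "comment"
        elif state == "comment":
            if ch == "-":
                state = "comment end dash"
            else:
                data.append(ch)
        elif state == "comment end dash":
            if ch == "-":
                state = "comment end"
            else:
                data.append("-" + ch)
                state = "comment"
        else:  # comment end
            if ch == ">":
                return "".join(data), pos, errors
            errors += 1
            if ch == "-":
                data.append("-")
            else:
                data.append("--" + ch)
                state = "comment"
```

For example, the input <!--a--b--> yields the data "a--b" with one
parse error, raised when "b" is seen in the comment end state.
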
8.2.4.23 DOCTYPE state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the before DOCTYPE name state.

Anything else
    Parse error. Reconsume the current character in the before DOCTYPE
    name state.

8.2.4.24 Before DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before DOCTYPE name state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Create a new DOCTYPE token. Set its force-quirks flag
    to on. Emit the token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Create a new DOCTYPE token. Set the token's name to the lowercase
    version of the input character (add 0x0020 to the character's code
    point). Switch to the DOCTYPE name state.

EOF
    Parse error. Create a new DOCTYPE token. Set its force-quirks flag
    to on. Emit the token. Reconsume the EOF character in the data
    state.

Anything else
    Create a new DOCTYPE token. Set the token's name to the current
    input character. Switch to the DOCTYPE name state.

8.2.4.25 DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Switch to the after DOCTYPE name state.

U+003E GREATER-THAN SIGN (>)
    Emit the current DOCTYPE token. Switch to the data state.

U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
    Append the lowercase version of the input character (add 0x0020 to
    the character's code point) to the current DOCTYPE token's name.
    Stay in the DOCTYPE name state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit
    that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    Append the current input character to the current DOCTYPE token's
    name. Stay in the DOCTYPE name state.

8.2.4.26 After DOCTYPE name state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the after DOCTYPE name state.

U+003E GREATER-THAN SIGN (>)
    Emit the current DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit
    that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    If the six characters starting from the current input character
    are an ASCII case-insensitive match for the word "PUBLIC", then
    consume those characters and switch to the before DOCTYPE public
    identifier state.

    Otherwise, if the six characters starting from the current input
    character are an ASCII case-insensitive match for the word
    "SYSTEM", then consume those characters and switch to the before
    DOCTYPE system identifier state.

    Otherwise, this is a parse error. Set the DOCTYPE token's
    force-quirks flag to on. Switch to the bogus DOCTYPE state.

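Note (non-normative): the "anything else" entry above is a six-character
keyword lookahead. A minimal Python sketch, assuming `pos` is the index
of the current input character (already known not to be whitespace,
">", or EOF); the function name and the returned tuple (next state,
next input position, force-quirks flag) are hypothetical.

```python
from typing import Tuple


def ascii_fold(s: str) -> str:
    """ASCII-only lowercasing (adds 0x20 to A..Z, leaves the rest)."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def after_doctype_name(text: str, pos: int) -> Tuple[str, int, bool]:
    word = ascii_fold(text[pos:pos + 6])
    if word == "public":
        return "before DOCTYPE public identifier", pos + 6, False
    if word == "system":
        return "before DOCTYPE system identifier", pos + 6, False
    # Parse error: only the current character is consumed, and the
    # force-quirks flag is turned on.
    return "bogus DOCTYPE", pos + 1, True
```
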
8.2.4.27 Before DOCTYPE public identifier state

Consume the next input character:

U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
    Stay in the before DOCTYPE public identifier state.

U+0022 QUOTATION MARK (")
    Set the DOCTYPE token's public identifier to the empty string (not
    missing), then switch to the DOCTYPE public identifier
    (double-quoted) state.

U+0027 APOSTROPHE (')
    Set the DOCTYPE token's public identifier to the empty string (not
    missing), then switch to the DOCTYPE public identifier
    (single-quoted) state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit
    that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit
    that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    Parse error. Set the DOCTYPE token's force-quirks flag to on.
    Switch to the bogus DOCTYPE state.

8.2.4.28 DOCTYPE public identifier (double-quoted) state

Consume the next input character:

U+0022 QUOTATION MARK (")
    Switch to the after DOCTYPE public identifier state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit
    that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit
    that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    Append the current input character to the current DOCTYPE token's
    public identifier. Stay in the DOCTYPE public identifier
    (double-quoted) state.

8.2.4.29 DOCTYPE public identifier (single-quoted) state

Consume the next input character:

U+0027 APOSTROPHE (')
    Switch to the after DOCTYPE public identifier state.

U+003E GREATER-THAN SIGN (>)
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit
    that DOCTYPE token. Switch to the data state.

EOF
    Parse error. Set the DOCTYPE token's force-quirks flag to on. Emit
    that DOCTYPE token. Reconsume the EOF character in the data state.

Anything else
    Append the current input character to the current DOCTYPE token's
    public identifier. Stay in the DOCTYPE public identifier
    (single-quoted) state.

8.2.4.30 After DOCTYPE public identifier state
|
|
|
|
|
|
Consume the next input character:
|
|
|
|
|
|
U+0009 CHARACTER TABULATION
|
|
|
U+000A LINE FEED (LF)
|
|
|
U+000C FORM FEED (FF)
|
|
|
U+0020 SPACE
|
|
|
Stay in the after DOCTYPE public identifier state.
|
|
|
|
|
|
U+0022 QUOTATION MARK (")
|
|
|
Set the DOCTYPE token's system identifier to the empty string
|
|
|
(not missing), then switch to the DOCTYPE system identifier
|
|
|
(double-quoted) state.
|
|
|
|
|
|
U+0027 APOSTROPHE (')
|
|
|
Set the DOCTYPE token's system identifier to the empty string
|
|
|
(not missing), then switch to the DOCTYPE system identifier
|
|
|
(single-quoted) state.
|
|
|
|
|
|
U+003E GREATER-THAN SIGN (>)
|
|
|
Emit the current DOCTYPE token. Switch to the data state.
|
|
|
|
|
|
EOF
|
|
|
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
|
|
Emit that DOCTYPE token. Reconsume the EOF character in the data
|
|
|
state.
|
|
|
|
|
|
Anything else
|
|
|
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
|
|
Switch to the bogus DOCTYPE state.
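
The transitions of the after DOCTYPE public identifier state can be written as a single step function. This is only a sketch under stated assumptions: the token is modelled as a plain dict with hypothetical keys `system_id` and `force_quirks`, `None` stands in for EOF, and the return value `(next_state, emit_token, parse_error)` is an invented convention.

```python
WHITESPACE = {"\t", "\n", "\f", " "}
EOF = None  # assumption: None stands in for the EOF character


def after_doctype_public_identifier(c, token):
    """Process one input character in the after DOCTYPE public identifier state.

    Returns (next_state, emit_token, parse_error).
    """
    if c in WHITESPACE:
        return "after DOCTYPE public identifier", False, False
    if c == '"':
        token["system_id"] = ""  # empty string, which is distinct from missing
        return "DOCTYPE system identifier (double-quoted)", False, False
    if c == "'":
        token["system_id"] = ""
        return "DOCTYPE system identifier (single-quoted)", False, False
    if c == ">":
        return "data", True, False
    if c is EOF:
        token["force_quirks"] = True
        return "data", True, True  # the EOF is then reconsumed in the data state
    # Anything else is a parse error that forces quirks mode.
    token["force_quirks"] = True
    return "bogus DOCTYPE", False, True
```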
|
|
|
|
|
|
8.2.4.31 Before DOCTYPE system identifier state
|
|
|
|
|
|
Consume the next input character:
|
|
|
|
|
|
U+0009 CHARACTER TABULATION
|
|
|
U+000A LINE FEED (LF)
|
|
|
U+000C FORM FEED (FF)
|
|
|
U+0020 SPACE
|
|
|
Stay in the before DOCTYPE system identifier state.
|
|
|
|
|
|
U+0022 QUOTATION MARK (")
|
|
|
Set the DOCTYPE token's system identifier to the empty string (not
missing), then switch to the DOCTYPE system identifier (double-quoted)
state.
|
|
|
|
|
|
U+0027 APOSTROPHE (')
|
|
|
Set the DOCTYPE token's system identifier to the empty string (not
missing), then switch to the DOCTYPE system identifier (single-quoted)
state.
|
|
|
|
|
|
U+003E GREATER-THAN SIGN (>)
|
|
|
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
|
|
Emit that DOCTYPE token. Switch to the data state.
|
|
|
|
|
|
EOF
|
|
|
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
|
|
Emit that DOCTYPE token. Reconsume the EOF character in the data state.
|
|
|
|
|
|
Anything else
|
|
|
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
|
|
Switch to the bogus DOCTYPE state.
|
|
|
|
|
|
8.2.4.32 DOCTYPE system identifier (double-quoted) state
|
|
|
|
|
|
Consume the next input character:
|
|
|
|
|
|
U+0022 QUOTATION MARK (")
|
|
|
Switch to the after DOCTYPE system identifier state.
|
|
|
|
|
|
U+003E GREATER-THAN SIGN (>)
|
|
|
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
|
|
Emit that DOCTYPE token. Switch to the data state.
|
|
|
|
|
|
EOF
|
|
|
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
|
|
Emit that DOCTYPE token. Reconsume the EOF character in the data state.
|
|
|
|
|
|
Anything else
|
|
|
Append the current input character to the current DOCTYPE token's
system identifier. Stay in the DOCTYPE system identifier (double-quoted)
state.
|
|
|
|
|
|
8.2.4.33 DOCTYPE system identifier (single-quoted) state
|
|
|
|
|
|
Consume the next input character:
|
|
|
|
|
|
U+0027 APOSTROPHE (')
|
|
|
Switch to the after DOCTYPE system identifier state.
|
|
|
|
|
|
U+003E GREATER-THAN SIGN (>)
|
|
|
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
|
|
Emit that DOCTYPE token. Switch to the data state.
|
|
|
|
|
|
EOF
|
|
|
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
|
|
Emit that DOCTYPE token. Reconsume the EOF character in the data state.
|
|
|
|
|
|
Anything else
|
|
|
Append the current input character to the current DOCTYPE token's
system identifier. Stay in the DOCTYPE system identifier (single-quoted)
state.
|
|
|
|
|
|
8.2.4.34 After DOCTYPE system identifier state
|
|
|
|
|
|
Consume the next input character:
|
|
|
|
|
|
U+0009 CHARACTER TABULATION
|
|
|
U+000A LINE FEED (LF)
|
|
|
U+000C FORM FEED (FF)
|
|
|
U+0020 SPACE
|
|
|
Stay in the after DOCTYPE system identifier state.
|
|
|
|
|
|
U+003E GREATER-THAN SIGN (>)
|
|
|
Emit the current DOCTYPE token. Switch to the data state.
|
|
|
|
|
|
EOF
|
|
|
Parse error. Set the DOCTYPE token's force-quirks flag to on.
|
|
|
Emit that DOCTYPE token. Reconsume the EOF character in the data state.
|
|
|
|
|
|
Anything else
|
|
|
Parse error. Switch to the bogus DOCTYPE state. (This does not set the
DOCTYPE token's force-quirks flag to on.)
|
|
|
|
|
|
8.2.4.35 Bogus DOCTYPE state
|
|
|
|
|
|
Consume the next input character:
|
|
|
|
|
|
U+003E GREATER-THAN SIGN (>)
|
|
|
Emit the DOCTYPE token. Switch to the data state.
|
|
|
|
|
|
EOF
|
|
|
Emit the DOCTYPE token. Reconsume the EOF character in the data state.
|
|
|
|
|
|
Anything else
|
|
|
Stay in the bogus DOCTYPE state.
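
Because the bogus DOCTYPE state discards every character until the next ">" or EOF and never modifies the token, it reduces to a scan. A minimal Python sketch follows; the function name and the `(new_pos, reached_eof)` return shape are assumptions for illustration, and the caller is expected to emit the DOCTYPE token afterwards in either case.

```python
def skip_bogus_doctype(chars, pos):
    """Skip the remainder of a bogus DOCTYPE starting at chars[pos].

    Returns (new_pos, reached_eof). The DOCTYPE token is left untouched,
    so its force-quirks flag keeps whatever value the earlier state gave it.
    """
    end = chars.find(">", pos)
    if end == -1:
        # No ">" before the end of input: EOF is reconsumed in the data state.
        return len(chars), True
    return end + 1, False
```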
|
|
|
|
|
|
8.2.4.36 CDATA section state
|
|
|
|
|
|
(This can only happen if the content model flag is set to the PCDATA
state, and is unrelated to the content model flag's CDATA state.)
|
|
|
|
|
|
Consume every character up to the next occurrence of the three
character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE
BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF),
whichever comes first. Emit a series of character tokens consisting of
all the characters consumed except the matching three character
sequence at the end (if one was found before the end of the file).
|
|
|
|
|
|
Switch to the data state.
|
|
|
|
|
|
If the end of the file was reached, reconsume the EOF character.
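
The CDATA section state can be sketched in a few lines of Python. The function name and return shape below are assumptions made for illustration; character tokens are modelled as a list of one-character strings.

```python
def consume_cdata_section(chars, pos):
    """Consume a CDATA section body starting at chars[pos].

    Returns (character_tokens, new_pos, reached_eof). The tokenizer
    switches to the data state afterwards in both cases.
    """
    end = chars.find("]]>", pos)
    if end == -1:
        # No terminator: everything up to EOF becomes character tokens,
        # and the EOF is then reconsumed in the data state.
        return list(chars[pos:]), len(chars), True
    # The "]]>" terminator is consumed but not emitted.
    return list(chars[pos:end]), end + 3, False
```

A lone "]" or "]]" that is not followed by ">" is ordinary content, which is why the sketch searches for the full three-character sequence.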
|
|
|
|
|
|
8.2.4.37 Tokenizing character references
|
|
|
|
|
|
This section defines how to consume a character reference. This
definition is used when parsing character references in text and in
attributes.
|
|
|
|
|
|
The behavior depends on the identity of the next character (the one
immediately after the U+0026 AMPERSAND character):
|
|
|
|
|
|
U+0009 CHARACTER TABULATION
|
|
|
U+000A LINE FEED (LF)
|
|
|
U+000C FORM FEED (FF)
|
|
|
U+0020 SPACE
|
|
|
U+003C LESS-THAN SIGN
|
|
|
U+0026 AMPERSAND
|
|
|
EOF
|
|
|
The additional allowed character, if there is one
|
|
|
Not a character reference. No characters are consumed, and
|
|
|
nothing is returned. (This is not an error, either.)
|
|
|
|
|
|
U+0023 NUMBER SIGN (#)
|
|
|
Consume the U+0023 NUMBER SIGN.
|
|
|
|
|
|
The behavior further depends on the character after the U+0023
|
|
|
NUMBER SIGN:
|
|
|
|
|
|
U+0078 LATIN SMALL LETTER X
|
|
|
U+0058 LATIN CAPITAL LETTER X
|
|
|
Consume the X.
|
|
|
|
|
|
Follow the steps below, but using the range of characters U+0030 DIGIT
ZERO through to U+0039 DIGIT NINE, U+0061 LATIN SMALL LETTER A through
to U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL LETTER A
through to U+0046 LATIN CAPITAL LETTER F (in other words, 0-9, A-F,
a-f).
|
|
|
|
|
|
When it comes to interpreting the number, interpret it as a hexadecimal
number.
|
|
|
|
|
|
Anything else
|
|
|
Follow the steps below, but using the range of characters U+0030 DIGIT
ZERO through to U+0039 DIGIT NINE (i.e. just 0-9).
|
|
|
|
|
|
When it comes to interpreting the number, interpret it as a decimal
number.
|
|
|
|
|
|
Consume as many characters as match the range of characters given above.
|
|
|
|
|
|
If no characters match the range, then don't consume any characters (and
unconsume the U+0023 NUMBER SIGN character and, if appropriate, the X
character). This is a parse error; nothing is returned.
|
|
|
|
|
|
Otherwise, if the next character is a U+003B SEMICOLON, consume that
too. If it isn't, there is a parse error.
|
|
|
|
|
|
If one or more characters match the range, then take them all and
interpret the string of characters as a number (either hexadecimal or
decimal as appropriate).
|
|
|
|
|
|
If that number is one of the numbers in the first column of the
following table, then this is a parse error. Find the row with that
number in the first column, and return a character token for the
Unicode character given in the second column of that row.
|
|
|
|
|
|
Number  Unicode character
0x0D    U+000A LINE FEED (LF)
0x80    U+20AC EURO SIGN ('€')
0x81    U+FFFD REPLACEMENT CHARACTER
0x82    U+201A SINGLE LOW-9 QUOTATION MARK ('‚')
0x83    U+0192 LATIN SMALL LETTER F WITH HOOK ('ƒ')
0x84    U+201E DOUBLE LOW-9 QUOTATION MARK ('„')
0x85    U+2026 HORIZONTAL ELLIPSIS ('…')
0x86    U+2020 DAGGER ('†')
0x87    U+2021 DOUBLE DAGGER ('‡')
0x88    U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT ('ˆ')
0x89    U+2030 PER MILLE SIGN ('‰')
0x8A    U+0160 LATIN CAPITAL LETTER S WITH CARON ('Š')
0x8B    U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK ('‹')
0x8C    U+0152 LATIN CAPITAL LIGATURE OE ('Œ')
0x8D    U+FFFD REPLACEMENT CHARACTER
0x8E    U+017D LATIN CAPITAL LETTER Z WITH CARON ('Ž')
0x8F    U+FFFD REPLACEMENT CHARACTER
0x90    U+FFFD REPLACEMENT CHARACTER
0x91    U+2018 LEFT SINGLE QUOTATION MARK ('‘')
0x92    U+2019 RIGHT SINGLE QUOTATION MARK ('’')
0x93    U+201C LEFT DOUBLE QUOTATION MARK ('“')
0x94    U+201D RIGHT DOUBLE QUOTATION MARK ('”')
0x95    U+2022 BULLET ('•')
0x96    U+2013 EN DASH ('–')
0x97    U+2014 EM DASH ('—')
0x98    U+02DC SMALL TILDE ('˜')
0x99    U+2122 TRADE MARK SIGN ('™')
0x9A    U+0161 LATIN SMALL LETTER S WITH CARON ('š')
0x9B    U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK ('›')
0x9C    U+0153 LATIN SMALL LIGATURE OE ('œ')
0x9D    U+FFFD REPLACEMENT CHARACTER
0x9E    U+017E LATIN SMALL LETTER Z WITH CARON ('ž')
0x9F    U+0178 LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ')
|
|
|
|
|
|
Otherwise, if the number is in the range 0x0000 to 0x0008, 0x000E to
0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDEF, or is one
of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE,
0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE,
0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE,
0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE,
0xFFFFF, 0x10FFFE, or 0x10FFFF, or is higher than 0x10FFFF, then this is
a parse error; return a character token for U+FFFD REPLACEMENT
CHARACTER instead.
|
|
|
|
|
|
Otherwise, return a character token for the Unicode character whose
code point is that number.
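
The numeric character reference steps above can be sketched as follows. This is an illustrative Python sketch rather than a reference implementation: the names `REMAPPED`, `is_disallowed`, and `consume_numeric_reference` are hypothetical, the input is assumed to be a string whose position `pos` holds the "#" that follows the ampersand, and the function returns `(character_or_None, new_pos, parse_error)`.

```python
# Remappings copied from the table above (number -> replacement code point).
REMAPPED = {
    0x0D: 0x000A, 0x80: 0x20AC, 0x81: 0xFFFD, 0x82: 0x201A, 0x83: 0x0192,
    0x84: 0x201E, 0x85: 0x2026, 0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6,
    0x89: 0x2030, 0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152, 0x8D: 0xFFFD,
    0x8E: 0x017D, 0x8F: 0xFFFD, 0x90: 0xFFFD, 0x91: 0x2018, 0x92: 0x2019,
    0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9D: 0xFFFD, 0x9E: 0x017E, 0x9F: 0x0178,
}


def is_disallowed(n):
    """True for the numbers that are replaced by U+FFFD."""
    return (
        n <= 0x0008 or n == 0x000B or 0x000E <= n <= 0x001F
        or 0x007F <= n <= 0x009F or 0xD800 <= n <= 0xDFFF
        or 0xFDD0 <= n <= 0xFDEF or n > 0x10FFFF
        or (n & 0xFFFE) == 0xFFFE  # 0xFFFE/0xFFFF at the end of each plane
    )


def consume_numeric_reference(text, pos):
    """text[pos] must be the '#' immediately after the '&'."""
    start = pos
    pos += 1  # consume the NUMBER SIGN
    if pos < len(text) and text[pos] in "xX":
        pos += 1  # consume the X
        digits, base = "0123456789abcdefABCDEF", 16
    else:
        digits, base = "0123456789", 10
    first = pos
    while pos < len(text) and text[pos] in digits:
        pos += 1
    if pos == first:
        # No digits: unconsume the '#' (and the X); parse error, nothing returned.
        return None, start, True
    number = int(text[first:pos], base)
    if pos < len(text) and text[pos] == ";":
        pos += 1
        error = False
    else:
        error = True  # missing semicolon
    if number in REMAPPED:
        return chr(REMAPPED[number]), pos, True
    if is_disallowed(number):
        return "\uFFFD", pos, True
    return chr(number), pos, error
```

Checking the remapping table before the disallowed ranges mirrors the order of the steps, which matters because 0x80 to 0x9F appear in both.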
|
|
|
|
|
|
Anything else
|
|
|
Consume the maximum number of characters possible, with the consumed
characters matching one of the identifiers in the first column of the
named character references table (in a case-sensitive manner).
|
|
|
|
|
|
If no match can be made, then this is a parse error. No characters are
consumed, and nothing is returned.
|
|
|
|
|
|
If the last character matched is not a U+003B SEMICOLON (;), there is a
parse error.
|
|
|
|
|
|
If the character reference is being consumed as part of an attribute,
and the last character matched is not a U+003B SEMICOLON (;), and the
next character is in the range U+0030 DIGIT ZERO to U+0039 DIGIT NINE,
U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, or
U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z, then, for
historical reasons, all the characters that were matched after the
U+0026 AMPERSAND (&) must be unconsumed, and nothing is returned.
|
|
|
|
|
|
Otherwise, return a character token for the character corresponding to
the character reference name (as given by the second column of the
named character references table).
|
|
|
|
|
|
If the markup contains I'm &notit; I tell you, the character reference
is parsed as "not", as in, I'm ¬it; I tell you. But if the markup was
I'm &notin; I tell you, the character reference would be parsed as
"notin;", resulting in I'm ∉ I tell you.
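
The longest-match rule and the attribute exception can be sketched in Python as below. The `NAMED_REFERENCES` dict is a tiny hypothetical excerpt standing in for the named character references table, and the function name and `(character_or_None, new_pos, parse_error)` return shape are assumptions for illustration.

```python
import string

# Assumption: a small excerpt standing in for the full table.
NAMED_REFERENCES = {
    "not": "\u00AC", "not;": "\u00AC", "notin;": "\u2209",
    "amp": "&", "amp;": "&",
}
MAX_NAME_LENGTH = max(len(name) for name in NAMED_REFERENCES)
ALPHANUMERIC = string.ascii_letters + string.digits


def consume_named_reference(text, pos, in_attribute=False):
    """text[pos] is the character immediately after the '&'."""
    match = None
    # Try the longest candidate first so that "notin;" wins over "not".
    for end in range(min(len(text), pos + MAX_NAME_LENGTH), pos, -1):
        if text[pos:end] in NAMED_REFERENCES:
            match = text[pos:end]
            break
    if match is None:
        return None, pos, True  # parse error, nothing consumed
    end = pos + len(match)
    missing_semicolon = not match.endswith(";")
    if (in_attribute and missing_semicolon
            and end < len(text) and text[end] in ALPHANUMERIC):
        # Historical exception: unconsume everything after the '&'.
        return None, pos, True
    return NAMED_REFERENCES[match], end, missing_semicolon
```

With this excerpt, `consume_named_reference("notit; I tell you", 0)` matches only "not" and reports a parse error for the missing semicolon, while `consume_named_reference("notin; I tell you", 0)` matches "notin;" without error.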
|