Why does the entity &nbsp; have length 6 while the entity &darr; has length 1? Is this in the spec somewhere? (Tested in Firefox, Chrome and Safari.)
I agree that this is very weird behavior, but at least it's specified.
The HTML fragment serialization algorithm states that:
Escaping a string (for the purposes of the algorithm above) consists of replacing any occurrences of the "&" character by the string "&amp;", any occurrences of the "<" character by the string "&lt;", any occurrences of the ">" character by the string "&gt;", any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;", and, if the algorithm was invoked in the attribute mode, any occurrences of the """ character by the string "&quot;".
Emphasis mine. If I had to guess, this is for backwards compatibility with older browsers that behaved this way, and to get consistent results when serializing and deserializing strings. If the browser serialized the DOM tree resulting from
<div>&nbsp;</div> to
<div> </div>, deserializing that into a DOM tree again would result in a single space*. Escaping the character is pretty much the only way the browser can achieve consistent behavior.
The replacement of &darr; with
↓, on the other hand, is completely safe and makes sense.
If you're actually interested in the length of the string stored inside the text node, use
.textContent instead; it gives you the result you were expecting.
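As a minimal sketch of that difference, the helper below compares the two lengths for any element-like object. The name measure is hypothetical, and the DOM part only runs where a document object exists (for example in a browser console).

```javascript
// Hypothetical helper: compare the serialized length with the stored text length.
function measure(el) {
  return {
    serialized: el.innerHTML.length, // counts escaped markup, so "&nbsp;" counts as 6
    text: el.textContent.length,     // counts the characters actually stored
  };
}

// Guarded so the snippet also loads outside a browser.
if (typeof document !== "undefined") {
  const div = document.createElement("div");

  div.innerHTML = "&nbsp;";
  console.log(measure(div)); // { serialized: 6, text: 1 }

  div.innerHTML = "&darr;";
  console.log(measure(div)); // { serialized: 1, text: 1 }
}
```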
* Well, not really, since it would still be a
U+00A0, but I can see why that might have been considered confusing in the early DOM days.
Consider the following HTML snippet:
<div> <p>foo & bar 𝌆 baz</p> </div>
Let’s look up
innerHTML in the HTML Living Standard to see what happens when we run
div.innerHTML in the context of the above HTML document. Ah, it defers to the DOM Parsing spec, which says:
On getting, if the context object’s node document is an HTML document, then the attribute must return the result of running the HTML fragment serialization algorithm on the context object; […]
The HTML fragment serialization algorithm is defined in the HTML Living Standard. Following the algorithm with the
div.innerHTML example in mind, it’s clear that the first time it will descend to the “if current node is an
Element” branch under step 3.2. This adds
<p> to the output.
Then it calls the algorithm again on the text node within. This time we end up in the “if current node is a
Text node” branch. It says:
[…] Otherwise, append the value of current node’s
data IDL attribute, escaped as described below.
The data IDL attribute contains the textual contents of the text node. The escaping instructions are defined as follows:
Escaping a string (for the purposes of the algorithm above) consists of running the following steps:
1. Replace any occurrence of the "&" character by the string "&amp;".
2. Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
3. If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
4. If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".
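The escaping steps above can be sketched in plain JavaScript as follows. This is only an illustration of the quoted steps, not the browsers' actual implementation; the function name escapeForSerialization and its attributeMode parameter are made up for this example.

```javascript
// Hypothetical sketch of the escaping steps from the serialization algorithm.
function escapeForSerialization(value, attributeMode = false) {
  let out = value
    .replace(/&/g, "&amp;")        // "&" goes first, so later entities are not double-escaped
    .replace(/\u00A0/g, "&nbsp;"); // U+00A0 NO-BREAK SPACE

  if (attributeMode) {
    out = out.replace(/"/g, "&quot;");
  } else {
    out = out.replace(/</g, "&lt;").replace(/>/g, "&gt;");
  }
  return out;
}

console.log(escapeForSerialization("\u00A0").length); // 6, i.e. "&nbsp;"
console.log(escapeForSerialization("\u2193").length); // 1, "↓" is left as-is
console.log(escapeForSerialization("foo & bar"));     // "foo &amp; bar"
```

Every character outside this short list passes through untouched, which is why the arrow keeps length 1 while the no-break space grows to 6.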
Only the abovementioned symbols are escaped as HTML entities in the result of
.innerHTML – other Unicode symbols are just displayed in their raw form, regardless of how they are represented in the HTML source code.
Because of this,
"&darr;" in the HTML source code turns into
"↓" when reading it back out through
innerHTML. But e.g.
"&" turn into
" " or