
unicode - Difference in HTML Entity length in JavaScript

Question:

Why does the entity &nbsp; have length 6 while the entity &darr; has length 1? Is this in the spec somewhere? (Tested in Firefox, Chrome and Safari.)

JSFiddle

Answer:

I agree that this is very weird behavior, but at least it's specified.

The HTML fragment serialization algorithm states that:

Escaping a string (for the purposes of the algorithm above) consists of replacing any occurrences of the "&" character by the string "&amp;", any occurrences of the "<" character by the string "&lt;", any occurrences of the ">" character by the string "&gt;", any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;", and, if the algorithm was invoked in the attribute mode, any occurrences of the """ character by the string "&quot;".

The U+00A0 NO-BREAK SPACE clause is the relevant part. If I had to guess, this is there for backwards compatibility with older browsers that did this, and to get consistent behavior when strings are serialized and then deserialized again. If the browser serialized the DOM tree resulting from <div>&nbsp;&nbsp;</div> as <div>  </div> (raw no-break space characters instead of entities), deserializing that into a DOM tree again would result in a single space*. Escaping is pretty much the only way the browser can achieve consistent behavior.

Replacing &darr; with ↓, on the other hand, is completely safe and makes sense.

If you're actually interested in the length of the string stored inside the text node, use .textContent instead; that gives the result you were expecting.

* Well, not really, since it would still be a U+00A0 no-break space, but I can see why people might have found this confusing in the early DOM days.
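
To compare the two lengths discussed in this answer, both properties can be read from the same element. This is only a rough sketch: the helper name `compareLengths` is made up for illustration, and the demonstration part assumes a browser, so it is skipped when no `document` is available.

```javascript
// Hypothetical helper: reports the length of an element's serialized markup
// next to the length of its underlying text.
function compareLengths(el) {
  return {
    html: el.innerHTML.length,   // length of the serialized (escaped) markup
    text: el.textContent.length, // length of the underlying text data
  };
}

// Browser-only demonstration; skipped where no DOM is available.
if (typeof document !== "undefined") {
  const nbsp = document.createElement("div");
  nbsp.innerHTML = "&nbsp;";
  console.log(compareLengths(nbsp)); // html: 6, text: 1

  const arrow = document.createElement("div");
  arrow.innerHTML = "&darr;";
  console.log(compareLengths(arrow)); // html: 1, text: 1
}
```

The no-break space is a single character in the DOM either way; only the serialized form returned by innerHTML grows to six characters.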

Answer:

Consider the following HTML snippet:

<div>
  <p>foo &amp; bar &#x1D306; baz</p>
</div>

Let’s look up innerHTML in the HTML Living Standard to see what happens when we run div.innerHTML in the context of the above HTML document. Ah, it defers to the DOM Parsing spec, which says:

On getting, if the context object’s node document is an HTML document, then the attribute must return the result of running the HTML fragment serialization algorithm on the context object; […]

The HTML fragment serialization algorithm is defined in the HTML Living Standard. Following the algorithm with the div.innerHTML example in mind, it’s clear that the first time it will descend to the “if current node is an Element” branch under step 3.2. This adds <p> to the output.

Then it calls the algorithm again on the text node within. This time we end up in the “if current node is a Text node” branch. It says:

[…] Otherwise, append the value of current node’s data IDL attribute, escaped as described below.

The data IDL attribute contains the textual contents of the text node. The escaping instructions are defined as follows:

Escaping a string (for the purposes of the algorithm above) consists of running the following steps:

  1. Replace any occurrence of the & character by the string &amp;.

  2. Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string &nbsp;.

  3. If the algorithm was invoked in the attribute mode, replace any occurrences of the " character by the string &quot;.

  4. If the algorithm was not invoked in the attribute mode, replace any occurrences of the < character by the string &lt;, and any occurrences of the > character by the string &gt;.
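
The four steps above can be sketched as a plain function. This is an illustration of the quoted steps only, not how any browser actually implements them; the name `escapeForSerialization` and the `attributeMode` parameter are assumptions made for this example.

```javascript
// Hypothetical sketch of the escaping steps quoted above.
function escapeForSerialization(data, attributeMode = false) {
  // Step 1: ampersands first, so the entities produced by later steps
  // are not escaped a second time.
  let out = data.replace(/&/g, "&amp;");
  // Step 2: U+00A0 NO-BREAK SPACE.
  out = out.replace(/\u00A0/g, "&nbsp;");
  if (attributeMode) {
    // Step 3: double quotes, attribute mode only.
    out = out.replace(/"/g, "&quot;");
  } else {
    // Step 4: angle brackets, outside attribute mode only.
    out = out.replace(/</g, "&lt;").replace(/>/g, "&gt;");
  }
  return out;
}

console.log(escapeForSerialization("\u00A0").length); // 6, the string "&nbsp;"
console.log(escapeForSerialization("\u2193").length); // 1, the arrow is left as-is
console.log(escapeForSerialization("foo & bar"));     // "foo &amp; bar"
```

Running the ampersand replacement first matters: if it ran last, the "&" in a freshly produced "&nbsp;" would itself be escaped to "&amp;nbsp;".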

Only the abovementioned symbols are escaped as HTML entities in the result of .innerHTML – other Unicode symbols are just displayed in their raw form, regardless of how they are represented in the HTML source code.

Because of this, "&darr;" in the HTML source code turns into "↓" when reading it back out through innerHTML. But e.g. "&amp;" or "&#x26;" turn into "&amp;", and "&nbsp;" or "&#xA0;" become "&nbsp;".
