Why is "Using anything but a utf-8 decoder ... might be insecure" in the URL decoding algorithm?

I'm working on a URL parser and have a question about the W3C URL specification (http://www.w3.org/TR/2014/WD-url-1-20141209/). In the section "2. Percent-encoded bytes" it gives the following algorithm (emphasis mine):

To percent decode a byte sequence input, run these steps:

Using anything but a utf-8 decoder when the input contains bytes outside the range 0x00 to 0x7F might be insecure and is not recommended.

  • Let output be an empty byte sequence.

  • For each byte byte in input, run these steps:

    • If byte is not "%", append byte to output.

    • Otherwise, if byte is "%" and the next two bytes after byte in input are not in the ranges 0x30 to 0x39, 0x41 to 0x46, and 0x61 to 0x66, append byte to output.

    • Otherwise, run these substeps:

      • Let bytePoint be the two bytes after byte in input, decoded, and then interpreted as a hexadecimal number.

      • Append a byte whose value is bytePoint to output.

      • Skip the next two bytes in input.

  • Return output.
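For reference, the quoted steps can be sketched in Python roughly as follows. This is only my reading of the algorithm, not code from the specification: the name percent_decode and the constant HEX_DIGITS are made up for illustration, and treating a "%" with fewer than two bytes after it as a literal "%" is an assumption, since the quoted text does not spell that case out.

```python
# ASCII hex digits: 0x30-0x39, 0x41-0x46, 0x61-0x66 (hypothetical helper constant).
HEX_DIGITS = b"0123456789ABCDEFabcdef"


def percent_decode(data: bytes) -> bytes:
    """Sketch of the quoted 'percent decode' steps (illustrative name)."""
    output = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        pair = data[i + 1:i + 3]  # the next two bytes, possibly fewer at the end
        if byte != 0x25:  # not "%": copy the byte unchanged
            output.append(byte)
        elif len(pair) < 2 or any(b not in HEX_DIGITS for b in pair):
            # "%" not followed by two hex digits: copy the "%" itself.
            # Assumption: a truncated pair is handled the same way.
            output.append(byte)
        else:
            # Decode the two bytes, then interpret them as a hexadecimal number.
            byte_point = int(pair.decode("utf-8"), 16)
            output.append(byte_point)
            i += 2  # skip the two hex digits
        i += 1
    return bytes(output)
```

With this sketch, percent_decode(b"a%20b") gives b"a b", while malformed sequences such as b"%zz" or a trailing b"100%" come back unchanged.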

In the original specification, the word "decoded" (in bold) is a link to the UTF-8 decode algorithm. I assume this is the "utf-8 decoder" mentioned in the note (in italics) above.

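However, I don't see why a UTF-8 decoder matters here. By the time we reach that substep, we already know the two bytes are ASCII hex digits because of step 2, so UTF-8 decoding should not change anything.

So - is there some case where a UTF-8 decoder makes a difference, given that the bytes are in the ranges 0x30 to 0x39, 0x41 to 0x46, and 0x61 to 0x66? - or is the note unnecessary?

Also, bytes outside the range 0x00 to 0x7f are copied as-is (by step 1, since they are not %, or by step 2, since they are not ASCII hex digits), so the decoder does not seem to affect them either.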
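To make that reasoning concrete, here is a small check, again only a sketch with made-up names (HEX_DIGITS, decodes_identically, all_pairs, sample). It confirms that every pair of ASCII hex digits decodes to the same text under ASCII, UTF-8, and Latin-1, and then shows that bytes above 0x7F, which percent decoding can produce, do depend on the decoder. Whether that second point is what the note is warning about is my guess; the quoted text does not say.

```python
HEX_DIGITS = b"0123456789ABCDEFabcdef"


def decodes_identically(pair: bytes) -> bool:
    """True if the bytes decode to the same text under all three decoders."""
    return pair.decode("ascii") == pair.decode("utf-8") == pair.decode("latin-1")


# Every two-byte combination that can pass the hex-digit check in step 2.
all_pairs = [bytes([a, b]) for a in HEX_DIGITS for b in HEX_DIGITS]
all_same = all(decodes_identically(p) for p in all_pairs)

# Bytes outside 0x00 to 0x7F are a different matter: the same bytes
# become different text depending on which decoder is applied to them.
sample = b"\xe2\x82\xac"
differs = sample.decode("utf-8") != sample.decode("latin-1")
```

Here all_same is True for every hex-digit pair, so the choice of decoder cannot change bytePoint, while differs is True for the sample bytes.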


Source: https://habr.com/ru/post/1598452/

