Convert to Base64 in JavaScript without the deprecated escape() call

My name is Fest.

I need to convert strings to and from Base64 in a browser via JavaScript. The topic is pretty well covered on this site and on Mozilla, and the suggested solution is along these lines:

    function toBase64(str) {
      return window.btoa(unescape(encodeURIComponent(str)));
    }

    function fromBase64(str) {
      return decodeURIComponent(escape(window.atob(str)));
    }

I did a bit more research and found out that escape() and unescape() are deprecated and should no longer be used. With that in mind, I tried removing the calls to the deprecated functions, which gives:

    function toBase64(str) {
      return window.btoa(encodeURIComponent(str));
    }

    function fromBase64(str) {
      return decodeURIComponent(window.atob(str));
    }

This seems to work, but it raises the following questions:

(1) Why did the originally proposed solution include the escape() and unescape() calls? The solution was proposed before the deprecation, but presumably these functions added some value at the time.

(2) Are there any edge cases where removing these deprecated calls will cause my wrapper functions to fail?

NOTE. StackOverflow has other, much more detailed and complex solutions to the string => Base64 conversion problem. I'm sure they work fine, but my question is specifically related to this popular solution.

Thanks,

Fest

+5
javascript encoding base64
Jun 03 '15 at 10:29
2 answers

TL;DR In principle, escape() / unescape() are not needed, and your second version without the deprecated functions is safe, but it produces longer Base64 output:

  • console.log(decodeURIComponent(atob(btoa(encodeURIComponent("€uro")))))
  • console.log(decodeURIComponent(escape(atob(btoa(unescape(encodeURIComponent("€uro")))))))

Both produce the output "€uro", but the version without escape() / unescape() has a longer Base64 representation:

  • btoa(encodeURIComponent("€uro")).length // = 16
  • btoa(unescape(encodeURIComponent("€uro"))).length // = 8

escape() / unescape() become necessary only if there is a counterpart on the other side (for example, a PHP script you cannot change that expects the Base64 to be produced in a particular way).

Long version:

First, to better understand the differences between the two versions of toBase64() and fromBase64() that you give above, let's take a look at btoa(), which is at the heart of the problem. The documentation says that the name btoa is a mnemonic, in that

“b” can mean “binary” and “a” can mean “ASCII”.

which is misleading, as the documentation hastens to add that

in practice, however, for historical reasons, the input and output of these functions are Unicode strings.

Even worse, btoa() really only accepts

characters ranging from U+0000 to U+00FF

Put plainly, btoa() works out of the box only for text limited to that range, such as English alphanumeric text.

The purpose of the encodeURIComponent() function that you use in both of your versions is to help with strings that contain characters outside the range U+0000 to U+00FF. An example is the string "aä€", consisting of three characters:

  • a (U+0061)
  • ä (U+00E4)
  • € (U+20AC)

Here, only the first two characters are in the range. The third character, the euro sign, is outside it, and window.btoa("€") throws an out-of-range error. To avoid such an error, we need a way to represent "€" within the set from U+0000 to U+00FF. This is what window.encodeURIComponent does:

window.encodeURIComponent("aä€")
creates the following string:
"a%C3%A4%E2%82%AC" in which some characters have been encoded

  • a = a (stays the same)
  • ä = %C3%A4 (changed to its UTF-8 representation)
  • € = %E2%82%AC (changed to its UTF-8 representation)

The "changed to its UTF-8 representation" step uses the character "%" followed by a two-digit hexadecimal number for each byte of the character's UTF-8 representation. "%" is U+0025 and is therefore within the range btoa() allows. The result of window.encodeURIComponent("aä€") can then be passed to btoa(), since it no longer contains any out-of-range characters:

btoa("a%C3%A4%E2%82%AC") // = "YSVDMyVBNCVFMiU4MiVBQw=="
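Both behaviors are easy to reproduce in a console. The following is a minimal sketch; btoaThrows is a helper name made up for this illustration, and btoa is assumed to be available as a global (as in browsers and recent Node.js versions):

```javascript
// Returns true if btoa() rejects the input (it throws for characters above U+00FF).
function btoaThrows(str) {
  try {
    btoa(str);
    return false;
  } catch (e) {
    return true;
  }
}

const rawFails = btoaThrows("aä€");                // "€" is U+20AC, outside the range
const percentEncoded = encodeURIComponent("aä€");  // every character is now ASCII
const encodedFails = btoaThrows(percentEncoded);   // nothing left for btoa() to reject

console.log(rawFails, percentEncoded, encodedFails);
```

The raw string is rejected, while the percent-encoded string passes because it consists only of ASCII characters.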

The point of using unescape() between btoa() and encodeURIComponent() is that each byte of the UTF-8 representation takes three characters (%xx) to store a byte value from 0x00 to 0xFF. This is where unescape() can play an optional role: it takes each such %xx sequence and puts in its place a single Unicode character in the allowed range from U+0000 to U+00FF.

To check:

  • btoa(encodeURIComponent("aä€")).length // = 24
  • btoa(unescape(encodeURIComponent("aä€"))).length // = 8

The main difference is the shorter Base64 text that results from the additional pass through the optional escape() / unescape(); for text that is mostly plain ASCII, the saving is minimal anyway.

The main lesson is that the name btoa() is misleading: it requires Unicode characters in the range U+0000 to U+00FF, which is what encodeURIComponent() produces. The deprecated escape() / unescape() serve only to save space, which may be desirable but is not necessary. The problem of Unicode characters above U+00FF is treated here as "the Unicode btoa / atob problem", a discussion that even mentions better ways to do the whole Unicode to UTF-8 to Base64 encoding that are possible in modern browsers.
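One such modern approach, sketched here as an assumption rather than as what the linked discussion recommends, is to let TextEncoder and TextDecoder handle the UTF-8 step. The function names toBase64Utf8 and fromBase64Utf8 are made up for this example:

```javascript
// Sketch only: text -> UTF-8 bytes -> "binary string" -> Base64, without escape()/unescape().
function toBase64Utf8(str) {
  const bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);       // one character (U+0000..U+00FF) per byte
  }
  return btoa(binary);
}

// Reverse direction: Base64 -> "binary string" -> UTF-8 bytes -> text.
function fromBase64Utf8(base64) {
  const binary = atob(base64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

console.log(toBase64Utf8("€uro"), fromBase64Utf8(toBase64Utf8("€uro")));
```

For the inputs tried here it yields the same Base64 text as btoa(unescape(encodeURIComponent(str))), so it should be interchangeable with that version on the receiving side.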

+9
Jul 14 '15 at 16:22

TL;DR / Brief Summary

Do not use btoa(encodeURIComponent(str)) and decodeURIComponent(atob(str)) - this is "stupid".

"Convert a string to Base64" usually means "encode the string as UTF-8 and encode the bytes as Base64", and this is exactly what btoa(unescape(encodeURIComponent(str))) does. btoa(encodeURIComponent(str)) does something else that is useless for any case I can imagine, even though it never throws an error, as explained in humanityANDpeace's detailed answer.







What does "convert string to Base64" mean?

Base64 is a binary-to-text encoding: a sequence of bytes is encoded as a sequence of ASCII characters.[1] Therefore, it is not possible to directly encode text as Base64. Conceptually, it is always a two-step procedure:

  1. convert string to bytes (using some character encoding )
  2. encode bytes as Base64

You can use basically any character encoding (also called character set[2] or encoding scheme) you want; it just needs to be able to represent all the necessary characters, and it must be the same in both directions (text to Base64 and Base64 to text). Since there are many different character encodings, the protocol or API must define which one is used. If an API expects a "Base64 encoded string" and does not mention a character encoding, nowadays you can usually assume that UTF-8 is expected.[3]

Encoding the bytes from step 1 as Base64 is pretty simple:
a) Take three input bytes to get 24 bits.
b) Split them into four blocks of 6 bits each to get four numbers in the range 0 ... 63.
c) Convert the numbers to ASCII characters via a lookup table and append these characters to the output.
d) Go to a).
More information on Base64 itself is beyond the scope of this answer.
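Still, the loop described above is short enough to sketch in code. This is a minimal illustration, not a replacement for btoa: base64FromBytes and BASE64_ALPHABET are names made up here, and the "=" padding for a final group of fewer than three bytes comes from the Base64 standard rather than from the steps listed above.

```javascript
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: encodes an array of byte values (0..255) as Base64.
function base64FromBytes(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const remaining = bytes.length - i;
    // Take up to three bytes and pack them into 24 bits; missing bytes count as 0.
    const b0 = bytes[i];
    const b1 = remaining > 1 ? bytes[i + 1] : 0;
    const b2 = remaining > 2 ? bytes[i + 2] : 0;
    const triple = (b0 << 16) | (b1 << 8) | b2;
    // Split into four 6-bit numbers and look each one up in the table.
    out += BASE64_ALPHABET[(triple >> 18) & 63];
    out += BASE64_ALPHABET[(triple >> 12) & 63];
    // "=" fills the positions that carry no input bits in a short final group.
    out += remaining > 1 ? BASE64_ALPHABET[(triple >> 6) & 63] : "=";
    out += remaining > 2 ? BASE64_ALPHABET[triple & 63] : "=";
  }
  return out;
}

// The three bytes of the ASCII text "uro".
console.log(base64FromBytes([0x75, 0x72, 0x6f]));
```

Note that the input is an array of bytes, not text, which is exactly the point of the two-step procedure.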

What does btoa do?

By now, you might be thinking, "This answer can't be right. It claims that it is not possible to directly encode text as Base64, yet this is exactly what btoa does: it takes text and spits out Base64."

No. It does not take text and return Base64; it takes an argument of type string and returns Base64. But this string argument does not represent text; it is just a weird way to store a sequence of bytes. Each byte is represented by a character whose numeric code point value is equal to the byte value.[4]

A note in the HTML standard states that "b" can mean "binary" and "a" can mean "ASCII". Contrary to popular belief, I don't think btoa is badly named. It does not accept text; it accepts binary data and creates an ASCII string using Base64, so the short form of "binary to ASCII" is an absolutely correct name. It is the type of the argument that is confusing.

The btoa definition in the HTML standard simply states:

[...] the user agent must convert this argument to a sequence of octets, whose nth octet is the eight-bit representation of the code point of the nth character of the argument, and then must apply the base64 algorithm to this sequence of octets and return the result.
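That definition can be checked directly. The following small sketch uses illustrative variable names and assumes btoa and atob are available as globals; it builds a string whose characters stand for the three UTF-8 bytes of "€" and hands it to btoa:

```javascript
// A "binary string": each character stands for one byte (bytes[i] === str.charCodeAt(i)).
const binaryString = String.fromCharCode(0xe2, 0x82, 0xac); // the UTF-8 bytes of "€"

// The octets btoa derives from its argument: one per character.
const bytesSeenByBtoa = Array.from(binaryString, (ch) => ch.charCodeAt(0));

const encoded = btoa(binaryString);
const roundTrip = atob(encoded) === binaryString; // atob returns the same kind of string

console.log(bytesSeenByBtoa, encoded, roundTrip);
```

The string passed to btoa here has three characters and is not readable text at all; it only carries byte values.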

I don't know, and will probably never know, why they didn't choose another argument type, for example an array of numbers. Maybe performance was a concern at the time btoa was first specified?

What does unescape(encodeURIComponent(str)) do?

By now, you might be thinking, "If the first step in converting text to Base64 is encoding the text into bytes, then how does btoa(unescape(encodeURIComponent(str))) achieve this? btoa does not do it, and neither unescape nor encodeURIComponent seems to have anything to do with character encoding."

In fact, encodeURIComponent does deal with character encoding. The standard says:

The encodeURIComponent function computes a new [...] URI in which each instance of certain code points is replaced [...] with escape sequences representing the UTF-8 encoding of the code point.

So now we have the UTF-8 bytes in percent-encoded form. To convert the percent-encoded bytes to a binary string suitable for btoa, you can use unescape, whose description of behavior reads, among other things:

  • If c is the code unit 0x0025 (PERCENT SIGN), then
    • [... how to decode %uXXXX ...]
    • Otherwise, if k ≤ length - 3 and [... two hexadecimal digits follow ...] then
      • Set c to a code unit whose value is the integer represented by [...] two hexadecimal digits at indices k + 1 and k + 2 within the string.

Therefore, after encodeURIComponent has stored the UTF-8 bytes as %XX, unescape turns each of them into a single character whose code point equals the byte value, exactly as btoa requires. Thus, overall, btoa(unescape(encodeURIComponent(str))) encodes the text into UTF-8 bytes, which are then encoded as Base64.
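Taking the chain apart one call at a time makes this visible. The sketch below uses illustrative variable names and the single character "€" as input:

```javascript
const text = "€";

// Step 1a: UTF-8 bytes written as percent escapes.
const percentEncoded = encodeURIComponent(text);

// Step 1b: each %XX becomes one character whose code equals the byte value.
const binaryString = unescape(percentEncoded);
const byteValues = Array.from(binaryString, (ch) => ch.charCodeAt(0));

// Step 2: the bytes are encoded as Base64.
const base64 = btoa(binaryString);

// The reverse chain restores the original text.
const decoded = decodeURIComponent(escape(atob(base64)));

console.log(percentEncoded, byteValues, base64, decoded);
```

The intermediate binaryString is three characters long, one per UTF-8 byte of the euro sign.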

Return to the original question.

In case you forgot, the question was:

(1) Why did the originally proposed solution include the escape() and unescape() calls? The solution was proposed before the deprecation, but presumably these functions added some value at the time.

(2) Are there any edge cases where removing these deprecated calls will cause my wrapper functions to fail?

Without unescape, you won't get a Base64 representation of a UTF-8 encoded string. btoa(encodeURIComponent(str)) encodes the text into some strange bytes (not a standardized Unicode encoding scheme, but the bytes you get by storing the URI-encoded string as ASCII), which are then encoded as Base64. Thus, unescape is necessary to comply with the standard - OK, encodeURIComponent and ASCII are also standardized, but no one expects this strange combination.

If you are the only one converting to Base64 and back, then yes, you can use btoa(encodeURIComponent(str)), and it will never throw an error, as explained in humanityANDpeace's detailed answer (question (2) has been answered well enough, I think).

But in that case you would be much better off just using the result of encodeURIComponent directly. It is already pure ASCII and is always shorter than btoa(encodeURIComponent(str)). If you want something smaller than encodeURIComponent(str), you can use btoa(unescape(encodeURIComponent(str))) (which is smaller when the input string contains many non-ASCII characters).

If you convert to Base64 because some other party, API, or protocol expects Base64, then you simply cannot use btoa(encodeURIComponent(str)), because nobody will understand the result.
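To see what such a party would actually receive, here is a sketch with a hypothetical receiver (receiverDecode is a made-up name) that does the usual thing: decode the Base64 and interpret the bytes as UTF-8. TextDecoder is assumed to be available.

```javascript
// Hypothetical receiver: Base64 -> bytes -> text, assuming UTF-8.
function receiverDecode(base64) {
  const bytes = Uint8Array.from(atob(base64), (ch) => ch.charCodeAt(0));
  return new TextDecoder("utf-8").decode(bytes);
}

const str = "€uro";
const standard = btoa(unescape(encodeURIComponent(str))); // UTF-8 bytes as Base64
const nonStandard = btoa(encodeURIComponent(str));        // percent-encoded text as Base64

const seenStandard = receiverDecode(standard);       // the original text
const seenNonStandard = receiverDecode(nonStandard); // the percent-encoded form instead

console.log(seenStandard, seenNonStandard);
console.log(encodeURIComponent(str).length, nonStandard.length, standard.length);
```

The receiver recovers the text only from the first variant; from the second it gets the percent-encoded string and would need to know that decodeURIComponent still has to be applied. The second log line compares the three sizes discussed above.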

Oh, and btoa(unescape(encodeURIComponent(str))) cannot have been "proposed before the deprecation" of unescape:
unescape was removed from the standard in version 3, the very same version in which encodeURIComponent was added. unescape was still described in the document, but was moved to Annex B.2, whose introduction states that it "suggests uniform semantics [...] without making the properties or their semantics part of this standard." But since browsers have to remain backward compatible, it probably won't be removed any time soon.




Try it yourself:

    function run() {
      let Base64Function = new Function("str", $("#algorithm").val());
      let base64 = Base64Function($("#input").val());
      $("#Base64Text").text("Output: " + base64);
      let charset = $('#charset').val();
      let uri = "data:text/plain"
        + (charset ? ";charset=" + charset : '')
        + ($("#interpret").prop('checked') ? ";base64" : '')
        + "," + base64;
      $("#dataURI").text(uri);
      $("#dataURI").attr('href', uri);
      $("#Base64iframe").attr('src', uri);
    }

    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
    <label for="input">Text to encode:</label>
    <input type="text" id="input" value="abc€😀"/><br />
    <label for="algorithm">Encode function:</label>
    <input type="text" id="algorithm" size="50"/><br />
    <button type="button" onclick="run();">Run</button>
    Defaults:
    <button type="button" onclick='
      $("#algorithm").val("return btoa(unescape(encodeURIComponent(str)))");
      $("#charset").val("UTF-8");
      $("#interpret").prop("checked",true);
    '>UTF-8 Base64</button>
    <button type="button" onclick='
      $("#algorithm").val("return btoa(encodeURIComponent(str))");
      $("#charset").val(""); // unknown charset: the result is not UTF-8
      $("#interpret").prop("checked",true);
    '>wrong</button>
    <button type="button" onclick='
      $("#algorithm").val("return encodeURIComponent(str)");
      $("#charset").val("UTF-8");
      $("#interpret").prop("checked",false);
    '>without btoa (not Base64)</button>
    <br />
    <div id="Base64Text">Output:</div>
    <label for="charset">Interpret as this character encoding:</label>
    <input type="text" id="charset" /><br />
    <label for="interpret">Interpret as Base64:</label>
    <input type="checkbox" id="interpret" /><br />
    <div><a id="dataURI"></a></div>
    <iframe id="Base64iframe"></iframe>

This snippet tests the Base64 result by creating a dataURI, but this concept applies to other Base64 applications as well.




Note:

In some quotations I use [...] and [... short summary ...] to leave out or shorten things that are unimportant in my opinion.
The bracketed text is obviously not part of the source.

Footnotes:

[1] The standard states that Base64 is "designed to represent arbitrary sequences of octets" (an octet is a byte of eight bits).

[2] A character set is not exactly the same thing as a character encoding. However, a coded character set can always be considered to implicitly define a character encoding, which is why "character set" and "character encoding" are often used as synonyms. Maybe they once were the same thing? Sometimes the term charset is explicitly used as a short term for character encoding rather than for character set.

[3] At least on websites, UTF-8 dominates. See also UTF-8 Everywhere.

[4] This is actually the ISO_8859-1 encoding, but I would not think of it that way. It is better to think of bytes[i]==str.charCodeAt(i).

+1
Aug 29 '19 at 15:16