TL;DR / Brief Summary
Do not use btoa(encodeURIComponent(str)) and decodeURIComponent(atob(str)); this is nonsense.
"Convert a string to Base64" usually means "encode the string as UTF-8 and encode the bytes as Base64", and this is exactly what btoa(unescape(encodeURIComponent(str))) does. btoa(encodeURIComponent(str)) does something else that is useless for any case I can imagine, even though it never throws an error, as explained in humanityANDpeace's detailed answer.
What does "convert string to Base64" mean?
Base64 is a binary-to-text encoding: a sequence of bytes is encoded as a sequence of ASCII characters. 1 Therefore, it is not possible to encode text as Base64 directly. Conceptually, it is always a two-step procedure:
- convert the string to bytes (using some character encoding)
- encode the bytes as Base64
You can use basically any character encoding (also called character set 2 or encoding scheme) that you want; it just needs to be able to represent all the necessary characters, and it has to be the same for both directions (text to Base64 and Base64 to text). Since there are many different character encodings, the protocol or API has to define which one is used. If an API expects a "Base64 encoded string" and does not mention the character encoding, it can currently usually be assumed that UTF-8 is expected. 3
Encoding the bytes from step 1 as Base64 is pretty simple:
a) Take three input bytes to get 24 bits.
b) Split them into four blocks of 6 bits each to get four numbers in the range 0 ... 63.
c) Convert the numbers to ASCII characters using the Base64 table and append these characters to the output.
d) Go to a).
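The steps above can be sketched in a few lines of JavaScript. This is only an illustration, not how any browser implements it: the name bytesToBase64 is made up for this example, and the "=" padding for inputs whose length is not a multiple of three comes from RFC 4648, not from the list above.

```javascript
// Base64 alphabet from RFC 4648: the index of a character is its 6-bit value.
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: encodes an array of byte values (0..255) as Base64.
function bytesToBase64(bytes) {
  let output = "";
  for (let i = 0; i < bytes.length; i += 3) {
    // a) Take up to three input bytes to get 24 bits (missing bytes count as 0).
    const hasSecond = i + 1 < bytes.length;
    const hasThird = i + 2 < bytes.length;
    const bits =
      (bytes[i] << 16) |
      ((hasSecond ? bytes[i + 1] : 0) << 8) |
      (hasThird ? bytes[i + 2] : 0);

    // b) Split the 24 bits into four numbers in the range 0..63.
    const n0 = (bits >> 18) & 63;
    const n1 = (bits >> 12) & 63;
    const n2 = (bits >> 6) & 63;
    const n3 = bits & 63;

    // c) Look the numbers up in the table and append the characters.
    //    Positions that have no input byte behind them become "=" padding.
    output += BASE64_ALPHABET[n0] + BASE64_ALPHABET[n1];
    output += hasSecond ? BASE64_ALPHABET[n2] : "=";
    output += hasThird ? BASE64_ALPHABET[n3] : "=";
    // d) The loop continues with the next three bytes.
  }
  return output;
}

console.log(bytesToBase64([77, 97, 110])); // "TWFu" (the bytes of "Man")
console.log(bytesToBase64([77]));          // "TQ=="
```

Note that the function takes bytes, not text, which is the whole point of the two-step procedure.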
More information on Base64 itself is beyond the scope of this answer.
What does btoa do?
By now, you might be thinking, "This answer cannot be correct. It claims that it is not possible to encode text as Base64 directly, although this is exactly what btoa does: it takes text and spits out Base64."
No. It does not take text and return Base64; it takes an argument of type string and returns Base64. But this string argument does not represent text, it is just a weird way to hold a sequence of bytes. Each byte is represented by a character whose numeric code point value is equal to the byte value. 4
A note in the HTML standard says that the "b" can be considered to stand for "binary" and the "a" for "ASCII". Contrary to popular belief, I do not think btoa is badly named. It does not accept text; it accepts binary data and creates an ASCII string using Base64, so the short form of "binary to ASCII" is exactly the right name. It is the type of the argument that is confusing.
The btoa definition in the HTML standard simply states:
[...] the user agent must convert this argument to a sequence of octets, whose nth octet is the eight-bit representation of the code point of the nth character of the argument, and then must apply the base64 algorithm to this sequence of octets and return the result.
I don't know, and probably never will, why they did not choose another argument type, for example an array of numbers. Maybe performance was not so good at the time when btoa was first specified?
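To make the "string as a container of bytes" idea concrete, here is a small sketch. The helper binaryStringToBytes is hypothetical and only mirrors the rule quoted above (the nth byte is the code point of the nth character); it is not part of any standard API.

```javascript
// Hypothetical helper: reads a "binary string" the way btoa does,
// one byte per character, so that bytes[i] == str.charCodeAt(i).
function binaryStringToBytes(binaryString) {
  const bytes = [];
  for (let i = 0; i < binaryString.length; i++) {
    const code = binaryString.charCodeAt(i);
    if (code > 255) {
      // A character above 0xFF cannot be one byte; btoa throws in this case too.
      throw new RangeError("Character at index " + i + " does not fit into one byte");
    }
    bytes.push(code);
  }
  return bytes;
}

// "\xE2\x82\xAC" is not text here: it holds the three UTF-8 bytes of "€".
const euroAsBinaryString = "\xE2\x82\xAC";
console.log(binaryStringToBytes(euroAsBinaryString)); // [ 226, 130, 172 ]
console.log(btoa(euroAsBinaryString));                // "4oKs"

// Passing the actual text fails, because "€" is code point 0x20AC.
try {
  btoa("€");
} catch (error) {
  console.log("btoa rejected the text:", error.name);
}
```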
What does unescape(encodeURIComponent(str)) do?
By now, you might be thinking, "If the first step in converting text to Base64 is encoding the text as bytes, then how does btoa(unescape(encodeURIComponent(str))) achieve this? btoa does not do it, and neither unescape nor encodeURIComponent seems to have anything to do with character encoding."
In fact, encodeURIComponent does have something to do with character encoding. The standard reads:
The encodeURIComponent function computes a new [...] URI in which each instance of certain code points is replaced [...] with escape sequences representing the UTF-8 encoding of the code point.
So now we have percent-encoded UTF-8 bytes. To convert the percent-encoded bytes to a binary string suitable for btoa, you can use unescape, whose behavior description reads, among other things:
- If c is the code unit 0x0025 (PERCENT SIGN), then
  - [... how to decode %uXXXX ...]
  - Otherwise, if k ≤ length - 3 and [... two hexadecimal digits follow ...], then
    - Set c to a code unit whose value is the integer represented by [...] the two hexadecimal digits at indices k + 1 and k + 2 within string.
Therefore, after encodeURIComponent has stored the UTF-8 bytes as %XX, unescape turns each of them into a single code point, exactly as btoa requires. Thus, overall, btoa(unescape(encodeURIComponent(str))) encodes the text as UTF-8 bytes, which are then encoded as Base64.
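The chain can be followed one step at a time. In this sketch the function names textToBase64 and base64ToText are made up for illustration; the bodies use only the built-in functions discussed here. The reverse direction, decodeURIComponent(escape(atob(base64))), is not derived in this answer and is shown as the usual counterpart.

```javascript
// Hypothetical wrapper: text -> UTF-8 bytes -> Base64.
function textToBase64(text) {
  const percentEncoded = encodeURIComponent(text); // "€" becomes "%E2%82%AC"
  const binaryString = unescape(percentEncoded);   // "%E2%82%AC" becomes "\xE2\x82\xAC"
  return btoa(binaryString);                       // three bytes become four Base64 characters
}

// Hypothetical wrapper for the way back: Base64 -> UTF-8 bytes -> text.
function base64ToText(base64) {
  const binaryString = atob(base64);           // one character per byte
  const percentEncoded = escape(binaryString); // bytes above 0x7F become %XX again
  return decodeURIComponent(percentEncoded);   // %XX sequences are read as UTF-8
}

console.log(textToBase64("€"));    // "4oKs"
console.log(base64ToText("4oKs")); // "€"
```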
Back to the original question
In case you forgot, the question was:
(1) Why did the originally proposed solution include calls to escape() and unescape()? The solution was proposed prior to the deprecation, but presumably these functions added some kind of value at the time.
(2) Are there certain edge cases where removing these deprecated calls would cause the wrapper functions to fail?
Without unescape you do not get a Base64 representation of the UTF-8 encoded string. btoa(encodeURIComponent(str)) encodes the text as some strange bytes (not a standardized Unicode encoding scheme, but the bytes you get by storing the URI-encoded string as ASCII), which are then encoded as Base64. Thus, unescape is necessary to comply with the standard. OK, encodeURIComponent and ASCII are also standardized, but no one expects this strange combination.
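A quick comparison shows which bytes actually end up in the Base64 output in each case. The variable names are only for this illustration.

```javascript
const sampleText = "€";

// Base64 of the UTF-8 bytes E2 82 AC.
const utf8Base64 = btoa(unescape(encodeURIComponent(sampleText)));

// Base64 of the ASCII bytes of the nine characters "%E2%82%AC".
const uriBase64 = btoa(encodeURIComponent(sampleText));

console.log(utf8Base64);              // "4oKs"
console.log(atob(utf8Base64).length); // 3 (the three UTF-8 bytes)
console.log(atob(uriBase64));         // "%E2%82%AC" (URI-encoded text, not UTF-8 bytes)
```

A receiver that decodes the second result as Base64 and then as UTF-8 gets the literal text "%E2%82%AC" instead of "€".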
If you only convert to Base64 and back yourself, then yes, you can use btoa(encodeURIComponent(str)), and it will never throw an error, as explained in humanityANDpeace's detailed answer (question (2) has been answered well enough, I think).
But in this case, you would be much better off just using the result of encodeURIComponent directly. It is already pure ASCII and is always shorter than btoa(encodeURIComponent(str)). If you want a smaller size than encodeURIComponent(str), you can use btoa(unescape(encodeURIComponent(str))), which is smaller if the input string contains many non-ASCII characters.
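The size claims are easy to check. The helper encodedSizes below is hypothetical and simply measures the three candidates for a given input.

```javascript
// Hypothetical helper: output lengths of the three encodings discussed above.
function encodedSizes(text) {
  const uriEncoded = encodeURIComponent(text);
  return {
    uriOnly: uriEncoded.length,                      // encodeURIComponent(str)
    base64OfUri: btoa(uriEncoded).length,            // btoa(encodeURIComponent(str))
    base64OfUtf8: btoa(unescape(uriEncoded)).length, // btoa(unescape(encodeURIComponent(str)))
  };
}

console.log(encodedSizes("abc€😀")); // { uriOnly: 24, base64OfUri: 32, base64OfUtf8: 16 }
console.log(encodedSizes("abc"));    // { uriOnly: 3, base64OfUri: 4, base64OfUtf8: 4 }
```

For the mixed input, Base64 of the UTF-8 bytes is the shortest; for the pure ASCII input, plain encodeURIComponent wins, and Base64 of the URI-encoded string is never the shortest.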
If you convert to Base64 because some other party, API, or protocol expects Base64, then you simply cannot use btoa(encodeURIComponent(str)), because no one will understand the result.
Oh, and btoa(unescape(encodeURIComponent(str))) cannot have been "proposed prior to the deprecation" of unescape:
unescape was removed from the ECMAScript standard in edition 3, the same edition in which encodeURIComponent was added. unescape was still described in the document, but it was moved to Annex B.2, whose introduction says that it "suggests uniform semantics [...] without making the properties or their semantics part of this standard." But since browsers have to stay backward compatible, it probably will not be removed any time soon.
Try it yourself:
function run() {
  // Build the encode function from the text in the "algorithm" field.
  let Base64Function = new Function("str", $("#algorithm").val());
  let base64 = Base64Function($("#input").val());
  $("#Base64Text").text("Output: " + base64);
  // Assemble a data URI so the browser decodes the result itself.
  let charset = $('#charset').val();
  let uri = "data:text/plain"
    + (charset ? ";charset=" + charset : '')
    + ($("#interpret").prop('checked') ? ";base64" : '')
    + "," + base64;
  $("#dataURI").text(uri);
  $("#dataURI").attr('href', uri);
  $("#Base64iframe").attr('src', uri);
}

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<label for="input">Text to encode:</label>
<input type="text" id="input" value="abc€😀"/><br />
<label for="algorithm">Encode function:</label>
<input type="text" id="algorithm" size="50"/><br />
<button type="button" onclick="run();">Run</button>
Defaults:
<button type="button" onclick='
  $("#algorithm").val("return btoa(unescape(encodeURIComponent(str)))");
  $("#charset").val("UTF-8");
  $("#interpret").prop("checked",true);
'>UTF-8 Base64</button>
<button type="button" onclick='
  $("#algorithm").val("return btoa(encodeURIComponent(str))");
  $("#charset").val(""); // unknown: it is not UTF-8
  $("#interpret").prop("checked",true);
'>wrong</button>
<button type="button" onclick='
  $("#algorithm").val("return encodeURIComponent(str)");
  $("#charset").val("UTF-8");
  $("#interpret").prop("checked",false);
'>without btoa (not Base64)</button>
<br />
<div id="Base64Text">Output:</div>
<label for="charset">Interpret as this character encoding:</label>
<input type="text" id="charset" /><br />
<label for="interpret">Interpret as Base64:</label>
<input type="checkbox" id="interpret" /><br />
<div><a id="dataURI"></a></div>
<iframe id="Base64iframe"></iframe>
This snippet tests the Base64 result by building a data URI, but the concept applies to other uses of Base64 as well.
Note:
In some quotations I use [...] and [... some text ...] to leave out or shorten things that are unimportant in my opinion.
The bracketed text is obviously not part of the source.
Footnotes:
1 The standard states that Base64 is "designed to represent arbitrary sequences of octets" (an octet is a byte of eight bits).
2 A character set is not exactly the same thing as a character encoding. However, a coded character set can always be considered to implicitly define a character encoding, so "character set" and "character encoding" are often used as synonyms. Maybe they were once the same thing? Sometimes the term charset is explicitly used as a short term for character encoding rather than for character set.
3 At least on websites, UTF-8 dominates. See also UTF-8 Everywhere.
4 This is actually the ISO_8859-1 encoding, but I would not think of it that way. Better think bytes[i]==str.charCodeAt(i).