Using javascript to decode base64 decodes utf-8 strings incorrectly

I am using the Javascript window.atob() function to decode a base64 encoded string (specifically base64 encoded content from the GitHub API). The problem is that I am returning ASCII characters (for example, ⢠instead of ). How can I properly handle an incoming base64 encoded stream so that it is decoded as utf-8?

+67
javascript encoding utf-8
May 7 '15 at 16:12
source share
8 answers

There is an excellent article on Mozilla MDN docs that describes exactly this problem:

"Unicode Problem" Since DOMString is 16-bit strings, in most browsers calling window.btoa for a Unicode string will cause a Character Out Of Range exception if the character exceeds the 8-bit byte range (0x00 ~ 0xFF). There are two possible ways to solve this problem:

  • the first is to escape the entire string (with UTF-8, see encodeURIComponent ) and then encode it;
  • the second is to convert the DOMString UTF-16 to an UTF-8 character array and then encode it.

A note to previous solutions: the MDN article originally suggested using unescape and escape to solve the exception problems for Character Out Of Range , but they have been deprecated since then. Some other answers here suggest decodeURIComponent around this with decodeURIComponent and encodeURIComponent , which turned out to be unreliable and unpredictable. The latest update to this answer uses modern JavaScript features to improve code speed and modernization.

If you are trying to save time, you can also use the library:

UTF8 encoding ⇢ base64

 function b64EncodeUnicode(str) { // first we use encodeURIComponent to get percent-encoded UTF-8, // then we convert the percent encodings into raw bytes which // can be fed into btoa. return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function toSolidBytes(match, p1) { return String.fromCharCode('0x' + p1); })); } b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU=" b64EncodeUnicode('\n'); // "Cg==" 

Base64 decoding ⇢ UTF8

 function b64DecodeUnicode(str) { // Going backwards: from bytestream, to percent-encoding, to original string. return decodeURIComponent(atob(str).split('').map(function(c) { return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2); }).join('')); } b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode" b64DecodeUnicode('Cg=='); // "\n" 



Solution until 2018 (functional and although probably better supports older browsers, but not updated)

Here is the current recommendation, right from MDN, with some additional TypeScript compatibility via @ MA-Maddin:

 // Encoding UTF8 ⇢ base64 function b64EncodeUnicode(str) { return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) { return String.fromCharCode(parseInt(p1, 16)) })) } b64EncodeUnicode('✓ à la mode') // "4pyTIMOgIGxhIG1vZGU=" b64EncodeUnicode('\n') // "Cg==" // Decoding base64 ⇢ UTF8 function b64DecodeUnicode(str) { return decodeURIComponent(Array.prototype.map.call(atob(str), function(c) { return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2) }).join('')) } b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU=') // "✓ à la mode" b64DecodeUnicode('Cg==') // "\n" 



Original solution (deprecated)

They use escape and unescape (which are now deprecated, although this works in all modern browsers):

 function utf8_to_b64( str ) { return window.btoa(unescape(encodeURIComponent( str ))); } function b64_to_utf8( str ) { return decodeURIComponent(escape(window.atob( str ))); } // Usage: utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU=" b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode" 



One last thing: I first encountered this problem when calling the GitHub API. To get this to work properly on (Mobile) Safari, I actually had to remove all the empty space from the base64 source before I could even decode the source. Whether this will be relevant in 2017, I do not know:

 function b64_to_utf8( str ) { str = str.replace(/\s/g, ''); return decodeURIComponent(escape(window.atob( str ))); } 
+175
May 7 '15 at 16:16
source share

Things are changing. Escape / unescape methods are deprecated.

You can encode a URI string before encoding it with Base64. Note that this does not produce Base64 UTF8 encoding, but rather Base64 encoded data is URL encoded. Both parties must agree on the same encoding.

See a working example here: http://codepen.io/anon/pen/PZgbPW

 // encode string var base64 = window.btoa(encodeURIComponent('€ 你好 æøåÆØÅ')); // decode string var str = decodeURIComponent(window.atob(tmp)); // str is now === '€ 你好 æøåÆØÅ' 

To solve the OP problem, a third-party library such as js-base64 should solve the problem.

+14
Feb 17 '16 at 7:03
source share

If you process strings as more bytes, you can use the following functions

 function u_atob(ascii) { return Uint8Array.from(atob(ascii), c => c.charCodeAt(0)); } function u_btoa(buffer) { var binary = []; var bytes = new Uint8Array(buffer); for (var i = 0, il = bytes.byteLength; i < il; i++) { binary.push(String.fromCharCode(bytes[i])); } return btoa(binary.join('')); } // example, it works also with astral plane characters such as '𝒞' var encodedString = new TextEncoder().encode('✓'); var base64String = u_btoa(encodedString); console.log('✓' === new TextDecoder().decode(u_atob(base64String))) 
+9
Apr 07 '17 at 6:28
source share

I would suggest that you might want a solution that produces the widely used base64 URI. Please visit data:text/plain;charset=utf-8;base64,4pi44pi54pi64pi74pi84pi+4pi/ to see a demo (copy the data uri, open a new tab, paste the data URI in the address bar, then press enter to go to the page). Despite the fact that this URI is encoded in base64, the browser can still recognize the upper code points and decode them correctly. The minimum encoder + decoder is 1058 bytes (+ Gzip → 589 bytes)

 !function(e){"use strict";function h(b){var a=b.charCodeAt(0);if(55296<=a&&56319>=a)if(b=b.charCodeAt(1),b===b&&56320<=b&&57343>=b){if(a=1024*(a-55296)+b-56320+65536,65535<a)return d(240|a>>>18,128|a>>>12&63,128|a>>>6&63,128|a&63)}else return d(239,191,189);return 127>=a?inputString:2047>=a?d(192|a>>>6,128|a&63):d(224|a>>>12,128|a>>>6&63,128|a&63)}function k(b){var a=b.charCodeAt(0)<<24,f=l(~a),c=0,e=b.length,g="";if(5>f&&e>=f){a=a<<f>>>24+f;for(c=1;c<f;++c)a=a<<6|b.charCodeAt(c)&63;65535>=a?g+=d(a):1114111>=a?(a-=65536,g+=d((a>>10)+55296,(a&1023)+56320)):c=0}for(;c<e;++c)g+="\ufffd";return g}var m=Math.log,n=Math.LN2,l=Math.clz32||function(b){return 31-m(b>>>0)/n|0},d=String.fromCharCode,p=atob,q=btoa;e.btoaUTF8=function(b,a){return q((a?"\u00ef\u00bb\u00bf":"")+b.replace(/[\x80-\uD7ff\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]?/g,h))};e.atobUTF8=function(b,a){a||"\u00ef\u00bb\u00bf"!==b.substring(0,3)||(b=b.substring(3));return p(b).replace(/[\xc0-\xff][\x80-\xbf]*/g,k)}}(""+void 0==typeof global?""+void 0==typeof self?this:self:global) 

Below is the source code used to generate it.

 var fromCharCode = String.fromCharCode; var btoaUTF8 = (function(btoa, replacer){"use strict"; return function(inputString, BOMit){ return btoa((BOMit ? "\xEF\xBB\xBF" : "") + inputString.replace( /[\x80-\uD7ff\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]?/g, replacer )); } })(btoa, function(nonAsciiChars){"use strict"; // make the UTF string into a binary UTF-8 encoded string var point = nonAsciiChars.charCodeAt(0); if (point >= 0xD800 && point <= 0xDBFF) { var nextcode = nonAsciiChars.charCodeAt(1); if (nextcode !== nextcode) // NaN because string is 1 code point long return fromCharCode(0xef/*11101111*/, 0xbf/*10111111*/, 0xbd/*10111101*/); // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae if (nextcode >= 0xDC00 && nextcode <= 0xDFFF) { point = (point - 0xD800) * 0x400 + nextcode - 0xDC00 + 0x10000; if (point > 0xffff) return fromCharCode( (0x1e/*0b11110*/<<3) | (point>>>18), (0x2/*0b10*/<<6) | ((point>>>12)&0x3f/*0b00111111*/), (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/), (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/) ); } else return fromCharCode(0xef, 0xbf, 0xbd); } if (point <= 0x007f) return nonAsciiChars; else if (point <= 0x07ff) { return fromCharCode((0x6<<5)|(point>>>6), (0x2<<6)|(point&0x3f)); } else return fromCharCode( (0xe/*0b1110*/<<4) | (point>>>12), (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/), (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/) ); }); 

Then, to decode the data, base64 either HTTP receives the data as a data URI, or uses the function below.

 var clz32 = Math.clz32 || (function(log, LN2){"use strict"; return function(x) {return 31 - log(x >>> 0) / LN2 | 0}; })(Math.log, Math.LN2); var fromCharCode = String.fromCharCode; var atobUTF8 = (function(atob, replacer){"use strict"; return function(inputString, keepBOM){ inputString = atob(inputString); if (!keepBOM && inputString.substring(0,3) === "\xEF\xBB\xBF") inputString = inputString.substring(3); // eradicate UTF-8 BOM // 0xc0 => 0b11000000; 0xff => 0b11111111; 0xc0-0xff => 0b11xxxxxx // 0x80 => 0b10000000; 0xbf => 0b10111111; 0x80-0xbf => 0b10xxxxxx return inputString.replace(/[\xc0-\xff][\x80-\xbf]*/g, replacer); } })(atob, function(encoded){"use strict"; var codePoint = encoded.charCodeAt(0) << 24; var leadingOnes = clz32(~codePoint); var endPos = 0, stringLen = encoded.length; var result = ""; if (leadingOnes < 5 && stringLen >= leadingOnes) { codePoint = (codePoint<<leadingOnes)>>>(24+leadingOnes); for (endPos = 1; endPos < leadingOnes; ++endPos) codePoint = (codePoint<<6) | (encoded.charCodeAt(endPos)&0x3f/*0b00111111*/); if (codePoint <= 0xFFFF) { // BMP code point result += fromCharCode(codePoint); } else if (codePoint <= 0x10FFFF) { // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae codePoint -= 0x10000; result += fromCharCode( (codePoint >> 10) + 0xD800, // highSurrogate (codePoint & 0x3ff) + 0xDC00 // lowSurrogate ); } else endPos = 0; // to fill it in with INVALIDs } for (; endPos < stringLen; ++endPos) result += "\ufffd"; // replacement character return result; }); 

The advantage of greater standardization is that this encoder and this decoder are more widely used since they can be used as a valid URL that displays correctly. Note.

 (function(window){ "use strict"; var sourceEle = document.getElementById("source"); var urlBarEle = document.getElementById("urlBar"); var mainFrameEle = document.getElementById("mainframe"); var gotoButton = document.getElementById("gotoButton"); var parseInt = window.parseInt; var fromCodePoint = String.fromCodePoint; var parse = JSON.parse; function unescape(str){ return str.replace(/\\u[\da-f]{0,4}|\\x[\da-f]{0,2}|\\u{[^}]*}|\\[bfnrtv"'\\]|\\0[0-7]{1,3}|\\\d{1,3}/g, function(match){ try{ if (match.startsWith("\\u{")) return fromCodePoint(parseInt(match.slice(2,-1),16)); if (match.startsWith("\\u") || match.startsWith("\\x")) return fromCodePoint(parseInt(match.substring(2),16)); if (match.startsWith("\\0") && match.length > 2) return fromCodePoint(parseInt(match.substring(2),8)); if (/^\\\d/.test(match)) return fromCodePoint(+match.slice(1)); }catch(e){return "\ufffd".repeat(match.length)} return parse('"' + match + '"'); }); } function whenChange(){ try{ urlBarEle.value = "data:text/plain;charset=UTF-8;base64," + btoaUTF8(unescape(sourceEle.value), true); } finally{ gotoURL(); } } sourceEle.addEventListener("change",whenChange,{passive:1}); sourceEle.addEventListener("input",whenChange,{passive:1}); // IFrame Setup: function gotoURL(){mainFrameEle.src = urlBarEle.value} gotoButton.addEventListener("click", gotoURL, {passive: 1}); function urlChanged(){urlBarEle.value = mainFrameEle.src} mainFrameEle.addEventListener("load", urlChanged, {passive: 1}); urlBarEle.addEventListener("keypress", function(evt){ if (evt.key === "enter") evt.preventDefault(), urlChanged(); }, {passive: 1}); var fromCharCode = String.fromCharCode; var btoaUTF8 = (function(btoa, replacer){ "use strict"; return function(inputString, BOMit){ return btoa((BOMit?"\xEF\xBB\xBF":"") + inputString.replace( /[\x80-\uD7ff\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]?/g, replacer )); } })(btoa, function(nonAsciiChars){ "use strict"; // make the UTF string into a binary UTF-8 encoded string var point = nonAsciiChars.charCodeAt(0); if (point >= 0xD800 && point <= 0xDBFF) { var nextcode = nonAsciiChars.charCodeAt(1); if (nextcode !== nextcode) { // NaN because string is 1code point long return fromCharCode(0xef/*11101111*/, 0xbf/*10111111*/, 0xbd/*10111101*/); } // https://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae if (nextcode >= 0xDC00 && nextcode <= 0xDFFF) { point = (point - 0xD800) * 0x400 + nextcode - 0xDC00 + 0x10000; if (point > 0xffff) { return fromCharCode( (0x1e/*0b11110*/<<3) | (point>>>18), (0x2/*0b10*/<<6) | ((point>>>12)&0x3f/*0b00111111*/), (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/), (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/) ); } } else { return fromCharCode(0xef, 0xbf, 0xbd); } } if (point <= 0x007f) { return inputString; } else if (point <= 0x07ff) { return fromCharCode((0x6<<5)|(point>>>6), (0x2<<6)|(point&0x3f/*00111111*/)); } else { return fromCharCode( (0xe/*0b1110*/<<4) | (point>>>12), (0x2/*0b10*/<<6) | ((point>>>6)&0x3f/*0b00111111*/), (0x2/*0b10*/<<6) | (point&0x3f/*0b00111111*/) ); } }); setTimeout(whenChange, 0); })(window); 
 img:active{opacity:0.8} 
 <center> <textarea id="source" style="width:66.7vw">Hello \u1234 W\186\0256ld! Enter text into the top box. Then the URL will update automatically. </textarea><br /> <div style="width:66.7vw;display:inline-block;height:calc(25vw + 1em + 6px);border:2px solid;text-align:left;line-height:1em"> <input id="urlBar" style="width:calc(100% - 1em - 13px)" /><img id="gotoButton" src="" style="width:calc(1em + 4px);line-height:1em;vertical-align:-40%;cursor:pointer" /> <iframe id="mainframe" style="width:66.7vw;height:25vw" frameBorder="0"></iframe> </div> </center> + C9zIDHwCGnH5IvZAOAAABmUlEQVQoz7WS25acIBBFkRLkIgKKtOCttbv // xdDmTGZzHv2S63ltuBQQP4rdRiRUP8UK4wh6nVddQwj / NtDQTvac8577zTQb72zj65 / 876qqt7wykU6 / 1U6vFEgjE1mt / 5LRqrpu7oVsn0sjZejMfxR3W / yLikqAFcUx93YxLmZGOtElmEu6Ufd9xV3ZDTGcEvGLbMk0mHHlUSvS5svCwS + hVL8loQQyfpI1Ay8RF / xlNxcsTchGjGDIuBG3Ik7TMyNxn8m0TSnBAK6Z8UZfp3IbAonmJvmsEACum6aNv7B0CnvpezDcNhw9XWsuAr7qnRg6dABmeM4dTgn / DZdXWs3LMspZ1KDMt1kcPJ6S1icWNp2qaEmjq6myx7jbQK3VKItLJaW5FR + cuYlRhYNKzGa9vF4vM5roLW3OSVjkmiGJrPhUq301 / 16pVKZRGFYWjTP50spTxBN5Z4EKnSonruk + n4tUokv1aJSEl / MLZU90S3L6 / U6o0J142iQVp3HcZxKSo8LfkNRCtJaKYFSRX7iaoAAUDty8wvWYR6HJEepdwAAAABJRU5ErkJggg == "style =" width: calc (1em + <center> <textarea id="source" style="width:66.7vw">Hello \u1234 W\186\0256ld! Enter text into the top box. Then the URL will update automatically. </textarea><br /> <div style="width:66.7vw;display:inline-block;height:calc(25vw + 1em + 6px);border:2px solid;text-align:left;line-height:1em"> <input id="urlBar" style="width:calc(100% - 1em - 13px)" /><img id="gotoButton" src="" style="width:calc(1em + 4px);line-height:1em;vertical-align:-40%;cursor:pointer" /> <iframe id="mainframe" style="width:66.7vw;height:25vw" frameBorder="0"></iframe> </div> </center> 

Besides the fact that the above code snippets are very standardized, they are also very fast. Instead of an indirect succession chain, when data needs to be transformed several times between different forms (for example, in Riccardo Galli's answer), the above code snippet is as straightforward as possible. It uses only one simple quick call to String.prototype.replace to process data during encoding and only one to decode data during decoding. Another plus is that (especially for large strings), String.prototype.replace allows the browser to automatically handle basic memory management when changing the size of the string, which leads to a significant increase in performance, especially in evergreen browsers like Chrome and Firefox, which greatly optimize String.prototype.replace . Finally, the highlight is that for you the Latin script excludes users, lines that do not contain code points above 0x7f are processed very quickly because the line remains unchanged by the replacement algorithm.

I created a github repository for this solution at https://github.com/anonyco/BestBase64EncoderDecoder/

+2
Nov 22 '18 at 14:52
source share

Here is the updated 2018 solution described in Mozilla Development Resources.

ENCODE FROM UNICODE TO B64

 function b64EncodeUnicode(str) { // first we use encodeURIComponent to get percent-encoded UTF-8, // then we convert the percent encodings into raw bytes which // can be fed into btoa. return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function toSolidBytes(match, p1) { return String.fromCharCode('0x' + p1); })); } b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU=" b64EncodeUnicode('\n'); // "Cg==" 

DECODE FROM B64 TO UNICODE

 function b64DecodeUnicode(str) { // Going backwards: from bytestream, to percent-encoding, to original string. return decodeURIComponent(atob(str).split('').map(function(c) { return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2); }).join('')); } b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode" b64DecodeUnicode('Cg=='); // "\n" 
+1
04 Oct '18 at 13:02
source share

Minor corrections, unescape and escape are deprecated, therefore:

 function utf8_to_b64( str ) { return window.btoa(decodeURIComponent(encodeURIComponent(str))); } function b64_to_utf8( str ) { return decodeURIComponent(encodeURIComponent(window.atob(str))); } function b64_to_utf8( str ) { str = str.replace(/\s/g, ''); return decodeURIComponent(encodeURIComponent(window.atob(str))); } 
0
Dec 08 '15 at 11:32
source share

Here is some future code for browsers that may lack escape/unescape() . Note that IE 9 and atob/btoa() do not support atob/btoa() , so you will need to use base64 custom functions for them.

 // Polyfill for escape/unescape if( !window.unescape ){ window.unescape = function( s ){ return s.replace( /%([0-9A-F]{2})/g, function( m, p ) { return String.fromCharCode( '0x' + p ); } ); }; } if( !window.escape ){ window.escape = function( s ){ var chr, hex, i = 0, l = s.length, out = ''; for( ; i < l; i ++ ){ chr = s.charAt( i ); if( chr.search( /[A-Za-z0-9\@\*\_\+\-\.\/]/ ) > -1 ){ out += chr; continue; } hex = s.charCodeAt( i ).toString( 16 ); out += '%' + ( hex.length % 2 != 0 ? '0' : '' ) + hex; } return out; }; } // Base64 encoding of UTF-8 strings var utf8ToB64 = function( s ){ return btoa( unescape( encodeURIComponent( s ) ) ); }; var b64ToUtf8 = function( s ){ return decodeURIComponent( escape( atob( s ) ) ); }; 

A more complete example of UTF-8 encoding and decoding can be found here: http://jsfiddle.net/47zwb41o/

0
Jan 18 '17 at 6:46
source share

If you are still experiencing a problem, enable the above solution, try as shown below. Consider the case where escape is not supported for TS.

 blob = new Blob(["\ufeff", csv_content]); // this will make symbols to appears in excel 

for csv_content you can try as below.

 function b64DecodeUnicode(str: any) { return decodeURIComponent(atob(str).split('').map((c: any) => { return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2); }).join('')); } 
0
Aug 08 '18 at 9:47
source share



All Articles