How to remove unprintable Unicode characters in multilingual input?
When users with different locales enter strings, they sometimes inadvertently include non-printable characters. For instance:
var weird = "%E2%80%AA%E2%80%8ETest%E2%80%AC"
var displaysAs = decodeURI(weird); // Users see only "Test"
But I can't figure out how to remove the non-printable characters without affecting text in other languages, such as:
encodeURI("شنط");      // "%D8%B4%D9%86%D8%B7"
encodeURI("戦艦帝国"); // "%E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD"
For example, the following attempt to clean up the string above does not work, because \s matches only whitespace and the invisible characters are still there after re-encoding:
var weird = "%E2%80%AA%E2%80%8ETest%E2%80%AC";
var displaysAs = decodeURI(weird);
var stillWeird = encodeURI(displaysAs.replace(/\s/g, ""));
console.log('before:', weird);
console.log('after:', displaysAs);
console.log('again:', stillWeird);
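One possible direction, offered as a sketch rather than a definitive fix: the invisible characters in the example decode to U+202A, U+200E and U+202C, which are bidirectional formatting controls, not whitespace. Assuming those are the only characters that need to go, a narrow character class can strip them while leaving Arabic, Japanese and other scripts untouched. The helper name `stripBidiControls` and the exact character list are my own assumptions, not something from the original code.

```javascript
// Hypothetical helper: removes only Unicode bidirectional control characters.
// U+061C        Arabic letter mark
// U+200E-U+200F left-to-right / right-to-left marks
// U+202A-U+202E embeddings, overrides and pop directional formatting
// U+2066-U+2069 directional isolates
const BIDI_CONTROLS = /[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/g;

function stripBidiControls(text) {
  return text.replace(BIDI_CONTROLS, "");
}

const weird = "%E2%80%AA%E2%80%8ETest%E2%80%AC";
const cleaned = stripBidiControls(decodeURI(weird));

console.log(encodeURI(cleaned));                       // "Test"
console.log(encodeURI(stripBidiControls("شنط")));      // "%D8%B4%D9%86%D8%B7"
console.log(encodeURI(stripBidiControls("戦艦帝国"))); // "%E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD"
```

A broader alternative is `/\p{Cf}/gu`, which matches every character in the Unicode "Format" category. That also removes the zero-width joiner and non-joiner (U+200D, U+200C), which some scripts and emoji sequences rely on, so the narrow list above may be the safer default for multilingual input.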