Remove garbage, but keep real characters

I'm getting bombarded with spam messages like the one shown below. What is the best and most efficient way to remove all the garbage from something like this:

<textarea id="comment">ȑ̉̽ͧ̔͆ͦ̊͛̿͗҉̷̢̧̫̗̗͎͈͕e̷̪͓̼̼̣̻̻͙͔̳̘̗͙̬̱͎ͭ̃͗ͩͯͥͬ̂ͧ͐͌̑̅͢͜ͅd̴̦̺̖̣͎̲̥͕̗̺̯̤͗ͬ͌ͧ̓͒ͭ́̋ͩͥ͊̇̓̌ͫ̃́́͠</textarea>

A regex is fine, but what exactly causes these things, and how would it be expressed as a RegExp? The problem is in the <textarea>: after extracting its value, I would like to remove all the garbage and keep only the real characters, which in this case should be red.

Other types of Unicode characters are allowed, but not characters that stack on top of each other.

+4
2 answers

Zalgo is waiting outside the wall.

You want to filter out combining characters, such as combining diacritical marks.
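To see what such a string is actually made of, it can help to list its code points. The sketch below is only an illustration: `describeCodePoints` is a hypothetical helper name, and it assumes an engine with Unicode property escapes (ES2018+), treating the general categories Mn and Me as "combining".

```javascript
// Hypothetical helper: list each code point and whether it is a combining mark.
// Assumes ES2018+ Unicode property escapes (\p{...} with the u flag).
function describeCodePoints(text) {
  return Array.from(text, (ch) => ({
    codePoint:
      "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    // Mn = nonspacing mark, Me = enclosing mark
    combining: /^[\p{Mn}\p{Me}]$/u.test(ch),
  }));
}

// One base letter followed by two stacked marks.
console.log(describeCodePoints("e\u0337\u032a"));
```

Each visible "letter" in the spam is one base character followed by a long run of such marks, which the renderer stacks above and below it.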

You should be able to get away with a simple character-class match, e.g.:

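// The g flag removes every match, not just the first; replace() returns a new string.
fooString = fooString.replace(/[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/g, "");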

If you want to limit the content to one combining mark per character (not that this really mitigates all the negative side effects), you could simply use

// Keep the first combining mark of each run and drop the rest.
fooString = fooString.replace(/([\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f])[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]*/g, "$1");

EDIT: Added a number of other combining character ranges. This is most likely still not exhaustive.
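If the target browsers support Unicode property escapes (ES2018+), a possible alternative to listing ranges by hand is to match the general categories Mn (nonspacing mark) and Me (enclosing mark). This is a sketch under that assumption; `stripCombiningMarks` and its `decompose` parameter are hypothetical names. The optional NFD step also strips accents from precomposed letters, which may be more than you want.

```javascript
// Assumes ES2018+ Unicode property escapes.
// Mn = nonspacing mark, Me = enclosing mark.
const COMBINING_MARKS = /[\p{Mn}\p{Me}]/gu;

// Hypothetical helper: remove all combining marks from a string.
// With decompose = true, precomposed letters are split first (NFD),
// so their accents are removed as well.
function stripCombiningMarks(text, decompose = false) {
  const input = decompose ? text.normalize("NFD") : text;
  return input.replace(COMBINING_MARKS, "");
}

// Base letters with a few stacked marks each.
console.log(stripCombiningMarks("r\u0309\u0489e\u0337d\u0334")); // red
```

In the question's setting, the input would be the value read from the textarea with id "comment".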

+4

Removing combining diacritics will make typing some languages (such as Vietnamese) difficult or impossible, so you should reconsider.
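One possible compromise, sketched here as an assumption rather than a proven policy: normalize to NFC first, so that legitimate accented letters become single precomposed code points where Unicode has them, then cap the number of combining marks left in each run. `limitCombiningMarks` and the default cap of 2 are hypothetical choices, and scripts that rely on longer mark sequences would still be affected.

```javascript
// Hypothetical helper: keep at most maxMarks combining marks per run.
// Assumes ES2018+ Unicode property escapes and String.prototype.normalize.
function limitCombiningMarks(text, maxMarks = 2) {
  if (!Number.isInteger(maxMarks) || maxMarks < 1) {
    throw new RangeError("maxMarks must be a positive integer");
  }
  const mark = "[\\p{Mn}\\p{Me}]"; // nonspacing and enclosing marks
  // Capture the first maxMarks marks of a run, then match the excess.
  const run = new RegExp(`(${mark}{1,${maxMarks}})${mark}*`, "gu");
  // NFC composes base letter + accents into one code point when possible.
  return text.normalize("NFC").replace(run, "$1");
}

// A decomposed Vietnamese letter survives as its precomposed form.
console.log(limitCombiningMarks("Vie\u0323\u0302t") === "Vi\u1ec7t"); // true

// Ten stacked overlays are cut down to two.
console.log(limitCombiningMarks("x" + "\u0336".repeat(10)).length); // 3
```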

+3

Source: https://habr.com/ru/post/1347408/

