Remove gibberish, but keep real characters

Question

I am being bombarded with spam messages like the one shown below. What is the best and most efficient way to remove all the gibberish from something like this:

<textarea id="comment">ȑ̉̽ͧ̔͆ͦ̊͛̿͗҉̷̢̧̫̗̗͎͈͕e̷̪͓̼̼̣̻̻͙͔̳̘̗͙̬̱͎ͭ̃͗ͩͯͥͬ̂ͧ͐͌̑̅͢͜ͅd̴̦̺̖̣͎̲̥͕̗̺̯̤͗ͬ͌ͧ̓͒ͭ́̋ͩͥ͊̇̓̌ͫ̃́́͠</textarea>

I am open to a RegEx solution, but what exactly causes this, and how would it be expressed as a RegExp? The text arrives in the <textarea>; after extracting the value, I would like to strip all of this gibberish from it and show only the real characters, which in this case should be "red".

Other types of Unicode characters are allowed, but not characters that stack on top of each other.

+4

javascript html

Shaz Apr 10 '11 at 3:07

2 answers

Removing combining diacritics will make entering some languages (such as Vietnamese) difficult or impossible, so you should reconsider.

+3

Ignacio Vazquez-Abrams Apr 10 '11 at 3:08
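
To make the concern in this comment concrete, here is a minimal sketch. It assumes a filter that strips only the basic combining diacritical marks block (U+0300 to U+036F); the names `MARKS`, `composed`, and `decomposed` are made up for illustration. The same Vietnamese word survives or loses its tone marks depending on whether the input arrives in precomposed or decomposed form.

```javascript
// Assumed filter: strips the basic combining diacritical marks block only.
const MARKS = /[\u0300-\u036f]/g;

// "Việt" with a single precomposed letter (U+1EC7).
const composed = "Vi\u1ec7t";

// The same word decomposed: "e" + U+0323 (dot below) + U+0302 (circumflex).
const decomposed = composed.normalize("NFD");

// The decomposed form loses both marks and becomes plain ASCII.
console.log(decomposed.replace(MARKS, "")); // "Viet"

// The precomposed form contains no combining marks, so it is untouched.
console.log(composed.replace(MARKS, "") === composed); // true
```

Whether this matters depends on which form the browser or input method actually submits, which is not something the page can assume.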

Ken rockot · Accepted Answer · 2011-04-10T03:13:19+0000

Zalgo is waiting outside the wall.

You want to filter out combining characters, such as the combining diacritical marks listed here.

You should be able to get away with a simple character-class pattern, i.e.:

fooString.replace(/[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/g, "");
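
As a runnable sketch of the same idea: the helper name `stripCombiningMarks` and the sample string are assumptions for illustration, not part of the answer. Two details are easy to miss. The pattern needs the `g` flag, otherwise only the first mark is removed, and `replace` returns a new string rather than modifying the original.

```javascript
// Combining-mark ranges taken from the answer (not exhaustive).
const COMBINING_MARKS =
  /[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/g;

// Hypothetical helper: returns a copy of `text` with every matched mark removed.
function stripCombiningMarks(text) {
  return text.replace(COMBINING_MARKS, "");
}

// "r", "e" and "d", each followed by a few stacked combining marks.
const noisy = "r\u0311\u0309\u033d" + "e\u0337\u032a" + "d\u0334\u0326";

console.log(stripCombiningMarks(noisy)); // "red"
```

In the page from the question, the input would presumably come from `document.getElementById("comment").value`.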

If you want to limit the content to one combining mark per character (not that this really mitigates all the negative side effects), you could simply use

fooString.replace(/([\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f])[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]*/g, "$1");
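
A sketch of the "one mark per character" variant follows. `MARK_CLASS`, `MARK_RUN`, and `keepFirstMark` are hypothetical names; the pattern is built from a string only to avoid repeating the character class. Each run of marks is collapsed to its first member, which is not necessarily the mark the writer intended to keep.

```javascript
// Character class source, same ranges as in the answer.
const MARK_CLASS =
  "[\\u0300-\\u036f\\u0483-\\u0489\\u1dc0-\\u1dff\\u20d0-\\u20ff\\ufe20-\\ufe2f]";

// One captured mark followed by any number of further marks, matched globally.
const MARK_RUN = new RegExp("(" + MARK_CLASS + ")" + MARK_CLASS + "*", "g");

// Hypothetical helper: keeps only the first mark of every run.
function keepFirstMark(text) {
  return text.replace(MARK_RUN, "$1");
}

// "e" with an acute accent plus two extra stacked marks.
const stacked = "e\u0301\u0337\u032a";

console.log(keepFirstMark(stacked) === "e\u0301"); // true
```

Note that decomposed Vietnamese text can legitimately carry two marks on one letter, so even this milder filter can alter real input.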

EDIT: Added a number of other combining character ranges. This is most likely still not exhaustive.
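
On the point about exhaustiveness: engines that support Unicode property escapes (ES2018 and later, so newer than this answer) can match the whole Unicode "Mark" category instead of hand-listed ranges. This is a sketch under that assumption, with hypothetical names `ANY_MARK` and `stripAllMarks`, not what the answer proposes.

```javascript
// \p{M} matches any code point whose general category is Mark (Mn, Mc, Me).
// The "u" flag is required for property escapes.
const ANY_MARK = /\p{M}/gu;

// Hypothetical helper: removes every combining mark the engine knows about.
function stripAllMarks(text) {
  return text.replace(ANY_MARK, "");
}

// U+1AB0 is a combining mark that lies outside the ranges listed in the answer.
const sample = "r\u0311e\u0337d\u0334\u1ab0";

console.log(stripAllMarks(sample)); // "red"
```

This is broader in both directions: it catches marks the listed ranges miss, and it also strips marks that many scripts require, so the earlier caveat about legitimate languages applies even more strongly.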
