Can regular expressions find duplicate characters?

Question

Can regular expressions find duplicate characters?

My users insert sequences like

________________________ ************************ ------------------------ ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥

for formatting documents (do not ask me about my users!). And it looks bad when displaying fragments . How to remove repeats of any characters? I can add separate filters, but it will be a constant game for cats and mice.

Can a regular expression filter these?

+6

regex

aitchnyu Oct 12 '11 at 6:00

source share

3 answers

You can remove repetitions of any character with a simple regular expression, e.g. (.)\1+

However, this will also catch legal uses, for example, words that have doubled letters in their spelling (balloon, spelling, well, etc.).

Thus, you probably want to limit the expression to some forbidden characters, in the end, keeping it as general as possible so that you don't have to change it from time to time, as your users find new characters to use.
One possible solution would be to prohibit repeated characters other than letters and not numbers:

([^A-Za-z0-9])\1+

But even this is not the final solution for all cases, as some of your users may actually use actual letter sequences as separators:

 ZZZZZZZZZZZZZZZZZZZZZZ BBBBBBBBBBBBBBBBBBBBBB ZZZZZZZZZZZZZZZZZZZZZZ

To prevent this, and with the added benefit of allowing the legitimate use of certain repeated non-letter characters (for example, in ellipsis: ...), you can limit the repetition of characters to a maximum of 3 using a regular expression with the syntax (<pattern>)\1{min, max} as follows: (.)\1{4,} to match offensive character sequences with a minimum length of 4 and an undefined maximum.

+5

luvieere Oct 12 '11 at 6:02

source share

In python (but the logic is the same regardless of language):

 >>> import re >>> text = ''' ... This is some text ... ________________________ ... This some more ... ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥ ... Truly the last line ... ''' >>> print re.sub(r'[_♥]{2,}', '', text) #this is the core (regexp) This is some text This some more Truly the last line

This has the advantage that you have some control over what to replace and what not (for example, you can not substitute . , As this may be part of a comment like This is still to do...

EDIT:

If your repetitions are always "strings", you can add newline characters to your expression:

 text = ''' This is some text ________________________ This some more ♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥♥ Truly the last line But this is not to be changed: ♥♥♥ ''' >>> print re.sub(r'\n[_♥]{2,}\n', '\n', text) This is some text This some more Truly the last line But this is not to be changed: ♥♥♥

NTN

+1

mac Oct 12 '11 at 6:10

source share

Simon buchan · Accepted Answer · 2011-10-12T06:04:31+0000

Try something like:

 (.)\1{5,}

Matches any character, and then 5 or more characters. Remember to avoid \ if your language uses strings for regex patterns!

Can regular expressions find duplicate characters?

More articles: