Match as many repetitions of a character as repetitions of a captured group

Question

I would like to clean up some text that was typed on a keyboard, using Python and regex, especially where the backspace key was used to correct a mistake.

Example 1:

[in]: 'Helloo<BckSp> world' [out]: 'Hello world'

This can be done using

 re.sub(r'.<BckSp>', '', 'Helloo<BckSp> world')

Example 2:
However, when there are several backspaces, I don't know how to remove exactly the same number of characters before them:

 [in]: 'Helllo<BckSp><BckSp>o world' [out]: 'Hello world'

(Here I want to remove the "l" and the "o" in front of the two backspaces.)

I could just apply re.sub(r'[^>]<BckSp>', '', line) repeatedly until no <BckSp> is left, but I would like to find a more elegant / faster solution.

Does anyone know how to do this?

+5

python regex backreference

Louis M Dec 27 '16 at 10:27

5 answers

Fallenhero · Answer 1 · 2016-12-27T10:41:08+0000

Python's built-in re module doesn't seem to support recursive patterns. If you can use a different language, you can try the following:

 .(?R)?<BckSp>

See: https://regex101.com/r/OirPNn/1

Casimir et Hippolyte · Answer 2 · 2016-12-27T10:44:22+0000

This is not very efficient, but you can do it with the re module:

 (?:[^<](?=[^<]*((?=(\1?))\2<BckSp>)))+\1

demo

This way you do not need to count anything; the pattern relies only on repetition.

 (?:
     [^<]                 # a character to remove
     (?=                  # lookahead to reach the corresponding <BckSp>
         [^<]*            # skip characters until the first <BckSp>
         (                # capture group 1: contains the <BckSp>s
             (?=(\1?))\2  # emulate an atomic group in place of \1?+
                          # The idea is to add the <BckSp>s already matched in the
                          # previous repetitions, if any, to be sure that the following
                          # <BckSp> isn't already associated with a character
             <BckSp>      # corresponding <BckSp>
         )
     )
 )+                       # each time the group is repeated, capture group 1 grows by one <BckSp>
 \1                       # matches all the consecutive <BckSp>s and ensures that there is no other
                          # character between the last character to remove and the first <BckSp>

You can do the same with the regex module, but this time you do not need to emulate the possessive quantifier:

 (?:[^<](?=[^<]*(\1?+<BckSp>)))+\1

demo

But with the regex module you can also use recursion (as @Fallenhero noted):

 [^<](?R)?<BckSp>

demonstration

Wiktor stribiżew · Answer 3 · 2016-12-27T10:39:54+0000

Since Python's re has no support for recursion / subroutine calls, and (before Python 3.11) no atomic groups / possessive quantifiers, you can remove the characters that are followed by backspaces in a loop:

 import re
 s = "Helllo\b\bo world"
 r = re.compile("^\b+|[^\b]\b")
 while r.search(s):
     s = r.sub("", s)
 print(s)

See Python Demo

The pattern "^\b+|[^\b]\b" matches one or more backspace characters at the start of the string (the ^\b+ part), while [^\b]\b matches every non-overlapping occurrence of any character other than a backspace followed by a backspace.
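To see why the loop is needed, here is a small sketch that records the string after each pass. The helper name `backspace_passes` is hypothetical and not part of the answer; the pattern is the same one, deliberately written as a normal (non-raw) string so that "\b" is the backspace character rather than a word boundary.

```python
import re

# Non-raw string on purpose: "\b" here is the backspace character (0x08).
PATTERN = re.compile("^\b+|[^\b]\b")

def backspace_passes(s):
    """Return the input followed by the result of each substitution pass."""
    steps = [s]
    while PATTERN.search(s):
        s = PATTERN.sub("", s)
        steps.append(s)
    return steps

for step in backspace_passes("Helllo\b\bo world"):
    print(repr(step))
```

Each pass removes only one character per run of backspaces, so a run of two consecutive backspaces takes two passes to clear.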

The same approach works if the backspace is expressed as some entity / tag, such as the literal <BckSp>:

 import re
 s = "Helllo<BckSp><BckSp>o world"
 r = re.compile("^(?:<BckSp>)+|.<BckSp>", flags=re.S)
 while r.search(s):
     s = r.sub("", s)
 print(s)

See another Python demo

niemmi · Answer 4 · 2016-12-27T10:55:25+0000

If the marker is a single character, you can simply use a stack, which gives you the result in one pass:

 s = "Helllo\b\bo world"
 res = []
 for c in s:
     if c == '\b':
         if res:
             del res[-1]
     else:
         res.append(c)
 print(''.join(res))  # Hello world

If the marker is literally '<BckSp>' or some other string longer than one character, you can use replace to substitute '\b' for it and then apply the solution above. This only works if you know that '\b' does not occur in the input. If you cannot designate a substitute character, you can use split and process the results:

 s = 'Helllo<BckSp><BckSp>o world'
 res = []
 for part in s.split('<BckSp>'):
     if res:
         del res[-1]
     res.extend(part)
 print(''.join(res))  # Hello world
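The split idea can also be packaged as a reusable function. The sketch below is one possible variant, not the code from the answer: it uses re.split with a capturing group so that the markers stay in the resulting list and can be handled exactly like the single-character case. The name `apply_backspaces` and its `marker` parameter are made up for illustration.

```python
import re

def apply_backspaces(text, marker="<BckSp>"):
    """Replay `text`, treating each occurrence of `marker` as one backspace."""
    out = []
    # A capturing group makes re.split keep the markers in the result list.
    for piece in re.split("(" + re.escape(marker) + ")", text):
        if piece == marker:
            if out:            # a backspace on empty output does nothing
                out.pop()
        else:
            out.extend(piece)  # push the characters one by one
    return "".join(out)

print(apply_backspaces("Helllo<BckSp><BckSp>o world"))
```

Because the marker is escaped with re.escape, a marker containing regex metacharacters should be matched literally.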

anubhava · Answer 5 · 2016-12-27T11:01:21+0000

A bit verbose, but you can use this lambda to count the number of <BckSp> occurrences and slice the preceding text accordingly to get the final result.

 >>> import re
 >>> bk = '<BckSp>'
 >>> s = 'Helllo<BckSp><BckSp>o world'
 >>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
 Hello world
 >>> s = 'Helloo<BckSp> world'
 >>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
 Hello world
 >>> s = 'Helloo<BckSp> worl<BckSp>d'
 >>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
 Hello word
 >>> s = 'Helllo<BckSp><BckSp>o world<BckSp><BckSp>k'
 >>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
 Hello work
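The same substitution can be written with a named callback, which makes the arithmetic easier to follow. This is only a sketch for Python 3; the names `remove_backspaces` and `trim` are hypothetical. Unlike the one-liner, it clamps the slice end at zero, because a run of backspaces longer than the text before it would otherwise produce a negative slice index.

```python
import re

BK = "<BckSp>"
# Group 1: text before a run of markers; group 2: the run of markers itself.
PATTERN = re.compile(r"(.*?)((?:" + re.escape(BK) + r")+)")

def remove_backspaces(s):
    def trim(match):
        text = match.group(1)
        count = len(match.group(2)) // len(BK)   # number of markers in the run
        return text[:max(len(text) - count, 0)]  # drop that many trailing chars
    return PATTERN.sub(trim, s)

print(remove_backspaces("Helllo<BckSp><BckSp>o world<BckSp><BckSp>k"))
```

One caveat worth knowing: each stretch of text is trimmed independently, so backspaces never reach past an earlier run of markers. For 'ab<BckSp>c<BckSp><BckSp>d' this sketch returns 'ad', whereas replaying the keystrokes one by one would leave 'd'.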
