Python re.sub: ignore backreferences in replacement string

Question

I want to replace a pattern with a string. The string is given in a variable. It may of course contain "\1", and that must not be interpreted as a backreference but inserted literally as \1.

How can I achieve this?

+4

python python-3.x regex

max Dec 9 '11 at 6:43

2 answers

Prompted by the comments, I thought about this again and experimented. That helped my understanding of escaping a lot, so I have rewritten my answer almost completely in the hope that it is useful to later readers.

NullUserException gave you only the short version; I will try to explain it in a bit more detail. Thanks to the critical reviews by Qtax and Duncan, this answer is now, I hope, correct and useful.

A backslash has a special meaning: it is the escape character in strings. A backslash and the character after it form an escape sequence that is translated into something else when the string is processed. The first such processing step is already the creation of the string from its literal. Therefore, if you want to use \ literally, you need to escape it. The escape character is itself a backslash, so a literal backslash is written as \\.

So, let's start with a few examples to better understand what is going on. I also print the ASCII codes of the characters in the string, which I hope makes it clearer what is happening.

    s = "A\1\nB"
    print(s)
    print([x for x in s])
    print([hex(ord(x)) for x in s])

prints

    A
    B
    ['A', '\x01', '\n', 'B']
    ['0x41', '0x1', '0xa', '0x42']

So although I typed \ and 1 in the source code, s does not contain these two characters; it contains the ASCII character 0x01, "Start of Heading". The same goes for \n, which is translated to the character 0x0a.

Since this behavior is not always wanted, raw strings can be used, in which escape sequences are not processed.

    s = r"A\1\nB"
    print(s)
    print([x for x in s])
    print([hex(ord(x)) for x in s])

I just added r before the string, and now the result is

    A\1\nB
    ['A', '\\', '1', '\\', 'n', 'B']
    ['0x41', '0x5c', '0x31', '0x5c', '0x6e', '0x42']

All characters are printed exactly as I typed them.
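A quick way to confirm the difference between the two literals is to compare their lengths. This is only a small sketch; the variable names normal and raw are chosen here for illustration.

```python
normal = "A\1\nB"    # escape sequences are processed: 4 characters
raw = r"A\1\nB"      # escape sequences are kept as typed: 6 characters

print(len(normal), len(raw))

# Doubling the backslashes in a normal literal yields the same string as the raw literal.
print(raw == "A\\1\\nB")
```

The raw prefix only changes how the literal is read; once created, both are ordinary str objects.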

That is the situation so far. Now for the next step.

There may be a situation where a string has to be passed to a regular expression and matched literally, so every character that has a special meaning in a regular expression (for example + * $ [ .) must be escaped. There is a special function, re.escape, that does this job.
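As a small illustration of that use of re.escape, the sketch below searches for a string containing regex metacharacters, first unescaped and then escaped. The strings needle and text are made up for this example.

```python
import re

needle = "1+1=2 [sic]"                        # hypothetical text to find literally
text = "He wrote 1+1=2 [sic] on the board."

# Unescaped, "+" is a quantifier and "[sic]" is a character class,
# so the pattern does not match the literal text.
print(re.search(needle, text))

# Escaped, every special character is matched literally.
match = re.search(re.escape(needle), text)
print(match.group(0))
```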

But for this question it is the wrong function, because the string is not going to be used in the regular expression but as the replacement string for re.sub.

So, a new situation:

A raw string that includes escape sequences is to be used as the replacement string for re.sub. re.sub also processes escape sequences, with a small but important difference from the string-literal processing described earlier: \n is still translated to the character 0x0a, but the meaning of \1 has changed. It is now replaced by the contents of capture group 1 of the regular expression passed to re.sub.

    import re

    s = r"A\1\nB"
    print(re.sub(r"(Replace)", s, "1 Replace 2"))

And the result

    1 AReplace
    B 2

\1 has been replaced by the content of the capture group, and \n by the line feed character.

The important point is that you need to understand this behavior. In my opinion you now have two options (and I will not judge which one is correct):

The author of the replacement string is not aware of the escaping behavior, and if they type \n they want a newline. In that case, escape only a \ that is followed by a digit.

    OnlyDigits = re.sub(r"(Replace)", re.sub(r"(\\)(?=\d)", r"\\\\", s), "1 Replace 2")
    print(OnlyDigits)
    print([x for x in OnlyDigits])
    print([hex(ord(x)) for x in OnlyDigits])

Output:

    1 A\1
    B 2
    ['1', ' ', 'A', '\\', '1', '\n', 'B', ' ', '2']
    ['0x31', '0x20', '0x41', '0x5c', '0x31', '0xa', '0x42', '0x20', '0x32']

The author knows exactly what they are doing, and if they had wanted a newline they would have typed \0xa. In that case, escape every backslash.

    All = re.sub(r"(Replace)", re.sub(r"(\\)", r"\\\\", s), "1 Replace 2")
    print(All)
    print([x for x in All])
    print([hex(ord(x)) for x in All])

Output:

    1 A\1\nB 2
    ['1', ' ', 'A', '\\', '1', '\\', 'n', 'B', ' ', '2']
    ['0x31', '0x20', '0x41', '0x5c', '0x31', '0x5c', '0x6e', '0x42', '0x20', '0x32']
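The two options can be wrapped in small helper functions so the intent is visible at the call site. This is only a sketch: the function names escape_digit_backrefs and escape_all_backslashes are made up here, and the regular expressions are the ones used in the two snippets of this answer.

```python
import re

def escape_digit_backrefs(repl):
    # Option 1: escape only a backslash that is followed by a digit,
    # so \1 stays literal while \n still becomes a newline.
    return re.sub(r"\\(?=\d)", r"\\\\", repl)

def escape_all_backslashes(repl):
    # Option 2: escape every backslash, so the whole string is inserted literally.
    return re.sub(r"\\", r"\\\\", repl)

s = r"A\1\nB"
only_digits = re.sub(r"(Replace)", escape_digit_backrefs(s), "1 Replace 2")
everything = re.sub(r"(Replace)", escape_all_backslashes(s), "1 Replace 2")

print(repr(only_digits))
print(repr(everything))
```

Note that the first helper covers only the \1 style of backreference. A replacement string containing \g<1> would still be expanded, so the second helper is the safer choice when the string must be inserted unchanged.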

+5

stema Dec 9 '11 at 6:58

Qtax · Accepted Answer · 2011-12-09T08:25:10+0000

The previous answer using re.escape() would escape too much, and you would get unwanted backslashes in the replacement and in the resulting string.

It seems that in Python only the backslash needs to be escaped in a replacement string, so something like this could be sufficient:

 replacement = replacement.replace("\\", "\\\\")

An example:

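    import re

    x = r'hai! \1 <ops> $1 \' \x \\'

    print("want to see: ")
    print(x)
    print("getting: ")
    # The output shown below is from the Python current when this was posted.
    # On Python 3.7+ the next call raises re.error (bad escape \x),
    # and re.escape() escapes fewer characters.
    print(re.sub(".(.).", x, "###"))
    print("over escaped: ")
    print(re.sub(".(.).", re.escape(x), "###"))
    print("could work: ")
    print(re.sub(".(.).", x.replace("\\", "\\\\"), "###"))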

Output:

    want to see:
    hai! \1 <ops> $1 \' \x \\
    getting:
    hai! # <ops> $1 \' \x \
    over escaped:
    hai\!\ \1\ \<ops\>\ \$1\ \\'\ \x\ \\
    could work:
    hai! \1 <ops> $1 \' \x \\
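The backslash-doubling approach can be packaged as a small helper and checked against the same test string. This is a sketch written for Python 3; the name sub_literal is made up here. The second check uses a different technique that is not part of this answer: re.sub also accepts a callable as the replacement, and the string the callable returns is inserted without template processing.

```python
import re

def sub_literal(pattern, replacement, text):
    # Hypothetical helper: double every backslash so that re.sub
    # inserts the replacement string unchanged.
    return re.sub(pattern, replacement.replace("\\", "\\\\"), text)

x = r"hai! \1 <ops> $1 \' \x \\"

# The escaped replacement comes out exactly as it went in.
result = sub_literal(".(.).", x, "###")
print(result == x)

# Alternative (an assumption about what fits the question, not taken from the
# answer): a callable replacement is used as returned, so no escaping is needed.
result_callable = re.sub(".(.).", lambda m: x, "###")
print(result_callable == x)
```

Both calls leave \1 and \x untouched, so neither triggers backreference expansion or the bad-escape error that newer Python versions raise for unknown escapes.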
