Python replaces Unicode characters

I wrote a program to read in the Windows DNS debug log, but inside I always got some funny characters in the domain field.

The following is one example:

(13) \ xc2 \ xb5 \ xc2 \ xb1 \ xc2 \ xbe \ xc3 \ xa2p \ xc3 \ xb4 \ xc2 \ x8d (5) example (3) com (0) '

I want to replace everything \x..with?

I explicitly type \ xc2 as follows

line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
re.sub('\\\xc2', '?', line)
result: '(13)?\xb5?\xb1?\xbe\xc3\xa2p\xc3\xb4?\x8d(5)example(3)com(0)'

But it does not work if I write as follows:

re.sub('\\\x..', '?', line)

How can I write a regex to replace them?

+4
source share
2 answers

There are more effective tools for this task than regular expression, you can try, for example:

>>> line
'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
>>> line.decode('ascii', 'ignore')
u'(13)p(5)example(3)com(0)'

, ascii. "?" :

>>> print line.decode('ascii', 'replace')
(13)        p    (5)example(3)com(0)

, / , mojibake , , .

emojibake . , , .

+2

?

line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

pattern = r'\\x.+'
re.sub(pattern, r'?', line)
-2

Source: https://habr.com/ru/post/1656111/


All Articles