A regular expression that finds and replaces non-ascii characters with Python

I need to change some non-ASCII characters to '_'. For instance,

Tannh ‰ user -> Tannh_user
  • If I use regex with Python, how can I do this?
  • Is there a better way to do this without using RE?
+3
source share
5 answers

How to do this using the built-in method str.decode:

>>> 'Tannh‰user'.decode('ascii', 'replace').replace(u'\ufffd', '_')
u'Tannh___user'

(You get a string unicode, so convert it to if necessary str).

unicode str, , ASCII, ASCII. , unicode.encode replace -ASCII '?', , ; . -.

, ord() , ASCII (0-127) - unicode str utf-8,

>>> s = u'Tannh‰user'
>>> "".join((c if ord(c) < 128 else '_' for c in s))
u'Tannh_user'
+5
re.sub(r'[^\x00-\x7F]', '_', theString)

, theString unicode ​​, ASCII 0 0x7F (-1, UTF-8 ..).

+7

Using Python support for character encoding:

# coding: utf8
import codecs

def underscorereplace_errors(exc):
  return (u'_', exc.end)

codecs.register_error('underscorereplace', underscorereplace_errors)

print u'Tannh‰user'.encode('ascii', 'underscorereplace')
+5
source

I would just call ordfor each character in the string, 1 to 1. If ord([char]) >= 128, the character is not an ascii character and needs to be replaced.

+2
source

if you know which characters you want to replace, you can apply string methods

mystring.replace('oldchar', 'newchar')
+1
source

Source: https://habr.com/ru/post/1743834/


All Articles