A regular expression that finds and replaces non-ascii characters with Python

Question

I need to change some non-ASCII characters to '_'. For instance,

Tannh‰user -> Tannh_user

If I use regex with Python, how can I do this?
Is there a better way to do this without using RE?

+3

python regex

prosseek May 03, '10 at 14:54

5 answers

re.sub(r'[^\x00-\x7F]', '_', theString)

This works if theString is unicode, or a str in an encoding where ASCII occupies the values 0 to 0x7F (latin-1, UTF-8, etc.).
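As a minimal sketch of the same pattern, assuming Python 3 (where str is always Unicode), the substitution can be wrapped in a small helper. The names NON_ASCII and replace_non_ascii are made up for illustration. The second call shows what happens on UTF-8 encoded bytes: each byte is replaced separately, so one character can yield several underscores.

```python
import re

# Matches any single character outside the ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r'[^\x00-\x7F]')
# Same pattern for bytes objects: matches any single byte above 0x7F.
NON_ASCII_BYTES = re.compile(rb'[^\x00-\x7F]')


def replace_non_ascii(text, replacement='_'):
    """Return text with every non-ASCII character replaced."""
    return NON_ASCII.sub(replacement, text)


print(replace_non_ascii('Tannh‰user'))  # Tannh_user

# On UTF-8 bytes, the three bytes that encode '‰' are replaced one by one.
print(NON_ASCII_BYTES.sub(b'_', 'Tannh‰user'.encode('utf-8')))  # b'Tannh___user'
```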

+7

interjay May 03, '10 at 15:06

Using Python's support for character encodings:

# coding: utf8
import codecs

def underscorereplace_errors(exc):
  return (u'_', exc.end)

codecs.register_error('underscorereplace', underscorereplace_errors)

print u'Tannh‰user'.encode('ascii', 'underscorereplace')
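The snippet above is Python 2. A rough Python 3 equivalent might look like the sketch below; it is an adaptation, not the answerer's code. One deliberate difference: it emits one underscore per character in the failing range (exc.start to exc.end) rather than a single underscore for the whole range.

```python
import codecs


def underscore_replace(exc):
    # This handler only makes sense for encoding errors; re-raise anything else.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # One underscore for each unencodable character in exc.start:exc.end,
    # then resume encoding at exc.end.
    return ('_' * (exc.end - exc.start), exc.end)


codecs.register_error('underscorereplace', underscore_replace)

result = 'Tannh‰user'.encode('ascii', 'underscorereplace')
print(result)  # b'Tannh_user'
```

The result is a bytes object; call result.decode('ascii') if a str is needed.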

+5

Ignacio Vazquez-Abrams May 03 '10 at 15:16

I would just call ord for each character in the string, one by one. If ord(char) >= 128, the character is not an ASCII character and needs to be replaced.

+2

Brian May 03, '10 at 15:13

If you know which characters you want to replace, you can use string methods:

mystring.replace('oldchar', 'newchar')
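When several known characters need replacing, the same idea can be driven by a lookup table. This is only a sketch, assuming Python 3; the mapping KNOWN_REPLACEMENTS and the helper replace_known are hypothetical names, and the entries are example values.

```python
# Hypothetical table of known characters and their ASCII replacements.
KNOWN_REPLACEMENTS = {'‰': '_', 'é': 'e'}


def replace_known(text, mapping=KNOWN_REPLACEMENTS):
    """Apply str.replace once for each known character in the mapping."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


print(replace_known('Tannh‰user'))  # Tannh_user
```

Characters that are not in the table are left untouched, so this only suits input whose non-ASCII characters are known in advance.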

+1

joaquin May 03, '10 at 15:05

Messa · Accepted Answer · 2010-05-03T15:03:44+0000

Here is how to do this with the built-in str.decode method:

>>> 'Tannh‰user'.decode('ascii', 'replace').replace(u'\ufffd', '_')
u'Tannh___user'

(You get a unicode string, so convert it to str if necessary.)
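The three underscores appear because the decode step works on bytes, and '‰' takes three bytes in UTF-8. A small sketch, assuming Python 3 (where the decode method lives on bytes rather than str), makes this visible:

```python
text = 'Tannh‰user'
raw = text.encode('utf-8')

# '‰' (U+2030) occupies three bytes in UTF-8, so raw is two bytes longer.
print(len(text), len(raw))  # 10 12

# Each non-ASCII byte becomes one U+FFFD replacement character.
decoded = raw.decode('ascii', 'replace')
result = decoded.replace('\ufffd', '_')
print(result)  # Tannh___user
```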

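You can also convert unicode to str so that each non-ASCII character is replaced by a single ASCII one. The problem is that unicode.encode with 'replace' turns non-ASCII characters into '?', so you cannot tell whether a question mark was already there; see the answer from Ignacio Vazquez-Abrams for a way around this.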
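The ambiguity of the '?' replacement can be seen in a short sketch, assuming Python 3 (where encode returns bytes). The sample strings are made up for illustration.

```python
with_symbol = 'Really‰?'
with_question_marks = 'Really??'

# The built-in 'replace' handler substitutes '?' for each unencodable character.
encoded_symbol = with_symbol.encode('ascii', 'replace')
encoded_plain = with_question_marks.encode('ascii', 'replace')

print(encoded_symbol)  # b'Really??'
print(encoded_plain)   # b'Really??'

# Two different inputs are indistinguishable after encoding.
print(encoded_symbol == encoded_plain)  # True
```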

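Another way is to use ord() and check whether each character's value falls within the ASCII range (0-127). This works for unicode strings and for str in ASCII-compatible encodings such as UTF-8: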

>>> s = u'Tannh‰user'
>>> "".join((c if ord(c) < 128 else '_' for c in s))
u'Tannh_user'
