Python - replace non-ascii character in string (»)

Question

I need to replace the "»" character in a string with a space, but I keep getting an error. This is the code I'm using:

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    # other code
    soup = BeautifulSoup(data, 'lxml')
    mystring = soup.find('a').text.replace(' »','')

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 13: ordinal not in range(128)

But if I check it with this other script:

    # -*- coding: utf-8 -*-
    a = "hi »"
    b = a.replace('»','')

It works. Why is this?

+6

python string regex encoding decoding

Hyperion Nov 29 '16 at 17:30

2 answers

@Moinuddin Quadri's answer is better suited to your use case, but in general a simple way to remove non-ASCII characters from a given string is to do the following:

    # the characters '¡' and '¢' are non-ASCII
    string = "hello, my name is ¢arl... ¡Hola!"
    all_ascii = ''.join(char for char in string if ord(char) < 128)

This gives:

    >>> print(all_ascii)
    hello, my name is arl... Hola!

You can also do this:

    ''.join(filter(lambda c: ord(c) < 128, string))

But this is about 30% slower than the generator-expression (char for char in string) approach above.
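The speed difference is easy to check on your own machine. The following is only a rough sketch, assuming Python 3; the helper names `with_generator` and `with_filter` are made up for illustration, and the measured ratio will vary with the interpreter version and the input.

```python
import timeit

# Sample input repeated to make the timing measurable (arbitrary size).
text = "hello, my name is ¢arl... ¡Hola!" * 100


def with_generator(s):
    """Keep only characters whose code point is below 128 (generator expression)."""
    return ''.join(ch for ch in s if ord(ch) < 128)


def with_filter(s):
    """Same result, using filter() with a lambda."""
    return ''.join(filter(lambda c: ord(c) < 128, s))


# Time each approach over the same number of runs.
gen_time = timeit.timeit(lambda: with_generator(text), number=50)
filter_time = timeit.timeit(lambda: with_filter(text), number=50)

print("generator: %.4fs, filter: %.4fs" % (gen_time, filter_time))
```

Both functions return the same string, so the choice between them is about readability and speed only.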

+2

blacksite Nov 29 '16 at 17:42

Moinuddin Quadri · Accepted Answer · 2016-11-29T17:37:16+0000

To replace part of a string with the str.replace() method in Python 2, you need to decode the byte string to Unicode first, then replace the text, and finally encode the result back to the original encoding:

    >>> a = "hi »"
    >>> a.decode('utf-8').replace("»".decode('utf-8'), "").encode('utf-8')
    'hi '
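For comparison, here is a minimal sketch of the same idea assuming Python 3, where `str` is already Unicode and the decode/encode round trip is only needed when the data arrives as bytes. The variable names `raw` and `cleaned` are illustrative, not taken from the question.

```python
# Python 3: str is Unicode, so replace() works on it directly.
a = "hi »"
b = a.replace("»", "")

# If the data arrives as UTF-8 bytes (for example from a file or socket),
# decode first, replace on the text, then encode back.
raw = "hi »".encode("utf-8")
cleaned = raw.decode("utf-8").replace("»", "").encode("utf-8")

print(repr(b))        # the text with the character removed
print(repr(cleaned))  # the same content as UTF-8 bytes
```

The key point is that the replacement happens on text, and the byte form is only used at the input and output boundaries.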

You can also use the following regular expression to remove all non-ascii characters from a string:

    >>> import re
    >>> re.sub(r'[^\x00-\x7f]',r'', 'hi »')
    'hi '
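Since the question asks for a space rather than nothing, the same pattern can be wrapped in a small helper. This is a sketch assuming Python 3; `replace_non_ascii` and `NON_ASCII` are hypothetical names, and the last lines show `str.encode` with `errors='ignore'` as an alternative way to drop the characters.

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7f]')


def replace_non_ascii(text, replacement=' '):
    """Replace every non-ASCII character in text with replacement."""
    return NON_ASCII.sub(replacement, text)


spaced = replace_non_ascii('hi »')        # '»' becomes a space
removed = replace_non_ascii('hi »', '')   # '»' is dropped

# Alternative without re: encode to ASCII, skipping what cannot be encoded.
stripped = 'hi »'.encode('ascii', 'ignore').decode('ascii')

print(repr(spaced), repr(removed), repr(stripped))
```

Compiling the pattern once is a small convenience when the helper is called many times; for a single call, `re.sub` as shown in the answer is enough.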
