Remove non-ASCII characters from pandas column

I have been working on this problem for a bit. I am trying to remove the non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. Here's what my data frame looks like:

 +----------------------------------------------------------+-------+
 | DB_user source                                            | count |
 +----------------------------------------------------------+-------+
 | ??? / "Ò | Z?)?] ?? C% ?? JA                               |    10 |
 | ? D $ ZGU; @D ?? _ ??? T (?) B                             |     3 |
 | ? Q`H ?? M '? Y ?? KTK $? Ù ‹??? Ð © JL4 ?? *? _ ??  C     |     2 |
 +----------------------------------------------------------+-------+

I used this function, which I came across while researching the problem on SO:

    def filter_func(string):
        for i in range(0, len(string)):
            if ord(string[i]) < 32 or ord(string[i]) > 126:
                break
        return ''

And then I applied it with:

    df['DB_user'] = df.apply(filter_func, axis=1)

I keep getting the error:

    'ord() expected a character, but string of length 66 found', u'occurred at index 2'

However, I thought the loop in filter_func dealt with this by passing one character at a time to ord(), so that the moment it hit a non-ASCII character, it would be replaced by a space.
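To debug, I printed what the function actually receives, using a tiny made-up frame (not my real data), and it seems to get whole rows rather than single characters:

    import pandas as pd

    # toy frame with made-up values, just to see what apply passes in
    toy = pd.DataFrame({'DB_user': ['abc\xe9def', 'xyz'], 'count': [10, 3]})

    # with axis=1 each call receives a whole row (a pandas Series),
    # so indexing into it gives entire strings, not single characters
    toy.apply(lambda row: print(type(row), list(row)), axis=1)
    # <class 'pandas.core.series.Series'> ['abcédef', 10]
    # <class 'pandas.core.series.Series'> ['xyz', 3]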

Can anyone help me out?

Thanks!

+9

5 answers

The mistake is that you are not applying it to each character; you are applying it to each whole string, and ord errors out since it expects a single character. You need:

  df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x])) 

You can also simplify the join using a chained comparison:

  ''.join([i if 32 < ord(i) < 126 else " " for i in x]) 

You can also use string.printable to filter characters:

    from string import printable

    st = set(printable)
    df["DB_user"] = df["DB_user"].apply(
        lambda x: ''.join([" " if i not in st else i for i in x]))
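Keep in mind that string.printable includes the whitespace control characters ' \t\n\r\x0b\x0c', so those pass through this filter; a quick check on a made-up string:

    from string import printable

    st = set(printable)
    sample = 'ok\tdone\x0b\x80'  # made-up sample
    print(repr(''.join([" " if i not in st else i for i in sample])))
    # prints 'ok\tdone\x0b '  -- the tab and \x0b survive; only \x80 becomes a space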

The fastest way is to translate:

    from string import maketrans  # Python 2 only; Python 3 uses str.maketrans instead

    del_chars = " ".join(chr(i) for i in range(32) + range(127, 256))
    trans = maketrans(del_chars, " " * len(del_chars))
    df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))

Interestingly, this is faster than:

  df['DB_user'] = df["DB_user"].str.translate(trans) 
+5

You can try the following:

 df.DB_user.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True) 
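For example, with a small made-up Series (written here with a plain assignment instead of inplace):

    import pandas as pd

    df = pd.DataFrame({'DB_user': ['Ò|zz', 'Déjà vu', 'plain']})  # made-up values
    df['DB_user'] = df['DB_user'].replace({r'[^\x00-\x7F]+': ''}, regex=True)
    print(df['DB_user'].tolist())  # ['|zz', 'Dj vu', 'plain']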
+19

The usual trick is to encode the column as ASCII with errors="ignore" and then decode it back from ASCII:

 df['DB_user'].str.encode('ascii', 'ignore').str.decode('ascii') 

On Python 3.x and above, this is my recommended solution.


Minimal Code Example

    import pandas as pd

    s = pd.Series(['Déjà vu', 'Ò|zz', ';test 123'])
    s
    0      Déjà vu
    1         Ò|zz
    2    ;test 123
    dtype: object

    s.str.encode('ascii', 'ignore').str.decode('ascii')
    0        Dj vu
    1          |zz
    2    ;test 123
    dtype: object

PS: This can also be extended to cases where you need to filter out characters that fall outside some other character encoding (not just ASCII).
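For instance, to keep only characters that Latin-1 can represent (the codec is chosen arbitrarily for illustration), the same encode/ignore/decode round trip applies:

    import pandas as pd

    s = pd.Series(['Déjà vu', 'Œuvre: 100€'])  # made-up sample
    # anything latin-1 cannot encode is silently dropped
    print(s.str.encode('latin-1', 'ignore').str.decode('latin-1').tolist())
    # ['Déjà vu', 'uvre: 100']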

+1

A few of the answers given here are incorrect. Simple check:

    import pandas as pd
    import string

    s = pd.Series([chr(x) for x in range(256)])
    s.loc[0]
    >> '\x00'

    s.replace({r'[^\x00-\x7F]+': ''}, regex=True).loc[0]
    >> '\x00'  # FAIL

    s.str.encode('ascii', 'ignore').str.decode('ascii').loc[0]
    >> '\x00'  # FAIL

    s.apply(lambda x: ''.join([i if 32 < ord(i) < 126 else " " for i in x])).loc[0]
    >> ' '  # Success!

    s.apply(lambda x: ''.join([" " if i not in string.printable else i for i in x])).loc[0]
    >> ' '  # Looks good, but...

    s.apply(lambda x: ''.join([" " if i not in string.printable else i for i in x])).loc[11]
    >> '\x0b'  # FAIL

    del_chars = " ".join([chr(i) for i in list(range(32)) + list(range(127, 256))])
    trans = str.maketrans(del_chars, " " * len(del_chars))
    s.apply(lambda x: x.translate(trans)).loc[11]
    >> ' '  # Success!

Conclusion: only the options in the accepted answer (from Padraic Cunningham) work reliably. The translate option needed some corrections (applied here, for Python 3), but otherwise it should be the fastest.

0

This worked for me:

    import re

    def replace_foreign_characters(s):
        return re.sub(r'[^\x00-\x7f]', r'', s)

    df['column_name'] = df['column_name'].apply(lambda x: replace_foreign_characters(x))
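If you prefer to avoid apply, the same substitution can be written with the vectorized string accessor (assuming the column holds strings):

    df['column_name'] = df['column_name'].str.replace(r'[^\x00-\x7f]', '', regex=True)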
-1

Source: https://habr.com/ru/post/1246178/

