Remove non-ASCII characters from pandas column

I have been working on this problem for a bit. I am trying to remove the non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. Here's what my data frame looks like:

 +----------------------------------------------------------+-------+
 | DB_user source                                            | count |
 +----------------------------------------------------------+-------+
 | ??? / "Ò | Z?)?] ?? C% ?? JA                               |    10 |
 | ? D $ ZGU; @D ?? _ ??? T (?) B                             |     3 |
 | ? Q`H ?? M '? Y ?? KTK $? Ù ‹??? Ð © JL4 ?? *? _ ??  C     |     2 |
 +----------------------------------------------------------+-------+

I used this function, which I came across while researching the problem on SO:

    def filter_func(string):
        for i in range(0, len(string)):
            if ord(string[i]) < 32 or ord(string[i]) > 126:
                break
        return ''

And then I applied it with:

    df['DB_user'] = df.apply(filter_func, axis=1)

I keep getting the error:

    'ord() expected a character, but string of length 66 found', u'occurred at index 2'

However, I thought the loop in filter_func dealt with this by passing one character at a time to ord(), so that the moment it hit a non-ASCII character, it would be replaced by a space.
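To debug, I printed what the function actually receives, using a tiny made-up frame (not my real data), and it seems to get whole rows rather than single characters:

    import pandas as pd

    # toy frame with made-up values, just to see what apply passes in
    toy = pd.DataFrame({'DB_user': ['abc\xe9def', 'xyz'], 'count': [10, 3]})

    # with axis=1 each call receives a whole row (a pandas Series),
    # so indexing into it gives entire strings, not single characters
    toy.apply(lambda row: print(type(row), list(row)), axis=1)
    # <class 'pandas.core.series.Series'> ['abcédef', 10]
    # <class 'pandas.core.series.Series'> ['xyz', 3]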

Can anyone help me out?

Thanks!

+9

5 answers

The mistake is that you are not applying it to each character; you are applying it to each whole string, and ord errors out since it expects a single character. You need:

  df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x])) 

You can also simplify the join using a chained comparison:

  ''.join([i if 32 < ord(i) < 126 else " " for i in x]) 

You can also use string.printable to filter characters:

    from string import printable

    st = set(printable)
    df["DB_user"] = df["DB_user"].apply(
        lambda x: ''.join([" " if i not in st else i for i in x]))
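Keep in mind that string.printable includes the whitespace control characters ' \t\n\r\x0b\x0c', so those pass through this filter; a quick check on a made-up string:

    from string import printable

    st = set(printable)
    sample = 'ok\tdone\x0b\x80'  # made-up sample
    print(repr(''.join([" " if i not in st else i for i in sample])))
    # prints 'ok\tdone\x0b '  -- the tab and \x0b survive; only \x80 becomes a space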

The fastest way is to translate:

    from string import maketrans  # Python 2 only; Python 3 uses str.maketrans instead

    del_chars = " ".join(chr(i) for i in range(32) + range(127, 256))
    trans = maketrans(del_chars, " " * len(del_chars))
    df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))

Interestingly, this is faster than:

  df['DB_user'] = df["DB_user"].str.translate(trans) 
+5

You can try the following:

 df.DB_user.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True) 
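For example, with a small made-up Series (written here with a plain assignment instead of inplace):

    import pandas as pd

    df = pd.DataFrame({'DB_user': ['Ò|zz', 'Déjà vu', 'plain']})  # made-up values
    df['DB_user'] = df['DB_user'].replace({r'[^\x00-\x7F]+': ''}, regex=True)
    print(df['DB_user'].tolist())  # ['|zz', 'Dj vu', 'plain']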
+19

The usual trick is to encode the column as ASCII with errors="ignore" and then decode it back from ASCII:

 df['DB_user'].str.encode('ascii', 'ignore').str.decode('ascii') 

On Python 3.x and above, this is my recommended solution.


Minimal Code Example

    import pandas as pd

    s = pd.Series(['Déjà vu', 'Ò|zz', ';test 123'])
    s
    0      Déjà vu
    1         Ò|zz
    2    ;test 123
    dtype: object

    s.str.encode('ascii', 'ignore').str.decode('ascii')
    0        Dj vu
    1          |zz
    2    ;test 123
    dtype: object

PS: This can also be extended to cases where you need to filter out characters that fall outside some other character encoding (not just ASCII).
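For instance, to keep only characters that Latin-1 can represent (the codec is chosen arbitrarily for illustration), the same encode/ignore/decode round trip applies:

    import pandas as pd

    s = pd.Series(['Déjà vu', 'Œuvre: 100€'])  # made-up sample
    # anything latin-1 cannot encode is silently dropped
    print(s.str.encode('latin-1', 'ignore').str.decode('latin-1').tolist())
    # ['Déjà vu', 'uvre: 100']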

+1

A few of the answers given here are incorrect. Simple check:

    import pandas as pd
    import string

    s = pd.Series([chr(x) for x in range(256)])
    s.loc[0]
    >> '\x00'

    s.replace({r'[^\x00-\x7F]+': ''}, regex=True).loc[0]
    >> '\x00'  # FAIL

    s.str.encode('ascii', 'ignore').str.decode('ascii').loc[0]
    >> '\x00'  # FAIL

    s.apply(lambda x: ''.join([i if 32 < ord(i) < 126 else " " for i in x])).loc[0]
    >> ' '  # Success!

    s.apply(lambda x: ''.join([" " if i not in string.printable else i for i in x])).loc[0]
    >> ' '  # Looks good, but...

    s.apply(lambda x: ''.join([" " if i not in string.printable else i for i in x])).loc[11]
    >> '\x0b'  # FAIL

    del_chars = " ".join([chr(i) for i in list(range(32)) + list(range(127, 256))])
    trans = str.maketrans(del_chars, " " * len(del_chars))
    s.apply(lambda x: x.translate(trans)).loc[11]
    >> ' '  # Success!

Conclusion: only the options in the accepted answer (from Padraic Cunningham) work reliably. The translate option needed some corrections (applied here, for Python 3), but otherwise it should be the fastest.

0

This worked for me:

    import re

    def replace_foreign_characters(s):
        return re.sub(r'[^\x00-\x7f]', r'', s)

    df['column_name'] = df['column_name'].apply(lambda x: replace_foreign_characters(x))
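If you prefer to avoid apply, the same substitution can be written with the vectorized string accessor (assuming the column holds strings):

    df['column_name'] = df['column_name'].str.replace(r'[^\x00-\x7f]', '', regex=True)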
-1

Source: https://habr.com/ru/post/1246178/

