How to read UTF-8 files using Pandas?

I have a UTF-8 file with Twitter data, and I'm trying to read it into a pandas DataFrame, but the columns come back with dtype "object" instead of unicode strings:

    # file 1459966468_324.csv
    # 1459966468_324.csv: UTF-8 Unicode English text

    df = pd.read_csv('1459966468_324.csv', dtype={'text': unicode})
    df.dtypes

    text               object
    Airline            object
    name               object
    retweet_count     float64
    sentiment          object
    tweet_location     object
    dtype: object

What is the correct way to read and force UTF-8 data into unicode using Pandas?

This does not solve the problem:

    df = pd.read_csv('1459966468_324.csv', encoding='utf8')
    df.apply(lambda x: pd.lib.infer_dtype(x.values))

The text file is here: https://raw.githubusercontent.com/l1x/nlp/master/1459966468_324.csv

3 answers

As another poster mentioned, you can try:

 df = pd.read_csv('1459966468_324.csv', encoding='utf8') 

However, this may still show "object" when you print the dtypes. To confirm that the columns actually hold unicode strings, run this line after reading the CSV:

 df.apply(lambda x: pd.lib.infer_dtype(x.values)) 

Example output:

    args            unicode
    date         datetime64
    host            unicode
    kwargs          unicode
    operation       unicode
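
On recent pandas versions, `pd.lib` is, as far as I know, no longer exposed, so the check above may fail with an AttributeError. A rough equivalent uses the public `pandas.api.types.infer_dtype`. This is only a sketch: the in-memory CSV and its column values are made up for illustration and stand in for the file from the question.

```python
import io

import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical sample data standing in for 1459966468_324.csv.
raw = "text,retweet_count\nVoilà un café,3\nnaïve façade,0\n".encode("utf-8")

# Decode the bytes as UTF-8 while parsing.
df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# infer_dtype looks at the values in each column rather than at the
# container dtype that df.dtypes reports.
inferred = df.apply(lambda col: infer_dtype(col))
print(inferred)
```

On Python 3 the text column should be reported as "string" rather than "unicode", because every `str` is already Unicode there.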

Use the encoding keyword with the appropriate value:

 df = pd.read_csv('1459966468_324.csv', encoding='utf8') 
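
To see why the value passed to `encoding` matters, here is a small sketch that writes a UTF-8 file and reads it back twice, once with the right encoding and once with a wrong one. The file name `tweets.csv` and its contents are invented for the example.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file with one non-ASCII value, saved as UTF-8.
    path = os.path.join(tmp, "tweets.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("text\ncafé\n")

    good = pd.read_csv(path, encoding="utf-8")    # matches the file
    bad = pd.read_csv(path, encoding="latin-1")   # deliberately wrong

print(good.loc[0, "text"])
print(bad.loc[0, "text"])
```

The second read does not raise an error; it silently yields garbled characters, which is why it is worth passing the encoding explicitly when you know it.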

Pandas stores strings in columns of dtype object. In Python 3, all strings are Unicode by default. Therefore, if you are using Python 3, your data is already Unicode (do not be misled by the object dtype).

If you are on Python 2, use df = pd.read_csv('your_file', encoding='utf8'). Then try, for example, pd.lib.infer_dtype(df.iloc[0,0]) (I assume that the first column consists of strings).
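
A quick way to convince yourself of the Python 3 point is to look at the Python type of the individual values instead of the column dtype. This is a minimal sketch with made-up data; the reported column dtype can differ between pandas versions, so the code only prints it.

```python
import pandas as pd

# Hypothetical column of non-ASCII text.
df = pd.DataFrame({"text": ["héllo", "wörld"]})

# The column dtype describes the container, not the encoding.
print(df["text"].dtype)

# Each element is a Python 3 str, which is Unicode by definition.
all_str = all(isinstance(value, str) for value in df["text"])
print(all_str)
```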


Source: https://habr.com/ru/post/1246591/

