How to read UTF-8 files using Pandas?

Question

I have a UTF-8 file with Twitter data, and I'm trying to read it into a pandas data frame, but the columns come back with dtype "object" instead of unicode strings:

 # file 1459966468_324.csv
 # 1459966468_324.csv: UTF-8 Unicode English text
 df = pd.read_csv('1459966468_324.csv', dtype={'text': unicode})
 df.dtypes
 text               object
 Airline            object
 name               object
 retweet_count     float64
 sentiment          object
 tweet_location     object
 dtype: object

What is the correct way to read and force UTF-8 data into unicode using Pandas?

This does not solve the problem:

 df = pd.read_csv('1459966468_324.csv', encoding='utf8')
 df.apply(lambda x: pd.lib.infer_dtype(x.values))

The text file is here: https://raw.githubusercontent.com/l1x/nlp/master/1459966468_324.csv

+17

python pandas csv utf-8

Isstvan Apr 6 '16 at 21:39

3 answers

Use the encoding keyword argument with the appropriate value:

 df = pd.read_csv('1459966468_324.csv', encoding='utf8')

+4

Stefan Apr 6 '16 at 21:43

Pandas stores strings in columns of dtype object. In Python 3, all strings are Unicode by default. Therefore, if you are using Python 3, your data is already Unicode (do not be misled by the object dtype).

If you are on Python 2, use df = pd.read_csv('your_file', encoding='utf8'). Then try, for example, pd.lib.infer_dtype(df.iloc[0,0]) (I assume that the first column consists of strings).
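To illustrate the Python 3 point, here is a minimal sketch using only the standard library. The sample text is made up for illustration and is not taken from the question's file. It shows that decoding UTF-8 bytes yields an ordinary str, which is already Unicode:

```python
# Hypothetical sample: the UTF-8 bytes a tweet might occupy on disk.
raw = "Vol retardé ✈".encode("utf-8")

# Decoding gives a Python 3 str, i.e. a sequence of Unicode code points.
text = raw.decode("utf-8")

print(type(raw).__name__, len(raw))    # bytes object and its length in bytes
print(type(text).__name__, len(text))  # str object and its length in characters
```

The byte count is larger than the character count because the non-ASCII characters take more than one byte each in UTF-8.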

+1

ptrj Apr 7 '16 at 0:21

Sam · Accepted Answer · 2016-04-06T21:46:26+0000

As another poster mentioned, you can try:

 df = pd.read_csv('1459966468_324.csv', encoding='utf8')

However, this may still show "object" when you print the dtypes. To confirm that the values are Unicode, try this line after reading the CSV:

 df.apply(lambda x: pd.lib.infer_dtype(x.values))

Example output:

 args            unicode
 date         datetime64
 host            unicode
 kwargs          unicode
 operation       unicode
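The check above was written for the pandas of 2016. In current pandas releases pd.lib is no longer available; the public location of the helper is pandas.api.types.infer_dtype, and on Python 3 it reports "string" rather than "unicode". A hedged sketch of the same check follows. It uses a small in-memory CSV whose rows are invented for illustration (only the column names text and retweet_count come from the question):

```python
import io

import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical stand-in for the real file: UTF-8 encoded CSV bytes.
raw = "text,retweet_count\nVol retardé ✈,2.0\nThanks a lot,0.0\n".encode("utf-8")

# read_csv accepts a binary buffer and decodes it with the given encoding.
df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# Inspect the Python-level values of each column instead of trusting dtypes.
kinds = {name: infer_dtype(df[name].tolist()) for name in df.columns}
print(kinds)
```

With a real file, the same call would take the file path in place of the BytesIO buffer.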
