Pandas read_csv dtype inference

Question

I have a tab-delimited file with a column that should be interpreted as a string, but many of the entries are integers. With small files, read_csv correctly interprets the column as a string after seeing some non-integer values, but this does not work with large files:

    import pandas as pd

    df = pd.DataFrame({'a': ['1']*100000 + ['X']*100000 + ['1']*100000,
                       'b': ['b']*300000})
    df.to_csv('test', sep='\t', index=False, na_rep='NA')
    df2 = pd.read_csv('test', sep='\t')
    print(df2['a'].unique())
    for a in df2['a'][262140:262150]:
        print(repr(a))

output:

    ['1' 'X' 1]
    '1'
    '1'
    '1'
    '1'
    1
    1
    1
    1
    1
    1

Interestingly, 262144 is a power of 2, so I think inference and conversion happen in chunks, but some chunks are skipped.

I'm fairly sure this is a bug, but I would like a workaround, perhaps one that uses quoting, although adding quoting=csv.QUOTE_NONNUMERIC for reading and writing does not fix the problem. Ideally, I could work around this by quoting my string data and somehow forcing pandas not to do any inference on quoted data.

Using pandas 0.12.0

+6

python pandas parsing

andrew Aug 27 '13 at 17:25

2 answers

To stop pandas from inferring the data type, specify the converters argument to read_csv:

converters : dict, optional
Dict of functions for converting values in certain columns. Keys can be either integers or column labels.

For your file, it would look like this:

 df2 = pd.read_csv('test', sep='\t', converters={'a':str})

My reading of the docs is that you do not need to specify converters for every column. Pandas should continue to infer the data type of the unspecified columns.
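As a minimal sketch of the point above, the snippet below reads a small in-memory stand-in for the file (the extra numeric column c is hypothetical, added only to show inference still running on unspecified columns). It also tries dtype={'a': str} as an alternative, on the assumption that the installed pandas version accepts a per-column dtype in read_csv.

```python
import io

import pandas as pd

# Small in-memory stand-in for the tab-separated file; column 'c' is a
# hypothetical numeric column added for illustration.
data = "a\tb\tc\n1\tb\t10\nX\tb\t20\n1\tb\t30\n"

# Option 1: a converter is applied to every raw value in column 'a'.
via_converter = pd.read_csv(io.StringIO(data), sep="\t", converters={"a": str})

# Option 2 (assumption: read_csv accepts a per-column dtype mapping).
via_dtype = pd.read_csv(io.StringIO(data), sep="\t", dtype={"a": str})

print(via_converter["a"].tolist())
print(via_dtype["a"].tolist())

# Column 'c' was not mentioned, so pandas still infers an integer type for it.
print(pd.api.types.is_integer_dtype(via_converter["c"]))
```

Both approaches keep every value in column a as a string, while the unspecified columns go through the usual inference.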

+6

Steven Rumbalski Aug 27 '13 at 18:05

Andy Hayden · Accepted Answer · 2013-08-27T17:40:44+0000

You've tricked the read_csv parser here (and to be fair, I don't think it can always be expected to infer correctly no matter what you throw at it), but yes, it could be a bug!

As @Steven points out, you can use the converters argument of read_csv:

 df2 = pd.read_csv('test', sep='\t', converters={'a': str})

The lazy solution is to simply fix this after you read in the file:

    In [11]: df2['a'] = df2['a'].astype('str')

    # now they are equal
    In [12]: pd.util.testing.assert_frame_equal(df, df2)
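To check whether a column really ended up with mixed types before applying the fix, something like the following may help. It is only a sketch: the small frame named mixed is a hypothetical stand-in for the df2 produced in the question, built by hand so the example does not depend on how a particular pandas version parses the large file.

```python
import pandas as pd

# Hypothetical stand-in for the mixed column from the question:
# some values are str, some are int.
mixed = pd.DataFrame({"a": pd.Series(["1", "X", 1, 1], dtype=object)})

# Count the Python types present in the column to confirm it is mixed.
type_counts = mixed["a"].map(lambda v: type(v).__name__).value_counts().to_dict()
print(type_counts)

# The after-the-fact fix: convert every value to a string.
mixed["a"] = mixed["a"].astype(str)

# Only string values remain, so '1' and 1 are no longer distinct.
print(mixed["a"].drop_duplicates().tolist())
```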

Note: if you are looking for a way to store DataFrames, e.g. between sessions, both pickle and HDFStore are excellent solutions that will not be affected by these kinds of parsing bugs (and will be much faster). See: How to store a data frame using PANDAS, Python
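A pickle round trip might look like the sketch below. The file name frame.pkl and the temporary directory are illustrative choices, and pd.testing.assert_frame_equal assumes a more recent pandas than the 0.12.0 used in the question, where the same helper lived under pd.util.testing.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": ["1", "X", "1"], "b": ["b"] * 3})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")  # hypothetical file name
    df.to_pickle(path)                     # no text parsing is involved
    restored = pd.read_pickle(path)

# Raises AssertionError if values or dtypes differ after the round trip.
pd.testing.assert_frame_equal(df, restored)
print(restored["a"].tolist())
```

Because the frame is serialized as-is rather than written out as text, no type inference runs when it is loaded again.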
