Pandas read_csv dtype inference issue

I have a tab-delimited file with a column that should be interpreted as a string, but many of the entries are integers. With small files, read_csv correctly interprets the column as a string after seeing some non-integer values, but this does not work with large files:

    import pandas as pd

    df = pd.DataFrame({'a': ['1']*100000 + ['X']*100000 + ['1']*100000,
                       'b': ['b']*300000})
    df.to_csv('test', sep='\t', index=False, na_rep='NA')
    df2 = pd.read_csv('test', sep='\t')
    print(df2['a'].unique())
    for a in df2['a'][262140:262150]:
        print(repr(a))

output:

    ['1' 'X' 1]
    '1'
    '1'
    '1'
    '1'
    1
    1
    1
    1
    1
    1

Interestingly, 262144 is a power of 2 (2**18), so I think inference and conversion happen in chunks, and some chunks are not converted the same way as the others.
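One way to confirm that a column really holds a mix of Python types is to count the type of each value. This is only a minimal sketch: the helper name count_value_types is hypothetical, and the short series is a stand-in for df2['a'] from the example above.

```python
import pandas as pd


def count_value_types(series):
    """Return a dict mapping each Python type name found in the series to its count."""
    return series.map(lambda v: type(v).__name__).value_counts().to_dict()


# Stand-in for a column that came back from read_csv with mixed str/int values
mixed = pd.Series(['1', 'X', 1, 1], dtype=object)
type_counts = count_value_types(mixed)
print(type_counts)
```

More than one key in the result means the column contains values of more than one type.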

I'm fairly sure this is a bug, but I need a workaround, perhaps one that uses quoting, although adding quoting=csv.QUOTE_NONNUMERIC for both reading and writing does not fix the problem. Ideally, I could work around this by quoting my string data and somehow forcing pandas not to do any inference on quoted data.

Using pandas 0.12.0

2 answers

You've tricked the read_csv parser here (and to be fair, I don't think it can always be expected to give the right result no matter what you throw at it), but yes, it could be a bug!

As @Steven points out, you can use the converters argument of read_csv:

 df2 = pd.read_csv('test', sep='\t', converters={'a': str}) 

The lazy solution is to simply fix this after you read in the file:

    In [11]: df2['a'] = df2['a'].astype('str')

    # now they are equal
    In [12]: pd.util.testing.assert_frame_equal(df, df2)
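The effect of the astype fix can be seen on a tiny column. This is a sketch only: the short series stands in for the mixed column that read_csv returned.

```python
import pandas as pd

# Stand-in for a column that read_csv returned with mixed str/int values
mixed = pd.Series(['1', 'X', 1], dtype=object)

# Convert every value to a string, as in the fix above
fixed = mixed.astype(str)

print(fixed.tolist())
```

After the conversion, the integer 1 and the string '1' are the same value, so the column compares equal to the original data.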

Note: If you are looking for a way to store DataFrames, e.g. between sessions, both pickle and HDFStore are excellent solutions that won't be affected by this kind of parsing bug (and will be considerably faster). See: How to store data frame using PANDAS, Python
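A pickle round trip might look like the sketch below. The file name frame.pkl and the temporary directory are assumptions made for illustration. It uses pd.testing.assert_frame_equal, which is where recent pandas versions expose the comparison helper; on 0.12.0 the pd.util.testing path shown above would apply instead.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'a': ['1', 'X', '1'], 'b': ['b', 'b', 'b']})

# Write to a temporary location (hypothetical file name) and read it back
path = os.path.join(tempfile.mkdtemp(), 'frame.pkl')
df.to_pickle(path)
restored = pd.read_pickle(path)

# No text parsing is involved, so the string column comes back unchanged
pd.testing.assert_frame_equal(df, restored)
```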


To stop pandas from inferring the data type, pass the converters argument to read_csv:

converters : dict, optional

Dict of functions for converting values in certain columns. Keys can either be integers or column labels

For your file, it will look like this:

 df2 = pd.read_csv('test', sep='\t', converters={'a':str}) 

My reading of the docs is that you do not need to specify converters for every column. Pandas should continue to infer the data type of the unspecified columns.
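A small in-memory check of that reading, as a sketch: the three-row data string is made up for illustration and stands in for the real file.

```python
import io

import pandas as pd

# Small in-memory stand-in for the tab-separated file
data = "a\tb\n1\t10\nX\t20\n1\t30\n"

df2 = pd.read_csv(io.StringIO(data), sep='\t', converters={'a': str})

print(df2['a'].tolist())  # values in 'a' are kept as strings
print(df2['b'].dtype)     # 'b' has no converter, so its dtype is still inferred
```

read_csv also accepts a dtype argument (e.g. dtype={'a': str}); whether it behaves identically to converters on 0.12.0 is not something the answers here confirm.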


Source: https://habr.com/ru/post/952612/

