I have a tab-separated file with a column that should be interpreted as a string, but many of the entries are integers. With small files, read_csv correctly interprets the column as a string after seeing some non-integer values, but with larger files this does not work:
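    import pandas as pd

    # 300000 rows: column 'a' is mostly the string '1', with a block of 'X' in the middle
    df = pd.DataFrame({'a': ['1']*100000 + ['X']*100000 + ['1']*100000,
                       'b': ['b']*300000})
    df.to_csv('test', sep='\t', index=False, na_rep='NA')
    df2 = pd.read_csv('test', sep='\t')
    print(df2['a'].unique())
    # Inspect the values on either side of row 262144
    for a in df2['a'][262140:262150]:
        print(repr(a))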
output:
    ['1' 'X' 1]
    '1'
    '1'
    '1'
    '1'
    1
    1
    1
    1
    1
    1
Interestingly, 262144 is a power of 2 (2^18), so I think type inference and conversion happen in chunks, but some chunks are being skipped.
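To pin down where the values change type, a small helper along these lines could be used. This is only a sketch: `type_boundaries` and `is_power_of_two` are hypothetical names introduced here, not pandas functions, and the sample list is a tiny stand-in for the real column.

```python
def type_boundaries(values):
    """Return the indices where an element's Python type differs from the previous element's."""
    boundaries = []
    previous_type = None
    for index, value in enumerate(values):
        current_type = type(value)
        if previous_type is not None and current_type is not previous_type:
            boundaries.append(index)
        previous_type = current_type
    return boundaries


def is_power_of_two(n):
    """True if n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


# Tiny stand-in for the column: strings first, then integers.
sample = ['1'] * 4 + [1] * 6
print(type_boundaries(sample))      # [4]
print(is_power_of_two(262144))      # True
print(262144 == 2 ** 18)            # True
```

Calling `type_boundaries(df2['a'])` on the frame from the example above should list the row numbers where the switch happens, which makes it easy to see whether they fall on power-of-two boundaries.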
I'm fairly sure this is a bug, but I would like a workaround, perhaps one that uses quoting, although adding quoting=csv.QUOTE_NONNUMERIC for both reading and writing does not fix the problem. Ideally, I could work around this by quoting my string data and somehow forcing pandas not to do any inference on quoted data.
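For reference, one way to express "do no inference on this column" that does not rely on quoting is the `dtype` argument of `read_csv`. The sketch below assumes a current Python 3 and a pandas version whose parser honours a per-column `dtype` mapping; it uses an in-memory buffer instead of the `test` file purely to stay self-contained, and I have not confirmed how 0.12.0 behaves here.

```python
import io

import pandas as pd

# Same shape as the data above, written to an in-memory buffer rather than to disk.
df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})
buffer = io.StringIO()
df.to_csv(buffer, sep='\t', index=False, na_rep='NA')
buffer.seek(0)

# Ask the parser to keep column 'a' as strings instead of inferring a type per chunk.
df2 = pd.read_csv(buffer, sep='\t', dtype={'a': str})

# Collect the distinct Python types found in the column.
value_types = set(type(v) for v in df2['a'])
print(value_types)
```

Note that fields matching the NA markers (such as `NA`) would still be read as missing values rather than strings; that does not matter for this test data, which contains none.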
Using pandas 0.12.0