Pandas read_csv ignoring column dtypes when passing skip_footer arg

When I try to import a CSV file into a dataframe, pandas (0.13.1) ignores the dtype parameter. Is there a way to stop pandas from inferring data types on its own?

I merge several CSV files, and sometimes the customer number contains letters, so pandas imports that column as a string. When I try to combine two data frames, I get an error because I am trying to combine two different types. I need everything stored as strings.
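For illustration, here is a minimal sketch with made-up values (not the real files) of how the mismatch surfaces when combining:

    import pandas as pd

    # One file's CUSTOMER column is inferred as int64, another as object
    # (strings), because the second file contains letters such as "0317A".
    a = pd.DataFrame({"CUSTOMER": [3106, 3175], "ORDER NO": [253734, 262207]})
    b = pd.DataFrame({"CUSTOMER": ["03106", "0317A"], "ORDER NO": ["253734", "262208"]})

    # The int 3106 never equals the string "03106", so old pandas silently
    # matches nothing here; recent versions raise a ValueError instead.
    print(pd.merge(a, b, on="CUSTOMER"))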

Data Snapshot:

| WAREHOUSE | ERROR | CUSTOMER | ORDER NO |
|-----------|-------|----------|----------|
| 3615      |       | 03106    | 253734   |
| 3615      |       | 03156    | 290550   |
| 3615      |       | 03175    | 262207   |
| 3615      |       | 03175    | 262207   |
| 3615      |       | 03175    | 262207   |
| 3615      |       | 03175    | 262207   |
| 3615      |       | 03175    | 262207   |
| 3615      |       | 03175    | 262207   |
| 3615      |       | 03175    | 262207   |

Import Line:

 df = pd.read_csv("SomeFile.csv", header=1, skip_footer=1, usecols=[2, 3], dtype={'ORDER NO': str, 'CUSTOMER': str}) 

df.dtypes outputs this:

    ORDER NO    int64
    CUSTOMER    int64
    dtype: object
2 answers

Pandas 0.13.1 silently ignores the dtype argument because the C engine does not support skip_footer. This makes pandas fall back to the Python engine, which does not support dtype.
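A quick way to confirm the diagnosis (a sketch, assuming the same file): drop skip_footer so the C engine stays in play; dtype is then honored, and the footer row that lands in the data can be trimmed off manually.

    import pandas as pd

    # With skip_footer gone, the C engine is used and dtype is respected;
    # the trailing footer row is read as data, so slice it off afterwards.
    df = pd.read_csv("SomeFile.csv", header=1, usecols=[2, 3],
                     dtype={'ORDER NO': str, 'CUSTOMER': str})
    df = df.iloc[:-1]    # drop the footer row skip_footer would have skipped
    print(df.dtypes)     # both columns: object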

The solution? Use converters:

    df = pd.read_csv('SomeFile.csv', header=1, skip_footer=1, usecols=[2, 3],
                     converters={'CUSTOMER': str, 'ORDER NO': str},
                     engine='python')

Output:

    In [1]: df.dtypes
    Out[1]:
    CUSTOMER    object
    ORDER NO    object
    dtype: object

    In [2]: type(df['CUSTOMER'][0])
    Out[2]: str

    In [3]: df.head()
    Out[3]:
      CUSTOMER ORDER NO
    0    03106   253734
    1    03156   290550
    2    03175   262207
    3    03175   262207
    4    03175   262207

The leading 0 from the source file is preserved, and all the data is stored as strings.
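For what it's worth, on newer pandas releases this workaround is no longer needed: the argument is now spelled skipfooter, and from roughly 0.20 onward the Python engine accepts dtype directly, so (assuming the same file layout) the original call works as intended:

    import pandas as pd

    # skipfooter still forces the Python engine, but that engine now
    # supports dtype, so converters are unnecessary.
    df = pd.read_csv("SomeFile.csv", header=1, skipfooter=1, usecols=[2, 3],
                     dtype={'ORDER NO': str, 'CUSTOMER': str}, engine='python')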


Unfortunately, using converters or newer pandas versions does not solve the more general problem of ensuring that read_csv never infers a float64 type. With pandas 0.15.2, the following example, using a CSV containing integers in hexadecimal notation with NULL entries, shows that using converters for what their name implies they should be used for interferes with the dtype specification.

    In [1]: df = pd.DataFrame(dict(a=["0xff", "0xfe"], b=["0xfd", None],
       ...:                        c=[None, "0xfc"], d=[None, None]))

    In [2]: df.to_csv("H:/tmp.csv", index=False)

    In [3]: ef = pd.read_csv("H:/tmp.csv",
       ...:                  dtype={c: object for c in "abcd"},
       ...:                  converters={c: lambda x: None if x == "" else int(x, 16)
       ...:                              for c in "abcd"})

    In [4]: ef.dtypes.map(lambda x: x)
    Out[4]:
    a      int64
    b    float64
    c    float64
    d     object
    dtype: object

The specified object dtype is only respected for the all-NULL column. In this case the float64 values could simply be converted back to integers, but by the pigeonhole principle, not all 64-bit integers can be represented exactly as float64.
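A one-liner illustrates the pigeonhole point: float64 has a 53-bit mantissa, so sufficiently large 64-bit integers no longer round-trip:

    # 2**53 + 1 is the first integer float64 cannot represent exactly.
    n = 2**53 + 1
    print(float(n) == n)    # False
    print(int(float(n)))    # 9007199254740992, off by one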

The best solution I have found for this more general case is to get pandas to read the potentially problematic columns as strings, as already covered, and then convert the slice of values that actually need conversion (rather than mapping the conversion over the whole column, since that again triggers automatic float64 inference).

    In [5]: ff = pd.read_csv("H:/tmp.csv",
       ...:                  dtype={c: object for c in "bc"},
       ...:                  converters={c: lambda x: None if x == "" else int(x, 16)
       ...:                              for c in "ad"})

    In [6]: ff.dtypes
    Out[6]:
    a     int64
    b    object
    c    object
    d    object
    dtype: object

    In [7]: for c in "bc":
       ...:     ff.loc[~pd.isnull(ff[c]), c] = ff[c][~pd.isnull(ff[c])].map(lambda x: int(x, 16))
       ...:

    In [8]: ff.dtypes
    Out[8]:
    a     int64
    b    object
    c    object
    d    object
    dtype: object

    In [9]: [(ff[c][i], type(ff[c][i])) for c in ff.columns for i in ff.index]
    Out[9]:
    [(255, numpy.int64),
     (254, numpy.int64),
     (253L, long),
     (nan, float),
     (nan, float),
     (252L, long),
     (None, NoneType),
     (None, NoneType)]

As far as I was able to determine, at least up to version 0.15.2, there is no way to avoid post-processing the string values in situations like this.
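On pandas versions released long after this answer (1.0 and later), the nullable Int64 extension dtype finally covers this case, integers alongside missing values, without float64 creeping in. A sketch reusing the hex CSV from above:

    import pandas as pd

    # Parse the hex strings with converters as before, then move every
    # column into the nullable Int64 dtype, which stores integers plus <NA>.
    conv = lambda x: None if x == "" else int(x, 16)
    gf = pd.read_csv("H:/tmp.csv", converters={c: conv for c in "abcd"})
    gf = gf.astype("Int64")
    print(gf.dtypes)    # a, b, c, d all Int64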

