Pandas read_csv ignoring column dtypes when passing skip_footer arg

When I try to import a CSV file into a dataframe, pandas (0.13.1) ignores the dtype parameter. Is there a way to stop pandas from inferring data types on its own?

I merge several CSV files, and sometimes the customer number contains letters, so pandas imports that column as a string. When I try to combine two data frames, I get an error because I am trying to combine two different types. I need everything stored as strings.
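For illustration, here is a minimal sketch with made-up values (not the real files) of how the mismatch surfaces when combining:

    import pandas as pd

    # One file's CUSTOMER column is inferred as int64, another as object
    # (strings), because the second file contains letters such as "0317A".
    a = pd.DataFrame({"CUSTOMER": [3106, 3175], "ORDER NO": [253734, 262207]})
    b = pd.DataFrame({"CUSTOMER": ["03106", "0317A"], "ORDER NO": ["253734", "262208"]})

    # The int 3106 never equals the string "03106", so old pandas silently
    # matches nothing here; recent versions raise a ValueError instead.
    print(pd.merge(a, b, on="CUSTOMER"))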

Data Snapshot:

| WAREHOUSE | ERROR | CUSTOMER | ORDER NO |
|-----------|-------|----------|----------|
| 3615      |       | 03106    | 253734   |
| 3615      |       | 03156    | 290550   |
| 3615      |       | 03175    | 262207   |
| 3615      |       | 03175    | 262207   |
| 3615      |       | 03175    | 262207   |
| 3615      |       | 03175    | 262207   |
| 3615      |       | 03175    | 262207   |
| 3615      |       | 03175    | 262207   |
| 3615      |       | 03175    | 262207   |

Import Line:

 df = pd.read_csv("SomeFile.csv", header=1, skip_footer=1, usecols=[2, 3], dtype={'ORDER NO': str, 'CUSTOMER': str}) 

df.dtypes outputs this:

    ORDER NO    int64
    CUSTOMER    int64
    dtype: object
2 answers

Pandas 0.13.1 silently ignores the dtype argument because the C engine does not support skip_footer. This makes pandas fall back to the Python engine, which does not support dtype.
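A quick way to confirm the diagnosis (a sketch, assuming the same file): drop skip_footer so the C engine stays in play; dtype is then honored, and the footer row that lands in the data can be trimmed off manually.

    import pandas as pd

    # With skip_footer gone, the C engine is used and dtype is respected;
    # the trailing footer row is read as data, so slice it off afterwards.
    df = pd.read_csv("SomeFile.csv", header=1, usecols=[2, 3],
                     dtype={'ORDER NO': str, 'CUSTOMER': str})
    df = df.iloc[:-1]    # drop the footer row skip_footer would have skipped
    print(df.dtypes)     # both columns: object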

The solution? Use converters:

    df = pd.read_csv('SomeFile.csv', header=1, skip_footer=1, usecols=[2, 3],
                     converters={'CUSTOMER': str, 'ORDER NO': str},
                     engine='python')

Output:

    In [1]: df.dtypes
    Out[1]:
    CUSTOMER    object
    ORDER NO    object
    dtype: object

    In [2]: type(df['CUSTOMER'][0])
    Out[2]: str

    In [3]: df.head()
    Out[3]:
      CUSTOMER ORDER NO
    0    03106   253734
    1    03156   290550
    2    03175   262207
    3    03175   262207
    4    03175   262207

The leading 0 from the source file is preserved, and all the data is stored as strings.
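For what it's worth, on newer pandas releases this workaround is no longer needed: the argument is now spelled skipfooter, and from roughly 0.20 onward the Python engine accepts dtype directly, so (assuming the same file layout) the original call works as intended:

    import pandas as pd

    # skipfooter still forces the Python engine, but that engine now
    # supports dtype, so converters are unnecessary.
    df = pd.read_csv("SomeFile.csv", header=1, skipfooter=1, usecols=[2, 3],
                     dtype={'ORDER NO': str, 'CUSTOMER': str}, engine='python')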


Unfortunately, using converters or newer pandas versions does not solve the more general problem of ensuring that read_csv never infers a float64 type. With pandas 0.15.2, the following example, using a CSV containing integers in hexadecimal notation with NULL entries, shows that using converters for what their name implies they should be used for interferes with the dtype specification.

    In [1]: df = pd.DataFrame(dict(a=["0xff", "0xfe"], b=["0xfd", None],
       ...:                        c=[None, "0xfc"], d=[None, None]))

    In [2]: df.to_csv("H:/tmp.csv", index=False)

    In [3]: ef = pd.read_csv("H:/tmp.csv",
       ...:                  dtype={c: object for c in "abcd"},
       ...:                  converters={c: lambda x: None if x == "" else int(x, 16)
       ...:                              for c in "abcd"})

    In [4]: ef.dtypes.map(lambda x: x)
    Out[4]:
    a      int64
    b    float64
    c    float64
    d     object
    dtype: object

The specified object dtype is only respected for the all-NULL column. In this case the float64 values could simply be converted back to integers, but by the pigeonhole principle, not all 64-bit integers can be represented exactly as float64.
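A one-liner illustrates the pigeonhole point: float64 has a 53-bit mantissa, so sufficiently large 64-bit integers no longer round-trip:

    # 2**53 + 1 is the first integer float64 cannot represent exactly.
    n = 2**53 + 1
    print(float(n) == n)    # False
    print(int(float(n)))    # 9007199254740992, off by one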

The best solution I have found for this more general case is to get pandas to read the potentially problematic columns as strings, as already covered, and then convert the slice of values that actually need conversion (rather than mapping the conversion over the whole column, since that again triggers automatic float64 inference).

    In [5]: ff = pd.read_csv("H:/tmp.csv",
       ...:                  dtype={c: object for c in "bc"},
       ...:                  converters={c: lambda x: None if x == "" else int(x, 16)
       ...:                              for c in "ad"})

    In [6]: ff.dtypes
    Out[6]:
    a     int64
    b    object
    c    object
    d    object
    dtype: object

    In [7]: for c in "bc":
       ...:     ff.loc[~pd.isnull(ff[c]), c] = ff[c][~pd.isnull(ff[c])].map(lambda x: int(x, 16))
       ...:

    In [8]: ff.dtypes
    Out[8]:
    a     int64
    b    object
    c    object
    d    object
    dtype: object

    In [9]: [(ff[c][i], type(ff[c][i])) for c in ff.columns for i in ff.index]
    Out[9]:
    [(255, numpy.int64),
     (254, numpy.int64),
     (253L, long),
     (nan, float),
     (nan, float),
     (252L, long),
     (None, NoneType),
     (None, NoneType)]

As far as I was able to determine, at least up to version 0.15.2, there is no way to avoid post-processing the string values in situations like this.
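On pandas versions released long after this answer (1.0 and later), the nullable Int64 extension dtype finally covers this case, integers alongside missing values, without float64 creeping in. A sketch reusing the hex CSV from above:

    import pandas as pd

    # Parse the hex strings with converters as before, then move every
    # column into the nullable Int64 dtype, which stores integers plus <NA>.
    conv = lambda x: None if x == "" else int(x, 16)
    gf = pd.read_csv("H:/tmp.csv", converters={c: conv for c in "abcd"})
    gf = gf.astype("Int64")
    print(gf.dtypes)    # a, b, c, d all Int64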

