Pandas IndexError / TypeError mismatch with NaN values

I have several Series of variable-length lists with some missing values (NaN). For instance:

    In [108]: s0 = pd.Series([['a', 'b'], ['c'], np.nan])

    In [109]: s0
    Out[109]:
    0    [a, b]
    1       [c]
    2       NaN
    dtype: object

but another contains only NaNs:

    In [110]: s1 = pd.Series([np.nan, np.nan])

    In [111]: s1
    Out[111]:
    0   NaN
    1   NaN
    dtype: float64

I need the last item in each list, which is simple:

    In [112]: s0.map(lambda x: x[-1] if isinstance(x, list) else x)
    Out[112]:
    0      b
    1      c
    2    NaN
    dtype: object

But along the way I noticed that, without the isinstance check, the indexing chokes on the NaNs differently for s0 and s1:

    In [113]: s0.map(lambda x: x[-1])
    ...
    TypeError: 'float' object is not subscriptable

    In [114]: s1.map(lambda x: x[-1])
    ...
    IndexError: invalid index to scalar variable.

Can someone explain why? Is this a bug? I am using pandas 0.16.2 and Python 3.4.3.

1 answer

At its core, this is really a NumPy problem, not a pandas problem.

map iterates over the values in the column and passes them to the lambda function one at a time. Underneath, the columns/rows in pandas are just (slices of) NumPy arrays, so pandas defines the following helper function to fetch a value from the underlying array. map calls it on each iteration:

    PANDAS_INLINE PyObject*
    get_value_1d(PyArrayObject* ap, Py_ssize_t i) {
      char *item = (char *) PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
      return PyArray_Scalar(item, PyArray_DESCR(ap), (PyObject*) ap);
    }

The key bit is PyArray_Scalar, a NumPy API function that copies out a section of a NumPy array and returns it as a scalar value.

The code that makes up this function is too long to post here, but it can be found in the NumPy codebase. All we need to know is that the scalar it returns will match the dtype of the array it was taken from.

Back to your Series: s0 has object dtype, and s1 has float64 dtype. This means that PyArray_Scalar returns a different type of scalar for each Series: an actual Python float and a NumPy scalar, respectively:

    >>> type(s0[2])
    float
    >>> type(s1[0])
    numpy.float64
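The same difference can be seen without pandas, on bare NumPy arrays. The following is a minimal sketch, assuming only that NumPy is installed; the names obj_arr and float_arr are made up for illustration and simply mirror the contents of s0 and s1:

```python
import numpy as np

# Object-dtype array: each slot holds a reference to an arbitrary Python object.
obj_arr = np.empty(3, dtype=object)
obj_arr[0] = ['a', 'b']
obj_arr[1] = ['c']
obj_arr[2] = np.nan  # np.nan is itself a plain Python float

# float64 array: the values are stored as raw 8-byte floats.
float_arr = np.array([np.nan, np.nan])

obj_item = obj_arr[2]      # the stored Python object is handed back as-is
float_item = float_arr[0]  # the raw bytes are wrapped in a NumPy scalar

print(type(obj_item))
print(type(float_item))
```

Note that numpy.float64 subclasses Python's float, so isinstance(x, float) is true for both values; comparing type(x) directly is what tells them apart.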

The NaN values are therefore returned as two different types, which is why you get different errors when the lambda function tries to index them.
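Both errors can be reproduced on the scalars alone. This is a rough sketch rather than what pandas does internally: index_error_type is a hypothetical helper written for this illustration, and newer pandas versions may convert values before calling the mapped function, so the Series-level behaviour can differ from the versions in the question.

```python
import numpy as np

def index_error_type(value):
    """Return the name of the exception raised by value[-1], or None."""
    try:
        value[-1]
    except (TypeError, IndexError) as exc:
        return type(exc).__name__
    return None

python_nan = float('nan')      # what an object-dtype array hands back
numpy_nan = np.float64('nan')  # what a float64 array hands back

print(index_error_type(python_nan))  # TypeError
print(index_error_type(numpy_nan))   # IndexError
print(index_error_type(['a', 'b']))  # None (indexing a list succeeds)
```

A Python float has no indexing support at all, hence the TypeError. A NumPy scalar does accept the indexing syntax but rejects an integer index, hence the IndexError.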


Source: https://habr.com/ru/post/1238844/

