Pandas IndexError / TypeError mismatch with NaN values

Question

I have several Series of variable-length lists with some missing values. For instance:

    In [108]: s0 = pd.Series([['a', 'b'], ['c'], np.nan])

    In [109]: s0
    Out[109]:
    0    [a, b]
    1       [c]
    2       NaN
    dtype: object

but another contains only NaNs:

    In [110]: s1 = pd.Series([np.nan, np.nan])

    In [111]: s1
    Out[111]:
    0   NaN
    1   NaN
    dtype: float64

I need the last item in each list, which is simple:

    In [112]: s0.map(lambda x: x[-1] if isinstance(x, list) else x)
    Out[112]:
    0      b
    1      c
    2    NaN
    dtype: object

But before I added the isinstance check, I noticed that when indexing chokes on the NaNs, it fails differently on s0 and s1:

    In [113]: s0.map(lambda x: x[-1])
    ...
    TypeError: 'float' object is not subscriptable

    In [114]: s1.map(lambda x: x[-1])
    ...
    IndexError: invalid index to scalar variable.

Can someone explain why? Is this a bug? I am using pandas 0.16.2 and Python 3.4.3.

python pandas exception indexing nan

majr Dec 21 '15 at 14:29

1 answer

Alex Riley · Accepted Answer · 2015-12-21T17:50:53+0000

At its core, this is really a NumPy problem, not a pandas problem.

map iterates over the values in the column and passes them to the lambda function one at a time. Underneath the columns and rows in pandas there are just (slices of) NumPy arrays, so pandas defines the following helper function to fetch a value from the underlying array for the function. map calls it on each iteration:

    PANDAS_INLINE PyObject*
    get_value_1d(PyArrayObject* ap, Py_ssize_t i) {
        char *item = (char *) PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
        return PyArray_Scalar(item, PyArray_DESCR(ap), (PyObject*) ap);
    }

The key bit is PyArray_Scalar , which is a NumPy API function that copies a section of a NumPy array to return a scalar value.

The code that makes up this function is too long to post here, but it can be found in the NumPy codebase. All we need to know is that the scalar it returns matches the dtype of the array it was taken from.

Back to your Series: s0 has object dtype, and s1 has float64 dtype. This means that PyArray_Scalar returns a different type of scalar for each Series: a plain Python float and a NumPy scalar, respectively:

    >>> type(s0[2])
    float
    >>> type(s1[0])
    numpy.float64

The NaN values come back as two different types, so you get different errors when the lambda function tries to index them.
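The difference can be reproduced with NumPy alone, without going through map. The following is only a minimal sketch: the helper indexing_error and the names obj_arr, float_arr, py_nan and np_nan are made up for illustration and are not part of pandas or NumPy.

```python
import numpy as np


def indexing_error(value):
    """Return the name of the exception raised by value[-1], or None."""
    # Hypothetical helper, written only for this illustration.
    try:
        value[-1]
    except (TypeError, IndexError) as exc:
        return type(exc).__name__
    return None


# An object array stores references to Python objects,
# so its NaN is handed back as a plain Python float.
obj_arr = np.empty(3, dtype=object)
obj_arr[0] = ['a', 'b']
obj_arr[1] = ['c']
obj_arr[2] = float('nan')

# A float64 array hands back NumPy scalars instead.
float_arr = np.array([np.nan, np.nan])

py_nan = obj_arr[2]
np_nan = float_arr[0]

# Print each scalar's type next to the error raised by indexing it.
print(type(py_nan).__name__, indexing_error(py_nan))
print(type(np_nan).__name__, indexing_error(np_nan))
```

The Python float should report a TypeError and the numpy.float64 an IndexError, which are the two errors from the question. Whether map on a float64 Series actually hands your function a NumPy scalar may depend on the pandas version, so treat this as an illustration of the scalar types rather than of map itself.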
