Pandas: locating the row that causes an error

I am new to Pandas and trying to figure out where my code breaks. Say I'm doing a type conversion:

df['x']=df['x'].astype('int') 

... and I get the error message "ValueError: invalid literal for long() with base 10: '1.0692e+06'".

In general, if I have 1000 records in a DataFrame, how can I find out which record is causing the failure? Is there anything in ipdb that shows the current location (i.e. where the code broke)? Basically, I am trying to pinpoint which value cannot be converted to int.

2 answers

The error you are seeing is probably caused by the value(s) in the x column being strings:

    In [15]: df = pd.DataFrame({'x':['1.0692e+06']})

    In [16]: df['x'].astype('int')
    ValueError: invalid literal for long() with base 10: '1.0692e+06'

Ideally, the problem can be avoided altogether by making sure the values stored in the DataFrame are not strings in the first place, at the time the DataFrame is built. How to do that depends, of course, on how you are building the DataFrame.

After the fact, the DataFrame could be fixed using applymap:

    import ast
    df = df.applymap(ast.literal_eval).astype('int')

but calling ast.literal_eval for each value in the DataFrame can be slow, so the best option is to fix the problem from the start.
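If every string is a valid numeric literal, a vectorized route may be worth trying instead of ast.literal_eval. The sketch below swaps in pd.to_numeric, which parses exponent notation into floats, and then casts to an integer dtype. The sample data is made up, and rounding before the cast is an assumption about what you want to happen to any fractional part.

```python
import pandas as pd

# Made-up sample data: numeric values stored as strings
df = pd.DataFrame({'x': ['1.0692e+06', '42', '3.0']})

# Parse the strings as numbers; exponent notation yields floats
as_float = pd.to_numeric(df['x'])

# Round to the nearest whole number, then cast to a 64-bit integer dtype
df['x'] = as_float.round().astype('int64')

print(df['x'].tolist())
```

Note that pd.to_numeric raises by default if a string cannot be parsed at all, so this only helps when the values are numeric but in a format int() rejects.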


Usually you can drop into a debugger when an exception is raised and inspect the problematic value there.

However, in this case the exception occurs inside the astype call, which is a thin wrapper around C-compiled code. The compiled code does the looping over the values in df['x'], so the Python debugger does not help here: it will not let you inspect which value raised the exception from within the C-compiled code.

Many important parts of Pandas and NumPy are written in C, C++, Cython or Fortran, and the Python debugger will not take you inside those non-Python pieces of code, which is where the fast loops run.

So instead I would fall back on a low-brow solution: iterate over the values in a Python loop and use try...except to report each value that fails to convert:

    import pandas as pd

    df = pd.DataFrame({'x':['1.0692e+06']})
    for i, item in enumerate(df['x']):
        try:
            int(item)
        except ValueError:
            # report the position and the offending value
            print('ERROR at index {}: {!r}'.format(i, item))

gives

 ERROR at index 0: '1.0692e+06' 
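The same loop can be wrapped in a small helper that collects every failing value instead of printing as it goes. This is only a sketch: find_unconvertible is a hypothetical name rather than a pandas function, and the sample data is invented. It reports index labels (via Series.items) rather than positions, which tends to be more useful when the index is not a plain 0..n-1 range.

```python
import pandas as pd

def find_unconvertible(series, converter=int):
    """Return (index label, value) pairs for which converter raises.

    Hypothetical helper, not part of pandas.
    """
    failures = []
    for label, value in series.items():
        try:
            converter(value)
        except (ValueError, TypeError):
            failures.append((label, value))
    return failures

# Invented sample data with two values that int() cannot parse
df = pd.DataFrame({'x': ['12', '1.0692e+06', '7', 'abc']})

bad = find_unconvertible(df['x'])
for label, value in bad:
    print('ERROR at index {}: {!r}'.format(label, value))
```

Once the offending labels are known, df.loc[[label for label, _ in bad]] shows the full rows for inspection.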

To report all rows for which the function fails with some exception:

    df.apply(my_function, axis=1)  # raises various exceptions at unknown rows

    # print the index, the row content, and the exception
    for i, row in df.iterrows():
        try:
            my_function(row)
        except Exception as e:
            print('Error at index {}: {!r}'.format(i, row))
            print(e)
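For a self-contained illustration of the row-wise approach, here is a sketch with a stand-in my_function. The function body and the sample data are assumptions made up for the example; the point is only that each row is tried separately and the failures are gathered with their index label and exception type.

```python
import pandas as pd

def my_function(row):
    # Hypothetical stand-in: fails when 'x' is not an integer literal
    return int(row['x']) + row['y']

# Invented sample data; the middle row is deliberately bad
df = pd.DataFrame({'x': ['1', 'oops', '3'], 'y': [10, 20, 30]})

errors = []
for label, row in df.iterrows():
    try:
        my_function(row)
    except Exception as e:
        # Keep the index label, the exception type and its message
        errors.append((label, type(e).__name__, str(e)))

for label, name, message in errors:
    print('Error at index {}: {}: {}'.format(label, name, message))
```

Collecting the failures in a list, rather than printing inside the loop, makes it easy to count them or filter the DataFrame afterwards.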

Source: https://habr.com/ru/post/977496/

