Declarative-style data processing in pandas

I have a pandas DataFrame of vehicle coordinates (from several vehicles over several days). For each vehicle and each day I do one of two things: either apply an algorithm to that group, or filter the group out of the dataset entirely if it does not meet certain criteria.

To do this, I use df.groupby(['vehicle_id', 'day']), and then .apply(algorithm) or .filter(condition), where algorithm and condition are functions that take a DataFrame.

I would like the full processing of my dataset (which involves several .apply and .filter steps) to be written in a declarative style, as opposed to an imperative loop over the groups, so that it looks something like this:

df.groupby(['vehicle_id', 'day']).apply(algorithm1).filter(condition1).apply(algorithm2).filter(condition2)

Of course, the code above is incorrect, because .apply() and .filter() return new DataFrames, and that is exactly my problem. They merge all the data back into a single DataFrame, so I find myself calling .groupby(['vehicle_id', 'day']) over and over.

Is there a good way that I can write this without having to group the same columns over and over?

1 answer

Since apply uses a for loop internally anyway (so there are no sophisticated optimizations behind the scenes), I suggest using an actual for loop:

    arr = []
    for key, dfg in df.groupby(['vehicle_id', 'day']):
        dfg = dfg.do_stuff1()  # Perform all needed operations
        dfg = do_stuff2(dfg)
        arr.append(dfg)
    result = pd.concat(arr)

An alternative is to write one function that runs all of the applies and filters sequentially on a given DataFrame, and then pass that function to a single groupby / apply:

    def all_operations(dfg):
        # Do stuff
        return result_df

    result = df.groupby(['vehicle_id', 'day']).apply(all_operations)
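To make the single-function approach read more declaratively, the steps can be listed up front and chained into one per-group function. The following is only a sketch under assumptions: the helper compose, the step functions, and the columns x and y are hypothetical and not part of the original question. Depending on the pandas version, the grouping columns may or may not be included in the frame passed to the function, so the steps here only touch x and y.

```python
import pandas as pd

def compose(*steps):
    """Chain ("apply", func) and ("filter", predicate) steps into one per-group function."""
    def run(dfg):
        for kind, func in steps:
            if kind == "filter":
                if not func(dfg):
                    # Drop the group by returning an empty frame with the same columns
                    return dfg.iloc[0:0]
            else:
                dfg = func(dfg)
        return dfg
    return run

# Hypothetical steps, standing in for algorithm1 / condition1 / condition2
def has_enough_points(dfg):
    return len(dfg) >= 2

def center_x(dfg):
    return dfg.assign(x=dfg["x"] - dfg["x"].mean())

def stays_in_range(dfg):
    return dfg["y"].max() < 100

df = pd.DataFrame({
    "vehicle_id": [1, 1, 1, 2, 2, 3],
    "day": ["mon", "mon", "mon", "mon", "mon", "tue"],
    "x": [0.0, 2.0, 4.0, 10.0, 20.0, 5.0],
    "y": [1.0, 2.0, 3.0, 50.0, 150.0, 7.0],
})

process = compose(
    ("filter", has_enough_points),
    ("apply", center_x),
    ("filter", stays_in_range),
)

# Group once; every step runs inside the same per-group call
result = df.groupby(["vehicle_id", "day"], group_keys=False).apply(process)
```

The list of steps passed to compose plays the role of the chained .apply(...).filter(...) calls from the question, while the grouping itself happens only once.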

In both cases, you will have to handle the cases where a filter returns an empty DataFrame, if such cases can occur.
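One possible way to handle empty results in the loop variant is to skip empty groups and fall back to an empty frame when nothing survives, since pd.concat raises a ValueError when given an empty list. This is a minimal sketch; process_groups and keep_long_tracks are hypothetical names introduced for illustration.

```python
import pandas as pd

def process_groups(df, keys, func):
    """Run func once per group and concatenate only the non-empty results."""
    pieces = []
    for _, dfg in df.groupby(keys):
        out = func(dfg)
        if out is not None and not out.empty:
            pieces.append(out)
    if not pieces:
        # pd.concat([]) raises ValueError, so return an empty frame with df's columns
        return df.iloc[0:0]
    return pd.concat(pieces)

def keep_long_tracks(dfg):
    # Hypothetical combined step: keep groups with at least two rows
    return dfg if len(dfg) >= 2 else dfg.iloc[0:0]

df = pd.DataFrame({
    "vehicle_id": [1, 1, 2],
    "day": ["mon", "mon", "mon"],
    "x": [0.0, 1.0, 5.0],
})

kept = process_groups(df, ["vehicle_id", "day"], keep_long_tracks)
none_kept = process_groups(df.iloc[2:], ["vehicle_id", "day"], keep_long_tracks)
```

Here kept contains only the rows of vehicle 1, and none_kept is an empty DataFrame that still has the original columns, so downstream code does not need a special case.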


Source: https://habr.com/ru/post/1269294/

