Pandas forward-fill performance issue

I have a DataFrame with a multi-level index (Date, InputTime), and it may contain some NA values in its columns (Value, Id). I want to forward-fill the values, but only within each date, and I have not found an efficient way to do this.

Here is the kind of data I have:

Dataframe example

And here is the result I want:

Dataframe correctly forward-filled by date only

To forward-fill correctly by date, I can use groupby(level=0). The grouping itself is fast, but the fill is then applied to each date group separately, which is far too slow.
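The difference between a plain forward fill and one restricted to each date can be seen on a very small frame. This is only an illustrative sketch: the `toy` frame and its values are made up, not taken from the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two dates, two rows each,
# with a NaN at the start of the second date.
idx = pd.MultiIndex.from_tuples(
    [
        ("2001-10-06", "05:00"),
        ("2001-10-06", "13:00"),
        ("2001-10-07", "05:00"),
        ("2001-10-07", "13:00"),
    ],
    names=["Date", "InputTime"],
)
toy = pd.DataFrame({"Value": [2.0, np.nan, np.nan, 8.0]}, index=idx)

# Plain forward fill: the value from 2001-10-06 leaks into 2001-10-07.
plain = toy.ffill()

# Forward fill restricted to each Date (index level 0):
# the leading NaN on 2001-10-07 stays NaN.
by_date = toy.groupby(level=0).ffill()

print(plain)
print(by_date)
```

The second result is the one wanted here: a missing value at the start of a date stays missing instead of inheriting the previous date's last value.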

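Here is sample code that times a simple forward fill over the whole index (fast, but not the result I want) against a forward fill grouped by date (correct, but slow).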

import numpy as np
import pandas as pd
import datetime as dt

# Show pandas & numpy versions
print('pandas '+pd.__version__)
print('numpy '+np.__version__)

# Build a big list of (Date,InputTime,Value,Id)
listdata = []
d = dt.datetime(2001,10,6,5)
for i in range(0,100000):
    listdata.append((d.date(), d, 2*i if i%3==1 else np.nan, i if i%3==1 else np.nan))
    d = d + dt.timedelta(hours=8)

# Create the dataframe with Date and InputTime as index
df = pd.DataFrame.from_records(listdata, index=['Date','InputTime'], columns=['Date', 'InputTime', 'Value', 'Id'])

# Simple Fill forward on index
start = dt.datetime.now()
for col in df.columns:
    df[col] = df[col].ffill()
end = dt.datetime.now()
print("Time to fill forward on index = " + str((end-start).total_seconds()) + " s")

# Fill forward on Date (first level of index)
start = dt.datetime.now()
for col in df.columns:
    df[col] = df[col].groupby(level=0).ffill()
end = dt.datetime.now()
print("Time to fill forward on Date only = " + str((end-start).total_seconds()) + " s")

Results

Is there a faster way to forward-fill by date only?


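github/jreback: this is a duplicate of #7895. .ffill is not implemented in Cython for groupby operations (though it certainly could be), and instead calls into Python space for each group. url: https://github.com/pandas-dev/pandas/issues/11296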

According to jreback's answer, ffill() is not optimized when used with groupby, but cumsum() is. Try this:

df = df.sort_index()
df.ffill() * (1 - df.isnull().astype(int)).groupby(level=0).cumsum().applymap(lambda x: None if x == 0 else 1)
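The same idea can also be written with a boolean mask instead of a multiplication. This is a sketch rather than the answer's own code: the helper name `ffill_within_level` and the sample frame are hypothetical. It forward-fills globally, then blanks every cell that comes before the first non-null value of its group.

```python
import numpy as np
import pandas as pd

def ffill_within_level(df, level=0):
    """Forward-fill df without crossing group boundaries on an index level."""
    # Sorting makes each group contiguous, which the masking below relies on.
    df = df.sort_index()
    # Per column: True once the group has produced at least one non-null value.
    seen = df.notna().astype(int).groupby(level=level).cumsum() > 0
    # Global forward fill, then reset cells that precede their group's first value.
    return df.ffill().where(seen)

# Hypothetical sample: two groups ("a", "b") with gaps at different positions.
idx = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2, 3]], names=["Date", "InputTime"]
)
sample = pd.DataFrame(
    {
        "Value": [np.nan, 1.0, np.nan, np.nan, np.nan, 4.0],
        "Id": [1.0, np.nan, np.nan, np.nan, 5.0, np.nan],
    },
    index=idx,
)
filled = ffill_within_level(sample)
print(filled)
```

Once a group has a non-null value, any later forward-filled cell in that group can only have taken its value from inside the same group, so masking the cells before that point is enough.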

Source: https://habr.com/ru/post/1610794/

