Pandas: a complex condition in datetime

I have a dataframe with a datetime column and a float column.

                 date   value
0 2010-01-01 01:23:00    21.2
1 2010-01-02 01:33:00    63.4
2 2010-01-03 06:02:00    80.6
3 2010-01-04 06:05:00    50.1
4 2010-01-05 06:20:00   346.5
5 2010-01-06 07:44:00   111.8
6 2010-01-07 08:00:00   113.1
7 2010-01-08 08:22:00    10.6
8 2010-01-09 09:00:00   287.2
9 2010-01-10 09:14:00  1652.6

I want to create a new column that records, for each row, the average of the values from the hour before that row's timestamp.

[UPDATE] Example :

If the current iteration is at row 4 (2010-01-05 06:20:00, 346.5), I need to calculate (50.1 + 80.6) / 2, that is, the average of the values that fall in the range 2010-01-05 05:20:00 to 2010-01-05 06:20:00.

                 date   value  before_1hr_mean
4 2010-01-05 06:20:00   346.5            65.35

I solve this with iterrows(), as in the following code. But this approach is very slow, and iterrows() is usually not recommended in pandas.

[UPDATE]

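import numpy as np
import pandas as pd

df['before_1hr_mean'] = np.nan
for index, row in df.iterrows():
    # Rows in [row time - 1 hour, row time), which excludes the current row
    in_window = (df['date'] < row['date']) & (df['date'] >= row['date'] - pd.Timedelta(hours=1))
    df.loc[index, 'before_1hr_mean'] = df.loc[in_window, 'value'].mean()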

Is there a better way to handle this situation?

1 answer

I took the liberty of changing your data so that every row falls on the same day. That is the only way I could make sense of your question.

df.join(
    df.set_index('date').value.rolling('H').mean().rename('before_1hr_mean'),
    on='date'
)

                 date   value  before_1hr_mean
0 2010-01-01 01:23:00    21.2        21.200000
1 2010-01-01 01:33:00    63.4        42.300000
2 2010-01-01 06:02:00    80.6        80.600000
3 2010-01-01 06:05:00    50.1        65.350000
4 2010-01-01 06:20:00   346.5       159.066667
5 2010-01-01 07:44:00   111.8       111.800000
6 2010-01-01 08:00:00   113.1       112.450000
7 2010-01-01 08:22:00    10.6        78.500000
8 2010-01-01 09:00:00   287.2       148.900000
9 2010-01-01 09:14:00  1652.6       650.133333
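
To reproduce the output above, the modified frame can be rebuilt along these lines. This is only a sketch: the construction code and the names times, inclusive and out are assumptions of this example, with the values copied from the table. The window is written as '1h' because recent pandas versions deprecate the uppercase 'H' alias.

```python
import pandas as pd

# Timestamps and values copied from the table above (all moved to 2010-01-01)
times = ['01:23', '01:33', '06:02', '06:05', '06:20',
         '07:44', '08:00', '08:22', '09:00', '09:14']
df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01 ' + t for t in times]),
    'value': [21.2, 63.4, 80.6, 50.1, 346.5, 111.8, 113.1, 10.6, 287.2, 1652.6],
})

# The default time-based window is (t - 1h, t], so the current row is included
inclusive = df.set_index('date')['value'].rolling('1h').mean().rename('before_1hr_mean')
out = df.join(inclusive, on='date')
print(out)
```

Because the current row sits inside its own window, row 4 averages 80.6, 50.1 and 346.5 rather than only the two earlier values.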

If you want to exclude the current row, you need to track the rolling sum and the rolling count over the hour, then recompute the mean after removing the current value.

s = df.set_index('date')
sagg = s.rolling('H').agg(['sum', 'count']).value.rename(columns=str.title)
agged = df.join(sagg, on='date')
agged

                 date   value     Sum  Count
0 2010-01-01 01:23:00    21.2    21.2    1.0
1 2010-01-01 01:33:00    63.4    84.6    2.0
2 2010-01-01 06:02:00    80.6    80.6    1.0
3 2010-01-01 06:05:00    50.1   130.7    2.0
4 2010-01-01 06:20:00   346.5   477.2    3.0
5 2010-01-01 07:44:00   111.8   111.8    1.0
6 2010-01-01 08:00:00   113.1   224.9    2.0
7 2010-01-01 08:22:00    10.6   235.5    3.0
8 2010-01-01 09:00:00   287.2   297.8    2.0
9 2010-01-01 09:14:00  1652.6  1950.4    3.0

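Then subtract the current value from the sum, divide by the count minus one, and assign the result as a new column. For row 4 that is (477.2 - 346.5) / (3 - 1) = 65.35. A row that is alone in its window has nothing before it, so the division is 0 / 0 and shows up as NaN.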

df.assign(before_1hr_mean=agged.eval('(Sum - value) / (Count - 1)'))

                 date   value  before_1hr_mean
0 2010-01-01 01:23:00    21.2              NaN
1 2010-01-01 01:33:00    63.4            21.20
2 2010-01-01 06:02:00    80.6              NaN
3 2010-01-01 06:05:00    50.1            80.60
4 2010-01-01 06:20:00   346.5            65.35
5 2010-01-01 07:44:00   111.8              NaN
6 2010-01-01 08:00:00   113.1           111.80
7 2010-01-01 08:22:00    10.6           112.45
8 2010-01-01 09:00:00   287.2            10.60
9 2010-01-01 09:14:00  1652.6           148.90
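
A possible shortcut, not part of the answer above, is the closed parameter of rolling. With closed='left' each window becomes [t - 1h, t), so the current row is left out without the sum and count bookkeeping. The sketch below assumes the same-day data from the tables and a 'date' column that is already sorted; the name prior_mean is hypothetical.

```python
import pandas as pd

# Same-day data copied from the tables above
times = ['01:23', '01:33', '06:02', '06:05', '06:20',
         '07:44', '08:00', '08:22', '09:00', '09:14']
df = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01 ' + t for t in times]),
    'value': [21.2, 63.4, 80.6, 50.1, 346.5, 111.8, 113.1, 10.6, 287.2, 1652.6],
})

# closed='left' makes each window [t - 1h, t): the current row is excluded,
# and a row exactly one hour earlier is kept, like the >= in the question's loop
prior_mean = df.set_index('date')['value'].rolling('1h', closed='left').mean()

# Rows keep their order, so the values can be assigned back positionally
df['before_1hr_mean'] = prior_mean.to_numpy()
print(df)
```

One difference to be aware of: the sum and count approach uses the default window (t - 1h, t], which drops a row that lies exactly one hour back. Row 8 (09:00) therefore ignores the 08:00 value there and gives 10.60, whereas the question's condition (date >= row date - 1 hour) and closed='left' both include it and give (113.1 + 10.6) / 2 = 61.85.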


Source: https://habr.com/ru/post/1675758/

