Using Python 3.6 and Pandas 0.19.2:
I have a DataFrame containing parsed transaction log files. Each row has a timestamp, contains a transaction, and can either represent the beginning or end of a transaction (therefore, each transaction has 1 line for the beginning and 1 line for the end).
Additional information may also be present on each end line.
I would like to highlight the duration of each transaction by subtracting the end date using startdate and storing additional information.
Input Example:
import pandas as pd
import io
df = pd.read_csv(io.StringIO('''transactionid;event;datetime;info
1;START;2017-04-01 00:00:00;
1;END;2017-04-01 00:00:02;foo1
2;START;2017-04-01 00:00:02;
3;START;2017-04-01 00:00:02;
2;END;2017-04-01 00:00:03;foo2
4;START;2017-04-01 00:00:03;
3;END;2017-04-01 00:00:03;foo3
4;END;2017-04-01 00:00:04;foo4'''), sep=';', parse_dates=['datetime'])
Which gives the following DataFrame:
transactionid event datetime info
0 1 START 2017-04-01 00:00:00 NaN
1 1 END 2017-04-01 00:00:02 foo1
2 2 START 2017-04-01 00:00:02 NaN
3 3 START 2017-04-01 00:00:02 NaN
4 2 END 2017-04-01 00:00:03 foo2
5 4 START 2017-04-01 00:00:03 NaN
6 3 END 2017-04-01 00:00:03 foo3
7 4 END 2017-04-01 00:00:04 foo4
Expected Result:
A new data frame, such as:
transactionid start_date end_date duration info
0 1 2017-04-01 00:00:00 2017-04-01 00:00:02 00:00:02 foo1
1 2 2017-04-01 00:00:02 2017-04-01 00:00:03 00:00:01 foo2
2 3 2017-04-01 00:00:02 2017-04-01 00:00:03 00:00:01 foo3
3 4 2017-04-01 00:00:03 2017-04-01 00:00:04 00:00:01 foo4
What I tried:
, .groupby(by='transactionid')
. , "" .