Pandas dataframe split into sessions

Question

This is an extension of an earlier question of mine.

To keep it simple, suppose I have a pandas DataFrame as shown below.

    import pandas as pd

    df = pd.DataFrame([[1.1, 1.1, 2.5, 2.6, 2.5, 3.4, 2.6, 2.6, 3.4],
                       list('AAABBBBAB'),
                       [1.1, 1.7, 2.5, 2.6, 3.3, 3.8, 4.0, 4.2, 4.3]]).T
    df.columns = ['col1', 'col2', 'col3']

dataframe:

      col1 col2 col3
    0  1.1    A  1.1
    1  1.1    A  1.7
    2  2.5    A  2.5
    3  2.6    B  2.6
    4  2.5    B  3.3
    5  3.4    B  3.8
    6  2.6    B    4
    7  2.6    A  4.2
    8  3.4    B  4.3

I want to group the rows into sessions based on some conditions. The logic depends on the values of col1 and col2 and on the cumulative difference in col3:

Go to col1 and find other occurrences of the same value.
In my case, the first value of col1 is 1.1, and the same value appears again in the second row.
Then check the col2 values; if they are the same, get the cumulative difference in col3.
If the cumulative difference is greater than 0.5, mark this as a new session.
If the col1 values are the same but the col2 values are different, mark them as a new session.

Expected output:

      col1 col2 col3  session
    0  1.1    A  1.1        0
    1  1.1    A  1.7        1
    2  2.5    A  2.5        2
    3  2.6    B  2.6        4
    4  2.5    B  3.3        3
    5  3.4    B  3.8        7
    6  2.6    B    4        5
    7  2.6    A  4.2        6
    8  3.4    B  4.3        7

python pandas dataframe

Nilani algiriyage Jul 10 '13 at 10:38

1 answer

Andy hayden · Accepted Answer · 2013-07-10T11:32:19+0000

As in the excellent answer you linked to ;), first create the session number:

    In [11]: g = df.groupby(['col1', 'col2'])

    In [12]: df['session_number'] = g['col3'].apply(lambda s: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False))
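For readers who want to run this step on its own, here is a minimal sketch of the same idea. It is not the code from the session above: it assumes the data is built from a dict (so col1 and col3 are numeric rather than object dtype), it uses `transform` so that the result is aligned with the original row index, and it replaces `s - s.shift(1)` with the equivalent `s.diff()`. The helper name `number_sessions` and its `gap` parameter are made up for illustration.

```python
import pandas as pd

# Same values as in the question, built column-wise so dtypes are numeric.
df = pd.DataFrame({
    'col1': [1.1, 1.1, 2.5, 2.6, 2.5, 3.4, 2.6, 2.6, 3.4],
    'col2': list('AAABBBBAB'),
    'col3': [1.1, 1.7, 2.5, 2.6, 3.3, 3.8, 4.0, 4.2, 4.3],
})


def number_sessions(s, gap=0.5):
    """Hypothetical helper: start a new session whenever the step from the
    previous col3 value in the same group is larger than `gap`."""
    # diff() is NaN for the first row of a group; NaN > gap is False,
    # so every group starts at session 0.
    return (s.diff() > gap).cumsum()


# transform returns one value per input row, in the original row order.
df['session_number'] = (
    df.groupby(['col1', 'col2'])['col3'].transform(number_sessions)
)
print(df)
```

The comparison is strict, so the two (3.4, B) rows, whose col3 values are 0.5 apart, stay in the same session, as in the output shown in this answer.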

Then I think you want to set_index on these columns; this may be enough for many use cases (although it might be worth doing a sort):

    In [13]: df1 = df.set_index(['col1', 'col2', 'session_number'])

    In [14]: df1
    Out[14]:
                             col3
    col1 col2 session_number
    1.1  A    0               1.1
              1               1.7
    2.5  A    0               2.5
    2.6  B    0               2.6
    2.5  B    0               3.3
    3.4  B    0               3.8
    2.6  B    1                 4
         A    0               4.2
    3.4  B    0               4.3
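As a rough illustration of the sort mentioned above, the sketch below sets the three-level index and then calls `sort_index()` so that rows sharing the same col1 and col2 end up next to each other. It assumes the session_number values have already been computed, and simply copies them from the output shown here; the name `df1_sorted` is only for this example.

```python
import pandas as pd

# Data from the question, plus the session_number values from the output above.
df = pd.DataFrame({
    'col1': [1.1, 1.1, 2.5, 2.6, 2.5, 3.4, 2.6, 2.6, 3.4],
    'col2': list('AAABBBBAB'),
    'col3': [1.1, 1.7, 2.5, 2.6, 3.3, 3.8, 4.0, 4.2, 4.3],
    'session_number': [0, 1, 0, 0, 0, 0, 1, 0, 0],
})

df1 = df.set_index(['col1', 'col2', 'session_number'])

# Sort lexicographically by (col1, col2, session_number).
df1_sorted = df1.sort_index()
print(df1_sorted)
```

With the index sorted, each session can be selected by its key, for example `df1_sorted.loc[(2.6, 'B', 1)]`.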

If you really want to, you can pull out a single overall session number:

    In [15]: g1 = df.groupby(['col1', 'col2', 'session_number'])  # I think there is a slightly neater way, but I forget..

    In [16]: df1['session'] = g1.apply(lambda x: 1).cumsum()  # could -1 here if it matters

    In [17]: df1
    Out[17]:
                             col3  session
    col1 col2 session_number
    1.1  A    0               1.1        1
              1               1.7        2
    2.5  A    0               2.5        3
    2.6  B    0               2.6        6
    2.5  B    0               3.3        4
    3.4  B    0               3.8        8
    2.6  B    1                 4        7
         A    0               4.2        5
    3.4  B    0               4.3        8
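On newer pandas releases, one possibly neater way is `GroupBy.ngroup()`, which was not available when this answer was written; whether it exists depends on your installed version. The sketch below assumes session_number is already present (the values are copied from the output above) and numbers the groups from 0 in sorted key order, so the result should equal the session column above minus one.

```python
import pandas as pd

# Data from the question, plus the session_number values from the output above.
df = pd.DataFrame({
    'col1': [1.1, 1.1, 2.5, 2.6, 2.5, 3.4, 2.6, 2.6, 3.4],
    'col2': list('AAABBBBAB'),
    'col3': [1.1, 1.7, 2.5, 2.6, 3.3, 3.8, 4.0, 4.2, 4.3],
    'session_number': [0, 1, 0, 0, 0, 0, 1, 0, 0],
})

# ngroup() labels every row with the 0-based position of its
# (col1, col2, session_number) group among the sorted group keys.
df['session'] = df.groupby(['col1', 'col2', 'session_number']).ngroup()
print(df)
```

Because the numbering follows the sorted keys, (2.6, A) is numbered before (2.6, B), exactly as in the cumsum approach above.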

If you want this in columns (as in your question), use reset_index, and you could then delete the session_number column:

    In [18]: df1.reset_index()
    Out[18]:
       col1 col2  session_number col3  session
    0   1.1    A               0  1.1        1
    1   1.1    A               1  1.7        2
    2   2.5    A               0  2.5        3
    3   2.6    B               0  2.6        6
    4   2.5    B               0  3.3        4
    5   3.4    B               0  3.8        8
    6   2.6    B               1    4        7
    7   2.6    A               0  4.2        5
    8   3.4    B               0  4.3        8
