Do I need to iterate over every row of data to calculate the time spent in each category?

I have a dataset in Python that looks like the table below.

The data comes from observing what our robot is doing in our maze / arena. We log a timestamp for each event, so the timestamps are event-driven, not periodic.

I need to find the time spent in each arena efficiently.

TimeStamp   Arena
101         Arena A
109         Arena A
112         Arena B
113         Arena A
118         Arena A
120         Arena D
125         Arena D
129         Arena D
138         Arena B
139         Arena B
148         Arena C
149         Arena C
150         Arena B
151         Arena B
159         Arena D
169         Arena D
171         Arena D
172         Arena D
175         Arena B
177         Arena B
180         Arena B
181         Arena A
182         Arena A
189         Arena E
200         Arena E
204         Arena E
208         Arena A
209         Arena A

I need to get the output below: the total time spent in each arena.

 Arena  TimeStamp
Arena D         32
Arena B         23
Arena E         22
Arena A         16
Arena C         10

I wrote a simple script that does this right now:

import pandas as pd

data = pd.read_csv('arenas_visited.csv')


l = len(data)
first_arena = data.loc[0, 'Arena']
start_time = data.loc[0, 'TimeStamp']

summary = []

for i in range(l):
    try:
        next_arena = data.loc[i + 1, 'Arena']
    except KeyError:
        break

    first_arena = data.loc[i, 'Arena']

    if first_arena != next_arena:
        change_time = data.loc[i, 'TimeStamp']
        time_spent = change_time - start_time
        arena = str(data.loc[i, 'Arena'])
        summary.append([arena, time_spent])
        start_time = change_time
        first_arena = data.loc[i + 1, 'Arena']

        if i == l - 2:
            if data.loc[i, 'Arena'] != data.loc[i + 1, 'Arena']:
                time_spent = 1
                arena = str(data.loc[i + 1, 'Arena'])
                print(str(time_spent) + " spent in " + arena)
                summary.append([arena, time_spent])

aggregated = pd.DataFrame(summary, columns=['Arena', 'TimeStamp'])
time_per_arena = (aggregated.groupby('Arena').sum()
                            .sort_values('TimeStamp', ascending=False)
                            .reset_index())
print(time_per_arena)

This works quite well. However, I will eventually have literally millions of rows of this data, and I need to figure out how to do it faster.


You can do this with vectorized pandas operations: take the difference between consecutive timestamps with shift, then sum the deltas per arena:

df['delta'] = df.TimeStamp - df.TimeStamp.shift()

df.groupby('Arena').delta.sum()
Out[62]: 
Arena
Arena_A    21.0
Arena_B    23.0
Arena_C    10.0
Arena_D    32.0
Arena_E    22.0
Name: delta, dtype: float64
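For reference, here is a self-contained version of this approach using the sample data from the question (column names assumed to match the CSV; the first row's delta is NaN, which sum() treats as 0):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'TimeStamp': [101, 109, 112, 113, 118, 120, 125, 129, 138, 139,
                  148, 149, 150, 151, 159, 169, 171, 172, 175, 177,
                  180, 181, 182, 189, 200, 204, 208, 209],
    'Arena': ['Arena A', 'Arena A', 'Arena B', 'Arena A', 'Arena A',
              'Arena D', 'Arena D', 'Arena D', 'Arena B', 'Arena B',
              'Arena C', 'Arena C', 'Arena B', 'Arena B', 'Arena D',
              'Arena D', 'Arena D', 'Arena D', 'Arena B', 'Arena B',
              'Arena B', 'Arena A', 'Arena A', 'Arena E', 'Arena E',
              'Arena E', 'Arena A', 'Arena A'],
})

# Delta between consecutive timestamps, credited to the current row's arena.
df['delta'] = df.TimeStamp - df.TimeStamp.shift()
totals = df.groupby('Arena').delta.sum().sort_values(ascending=False)
print(totals)
```

Note that this credits each interval to the arena recorded at the *end* of the interval, which is why Arena A comes out as 21 here rather than the 16 in the question's expected table.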

Plain Python can also do this in a single pass over the data. Something like:

result = {}
old_arena = None
old_timestamp = 0
# I don't have a lot of experience with pandas, so you may need to massage the
# input to be able to do this
for line in data:
    timestamp, _, arena = line.split()
    if arena == old_arena:
        continue
    timestamp = int(timestamp)
    try:
        result[old_arena] += timestamp - old_timestamp
    except KeyError:
        result[old_arena] = timestamp - old_timestamp

    old_arena = arena
    old_timestamp = timestamp

# Process the last interval - if the last line caused a change, then
# old_timestamp will equal timestamp and this adds nothing
result[old_arena] += int(timestamp) - old_timestamp

This is O(n) in time and O(n + k) in memory, where k is the number of distinct arenas.

For your data this gives (note the extra None entry left over from the initial state):

{'A': 27, 'C': 2, 'B': 26, 'E': 19, 'D': 34, None: 101}

Note: each interval is credited to old_arena, i.e., to the arena the robot is leaving rather than the one it is entering.
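If the rows are already in a pandas DataFrame, as in the question, the same single-pass idea can consume itertuples instead of splitting raw text lines. A minimal sketch on a shortened sample (dict.get replaces the try/except, and the None key from the initial state is dropped at the end):

```python
import pandas as pd

# Shortened sample of the question's data (assumption: the real data
# has the same two columns).
df = pd.DataFrame({'TimeStamp': [101, 109, 112, 113],
                   'Arena': ['Arena A', 'Arena A', 'Arena B', 'Arena A']})

result = {}
old_arena = None
old_timestamp = 0
for row in df.itertuples(index=False):
    timestamp, arena = row.TimeStamp, row.Arena
    if arena == old_arena:
        continue
    # Credit the elapsed time to the arena we are leaving.
    result[old_arena] = result.get(old_arena, 0) + timestamp - old_timestamp
    old_arena = arena
    old_timestamp = timestamp

# Close out the last interval.
result[old_arena] = result.get(old_arena, 0) + timestamp - old_timestamp
result.pop(None, None)  # drop the bookkeeping entry from the initial state
print(result)
```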


If you want the intervals credited the other way around, matching the pandas output above, run the same loop over the data in reverse:

result = {}
old_arena = None
old_timestamp = 0
# I don't have a lot of experience with pandas, so you may need to massage the
# input to be able to do this
for line in reversed(data):
    timestamp, _, arena = line.split()
    if arena == old_arena:
        continue
    timestamp = int(timestamp)
    try:
        result[old_arena] += old_timestamp - timestamp
    except KeyError:
        result[old_arena] = old_timestamp - timestamp

    old_arena = arena
    old_timestamp = timestamp

# Process the last interval - if the last line caused a change, then
# old_timestamp will equal timestamp and this adds nothing
result[old_arena] += old_timestamp - int(timestamp)

This gives:

{'A': 21, 'C': 10, 'B': 23, 'E': 22, 'D': 32, None: -209}
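To present that dict in the same shape as the question's expected output, drop the None entry and sort by time spent (a small sketch starting from the values above):

```python
# Result of the reversed loop, as printed above.
result = {'A': 21, 'C': 10, 'B': 23, 'E': 22, 'D': 32, None: -209}

result.pop(None, None)  # discard the artificial initial-state entry
totals = sorted(result.items(), key=lambda kv: kv[1], reverse=True)
for arena, t in totals:
    print(arena, t)
```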

Source: https://habr.com/ru/post/1660530/

