Does indexing a piece of pandas improve data?

I have a pandas dataframe containing over a million records. One of its columns is the date and time. A sample of my data is as follows:

time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
...

I need to effectively get a report for a certain period. The following naive way is very laborious.

new_df = df[(df["time"] > start_time) & (df["time"] < end_time)]

I know that in DBMSs such as MySQL, time indexing is effective for retrieving records by specifying a time period.

My question

  • Indexing pandas, for example df.index = df.time, speeds up the slicing process?
  • If the answer of Q1 is “No,” then what is the general effective way to get a record for a certain period of time in pandas?
+4
source share
2 answers

1 . Pandas .

df = pd.DataFrame(np.random.randn(1000000, 3), 
                  columns=list('ABC'), 
                  index=pd.DatetimeIndex(start='2015-1-1', freq='10s', periods=1000000))

( v. 0.14.1 ( ...) 0.17.1):

%timeit df2 = df['2015-2-1':'2015-3-1']
1000 loops, best of 3: 459 µs per loop (v. 0.14.1)
1000 loops, best of 3: 664 µs per loop (v. 0.17.1)

%timeit df2 = df.ix['2015-2-1':'2015-3-1']
1000 loops, best of 3: 469 µs per loop (v. 0.14.1)
1000 loops, best of 3: 662 µs per loop (v. 0.17.1)

%timeit df2 = df.loc[(df.index >= '2015-2-1') & (df.index <= '2015-3-1'), :]
100 loops, best of 3: 8.86 ms per loop (v. 0.14.1)
100 loops, best of 3: 9.28 ms per loop (v. 0.17.1)

%timeit df2 = df.loc['2015-2-1':'2015-3-1', :]
1 loops, best of 3: 341 ms per loop (v. 0.14.1)
1000 loops, best of 3: 677 µs per loop (v. 0.17.1)

Datetime :

df.reset_index(inplace=True)

%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1')]
100 loops, best of 3: 12.6 ms per loop (v. 0.14.1)
100 loops, best of 3: 13 ms per loop (v. 0.17.1)

%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1'), :]
100 loops, best of 3: 12.8 ms per loop (v. 0.14.1)
100 loops, best of 3: 12.7 ms per loop (v. 0.17.1)

:

>>> df2.shape
(250560, 3)

, , , Pandas.

+7

, , , , , . - .

timedata.txt (extended from your example):

time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
2015-05-01 10:00:05,112,223,335
2015-05-01 10:00:08,112,223,336
2015-05-01 10:00:13,112,223,337
2015-05-01 10:00:21,112,223,338

df = pd.read_csv('timedata.txt')
df.time = pd.to_datetime(df.time)
df = df.set_index('time')
print(df['2015-05-01 10:00:02':'2015-05-01 10:00:14'])

                       x    y    z
time                              
2015-05-01 10:00:03  112  223  334
2015-05-01 10:00:05  112  223  335
2015-05-01 10:00:08  112  223  336
2015-05-01 10:00:13  112  223  337

, , , , , .

, datetime, . , , .

+2

Source: https://habr.com/ru/post/1618225/


All Articles