Pandas Speed vs SQL

I hear different opinions about when to use Pandas and when to use SQL.

I tried to do the following in Pandas on 19,150,869 rows of data:

    for idx, row in df.iterrows():
        tmp = int(int(row['M']) / PeriodGranularity) + 1
        row['TimeSlot'] = str(row["D"] + 1) + "-" + str(row["H"]) + "-" + str(tmp)

It turned out to be so slow that I had to cancel it after 20 minutes.
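
(As an aside, the row that iterrows yields is a copy, so assigning to row['TimeSlot'] generally does not modify df at all. A write-back variant, just as slow and shown only as a sketch, would use df.at; PeriodGranularity = 15 is assumed here to match the SQL below:)

    for idx, row in df.iterrows():
        # slot index within the hour; PeriodGranularity assumed to be 15
        tmp = int(int(row['M']) / PeriodGranularity) + 1
        # df.at writes back into the DataFrame; assigning into `row` would not
        df.at[idx, 'TimeSlot'] = str(row["D"] + 1) + "-" + str(row["H"]) + "-" + str(tmp)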

In SQLite, I did the following:

    SELECT strftime('%w', PlayedTimestamp) + 1 AS D,
           strftime('%H', PlayedTimestamp) AS H,
           strftime('%M', PlayedTimestamp) AS M,
           cast(strftime('%M', PlayedTimestamp) / 15 + 1 AS int) AS TimeSlot
    FROM tblMain

and found that it took about 4 seconds ("19150869 lines returned in 2445 ms").
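
(For a closer apples-to-apples comparison, the same D-H-slot label string could also be built entirely in SQLite with the || concatenation operator; a sketch, reusing the column expressions above:)

    SELECT (strftime('%w', PlayedTimestamp) + 1)
           || '-' || strftime('%H', PlayedTimestamp)
           || '-' || (strftime('%M', PlayedTimestamp) / 15 + 1) AS TimeSlot
    FROM tblMain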

Note: for the Pandas code, I ran this step beforehand to get the data from the db:

    sqlStr = "Select strftime('%w',PlayedTimestamp)+1 as D,strftime('%H',PlayedTimestamp) as H,strftime('%M',PlayedTimestamp) as M from tblMain"
    df = pd.read_sql_query(sqlStr, con)
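
(For completeness, a minimal sketch of the full load step, assuming a local SQLite file; the filename is a placeholder:)

    import sqlite3
    import pandas as pd

    # placeholder path to the SQLite database file
    con = sqlite3.connect("played_tracks.db")

    sqlStr = ("Select strftime('%w',PlayedTimestamp)+1 as D,"
              "strftime('%H',PlayedTimestamp) as H,"
              "strftime('%M',PlayedTimestamp) as M from tblMain")
    df = pd.read_sql_query(sqlStr, con)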

Is my coding to blame here, or is it generally accepted that SQL is much faster for certain tasks?

1 answer

It seems you can use a vectorized solution (PeriodGranularity is some variable):

    df['TimeSlot'] = (
        (df["D"] + 1).astype(str) + "-"
        + df["H"].astype(str) + "-"
        + ((df['M'].astype(int) / PeriodGranularity).astype(int) + 1).astype(str)
    )

And to convert the datetimes to strings, use strftime.
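
(For example, if PlayedTimestamp were loaded as a real datetime column - an assumption, since the query above already returns D/H/M - the string formatting could stay vectorized with Series.dt.strftime:)

    # assumes df has a raw PlayedTimestamp column loaded from the database
    df['PlayedTimestamp'] = pd.to_datetime(df['PlayedTimestamp'])
    df['H'] = df['PlayedTimestamp'].dt.strftime('%H')   # hour as string, e.g. '23'
    df['D'] = df['PlayedTimestamp'].dt.strftime('%w')   # day of week as string, Sunday = '0'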

DataFrame.iterrows is really slow - check this out.
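
(A rough way to see the gap on your own machine; the sample data and size here are made up, and absolute timings will vary:)

    import time

    import numpy as np
    import pandas as pd

    PeriodGranularity = 15
    n = 100_000  # small sample; the full 19M rows would take far longer with iterrows
    df = pd.DataFrame({
        'D': np.random.randint(1, 8, n),
        'H': np.random.randint(0, 24, n),
        'M': np.random.randint(0, 60, n),
    })

    # vectorized version
    start = time.perf_counter()
    df['TimeSlot'] = (
        (df['D'] + 1).astype(str) + "-" + df['H'].astype(str) + "-"
        + ((df['M'].astype(int) / PeriodGranularity).astype(int) + 1).astype(str)
    )
    print("vectorized:", time.perf_counter() - start, "s")

    # iterrows version
    start = time.perf_counter()
    out = []
    for idx, row in df.iterrows():
        tmp = int(int(row['M']) / PeriodGranularity) + 1
        out.append(str(row['D'] + 1) + "-" + str(row['H']) + "-" + str(tmp))
    df['TimeSlot_loop'] = out
    print("iterrows:  ", time.perf_counter() - start, "s")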

There is also sample code for users coming from an SQL background in the pandas documentation.

Comparing the two technologies is really complex, and I'm not sure there is any good answer on SO (it would be too broad), but I found this one.


Source: https://habr.com/ru/post/1269099/

