Pandas Casting ISO String to datetime64

I want to quickly convert roughly 10-20M ISO datetime strings with microsecond precision to datetime64 for use as a DataFrame index in pandas.

I am on pandas 0.9 and have tried the solutions suggested on git, but the conversion either takes 20-30 minutes or never finishes.

I think I found the problem. Compare the speed of the two:

    from pandas import date_range, to_datetime

    rng = date_range('1/1/2000', periods=2000000, freq='ms')
    strings = [x.strftime('%Y-%m-%d %H:%M:%S.%f') for x in rng]
    timeit to_datetime(strings)  # IPython magic

On my laptop ~ 300 ms.

    rng = date_range('1/1/2000', periods=2000000, freq='ms')
    strings = [x.strftime('%Y%m%dT%H%M%S.%f') for x in rng]
    timeit to_datetime(strings)

On my laptop, forever and a day.

For now I'm probably just going to change the C++ code that generates the timestamps so that it writes them in the more verbose ISO form, because looping through and fixing the format of tens of millions of stamps in Python is probably pretty slow...
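For illustration, the "loop and fix the format" option mentioned above could look roughly like the sketch below. It assumes every stamp has exactly the compact layout YYYYMMDDTHHMMSS.ffffff; the helper name to_extended_iso is hypothetical and not part of pandas.

```python
def to_extended_iso(stamp):
    """Rewrite 'YYYYMMDDTHHMMSS.ffffff' as 'YYYY-MM-DD HH:MM:SS.ffffff'."""
    # Fixed-width slicing; assumes the compact layout described above.
    # stamp[8] is the 'T' separator and is dropped in favor of a space.
    return "%s-%s-%s %s:%s:%s" % (
        stamp[0:4], stamp[4:6], stamp[6:8],       # year, month, day
        stamp[9:11], stamp[11:13], stamp[13:],    # hour, minute, seconds.fraction
    )

compact = ["20000101T000000.000000", "20000101T000000.001000"]
extended = [to_extended_iso(s) for s in compact]
print(extended)
```

The rewritten strings are in the dashed-and-colon form used in the first timing above, so they could then be passed to to_datetime. Whether the extra Python loop is acceptable for tens of millions of stamps would need to be timed.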

1 answer

The fast-path parsing code only handles standard ISO-8601 with dashes and colons, so, as you can see, it is very fast when the strings are in the right format. The code is on GitHub and could definitely be improved to handle more cases (preferably without slowing down the standard format too much).

As a partially satisfactory workaround, you can use datetime.strptime to convert the strings to datetime.datetime objects, then pass the result to to_datetime:

    In [4]: paste
    rng = date_range('1/1/2000', periods=2000000, freq='ms')
    strings = [x.strftime('%Y%m%dT%H%M%S.%f') for x in rng]
    ## -- End pasted text --

    In [5]: iso_strings = [x.strftime('%Y-%m-%d %H:%M:%S.%f') for x in rng]

    In [6]: %timeit result = to_datetime(iso_strings)
    1 loops, best of 3: 479 ms per loop

    In [7]: f = lambda x: datetime.strptime(x, '%Y%m%dT%H%M%S.%f')

    In [8]: f(strings[0])
    Out[8]: datetime.datetime(2000, 1, 1, 0, 0)

    In [9]: %time result = to_datetime(map(f, strings))
    CPU times: user 48.47 s, sys: 0.01 s, total: 48.48 s
    Wall time: 48.54 s

That is a 100x difference (48.5 s versus 479 ms), but a lot better than being 1000x or more slower. I am sure that to_datetime could be improved to use a C version of strptime, which would be much faster. That exercise is left to the reader, I suppose.
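As a rough sketch of another pure-Python variation on the datetime.strptime workaround, the fields can be sliced out by position and handed straight to the datetime constructor. This is not the C-level improvement mentioned above, and it assumes the fixed layout YYYYMMDDTHHMMSS.ffffff with exactly six fractional digits. The name parse_compact is hypothetical, and any speed gain over strptime is something to measure rather than assume.

```python
from datetime import datetime

def parse_compact(stamp):
    """Parse 'YYYYMMDDTHHMMSS.ffffff' by fixed-width slicing."""
    # Assumes exactly six fractional digits, as produced by strftime('%f').
    return datetime(
        int(stamp[0:4]), int(stamp[4:6]), int(stamp[6:8]),        # date
        int(stamp[9:11]), int(stamp[11:13]), int(stamp[13:15]),   # time
        int(stamp[16:22]),                                        # microseconds
    )

sample = "20000101T000000.001000"
# Cross-check against the strptime-based lambda from the session above.
assert parse_compact(sample) == datetime.strptime(sample, "%Y%m%dT%H%M%S.%f")
```

The resulting list of datetime.datetime objects could be passed to to_datetime in the same way as the output of map(f, strings).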

A todo for someday: http://github.com/pydata/pandas/issues/2213


Source: https://habr.com/ru/post/1442914/

