Dask dataframe: how to convert a column with to_datetime

I am trying to convert one column of my data frame to a date and time. After discussing here https://github.com/dask/dask/issues/863 I tried the following code:

import dask.dataframe as dd
df['time'].map_partitions(pd.to_datetime, columns='time').compute()

But I get the following error message

ValueError: Metadata inference failed, please provide 'meta' keyword

What exactly should I put under the meta? Should I put the dictionary of ALL columns in df or just the column 'time'? and what type should I put? I tried dtype and datetime64, but so far none of them work.

Thank you and I appreciate your guidance,

Update

I will include new error messages here:

1) Using a timestamp

df['trd_exctn_dt'].map_partitions(pd.Timestamp).compute()

TypeError: Cannot convert input to Timestamp

2) Using datetime and meta

meta = ('time', pd.Timestamp)
df['time'].map_partitions(pd.to_datetime,meta=meta).compute()
TypeError: to_datetime() got an unexpected keyword argument 'meta'

3) Just using to_datetime: stuck at 2%

    In [14]: df['trd_exctn_dt'].map_partitions(pd.to_datetime).compute()
[                                        ] | 2% Completed |  2min 20.3s

In addition, I would like to be able to specify the date format, as I would do in pandas:

pd.to_datetime(df['time'], format='%m%d%Y')

Update 2

After updating to Dask 0.11, the meta keyword no longer causes problems. Still, the computation gets stuck at 2%:

df['trd_exctn_dt'].map_partitions(pd.to_datetime, meta=meta).compute()
    [                                        ] | 2% Completed |  30min 45.7s

Update 3

It worked better this way:

def parse_dates(df):
  return pd.to_datetime(df['time'], format = '%m/%d/%Y')

df.map_partitions(parse_dates, meta=meta)

but I am not sure whether this is the right approach.
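Since map_partitions applies the function to each underlying pandas partition, parse_dates can be sanity-checked on plain pandas first. A minimal sketch with made-up values (the column contents here are assumptions, not the real data):

```python
import pandas as pd

def parse_dates(df):
    # Each dask partition is an ordinary pandas DataFrame,
    # so the function can be validated on pandas directly.
    return pd.to_datetime(df['time'], format='%m/%d/%Y')

part = pd.DataFrame({'time': ['01/31/2016', '12/01/2015']})
parsed = parse_dates(part)
print(parsed.dtype)  # datetime64[ns]
```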

+9 · 5 answers

Use astype

You can use the astype method to convert the dtype of a series to a NumPy dtype:

df.time.astype('M8[us]')

There is probably a way to specify a Pandas-style dtype as well (edits welcome).

Use map_partitions with meta

When using black-box methods like map_partitions, dask.dataframe needs to know the types and names of the output. There are a few ways to provide them, listed in the docstring of map_partitions.

You can supply an empty Pandas object with the right dtype and name:

meta = pd.Series([], name='time', dtype=pd.Timestamp)

Or you can provide a tuple of (name, dtype) for a Series, or a dict for a DataFrame:

meta = ('time', pd.Timestamp)

Then everything should be fine:

df.time.map_partitions(pd.to_datetime, meta=meta)

If you were calling map_partitions on df itself, then you would need to provide the dtypes for everything. That is not the case in your example, though.

+10

I am not sure if it is the right approach, but mapping over the column worked for me:

df['time'] = df['time'].map(lambda x: pd.to_datetime(x, errors='coerce'))
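The main effect of errors='coerce' is that unparseable values become NaT instead of raising. A small pandas-only illustration with made-up values:

```python
import pandas as pd

s = pd.Series(['01/31/2016', 'not a date'])
# Unparseable entries become NaT rather than raising a ValueError.
out = s.map(lambda x: pd.to_datetime(x, errors='coerce'))
print(out)
```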
+4

ddf["Date"] = ddf["Date"].map_partitions(pd.to_datetime, format='%d/%m/%Y', meta=('datetime64[ns]'))

+1

Dask also comes with to_datetime, so this should work as well:

df['time'] = dd.to_datetime(df.time, unit='ns')

The values that unit takes are the same as for pd.to_timedelta in pandas.

+1

If the datetime is in a non-ISO format, then map_partitions gives better results:

import dask
import pandas as pd
from dask.distributed import Client
client = Client()

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO = ddf['datetime'].astype(str).str.split(' ')
                                 .apply(lambda x: x[1]+' '+x[0], meta=('object'))))

%%timeit
ddf.datetime = ddf.datetime.astype('M8[s]')
ddf.compute()

11.3 s ± 719 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO = ddf['datetime'].astype(str).str.split(' ')
                                 .apply(lambda x: x[1]+' '+x[0], meta=('object'))))


%%timeit
ddf.datetime_nonISO = (ddf.datetime_nonISO.map_partitions(pd.to_datetime
                       ,  format='%H:%M:%S %Y-%m-%d', meta=('datetime64[s]')))
ddf.compute()

8.78 s ± 599 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO = ddf['datetime'].astype(str).str.split(' ')
                                 .apply(lambda x: x[1]+' '+x[0], meta=('object'))))

%%timeit
ddf.datetime_nonISO = ddf.datetime_nonISO.astype('M8[s]')
ddf.compute()

1min 8s ± 3.65 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
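For reference, the non-ISO strings in this benchmark have the shape 'HH:MM:SS YYYY-MM-DD', and the explicit-format parse behaves the same on a single pandas partition:

```python
import pandas as pd

# Same string shape as the benchmark's datetime_nonISO column.
s = pd.Series(['00:00:00 2000-01-01', '12:30:45 2000-01-02'])
parsed = pd.to_datetime(s, format='%H:%M:%S %Y-%m-%d')
print(parsed)
```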

0

Source: https://habr.com/ru/post/1655132/

