Python pandas: cleaning up dates to convert them to years only

I have a large dataset that users entered into a CSV file. I converted the CSV to a DataFrame with pandas. One column has more than 1000 entries; here is a sample:

datestart
5/5/2013
6/12/2013
11/9/2011
4/11/2013
10/16/2011
6/15/2013
6/19/2013
6/16/2013
10/1/2011
1/8/2013
7/15/2013
7/22/2013
7/22/2013
5/5/2013
7/12/2013
7/29/2013
8/1/2013
7/22/2013
3/15/2013
6/17/2013
7/9/2013
3/5/2013
5/10/2013
5/15/2013
6/30/2013
6/30/2013
1/1/2006
00/00/0000
7/1/2013
12/21/2009
8/14/2013
Feb 1 2013

Then I tried to convert dates to years using

df['year']=df['datestart'].astype('timedelta64[Y]')

But this gave me an error:

ValueError: Value cannot be converted into object Numpy Time delta

Using datetime64 instead:

df['year']=pd.to_datetime(df['datestart']).astype('datetime64[Y]')

gave:

ValueError: Error parsing datetime string "03/13/2014" at position 2

Since this column was filled in by users, most entries are in MM/DD/YYYY format, but some were entered like February 10, 2013, and there is one record that reads 00/00/0000. I assume the mixed formats broke the parsing.

Is there a try loop, an if statement, or something similar that I can use to skip over such problem entries?

I also tried a str.extract script:

year=df['datestart'].str.extract("(?P<month>[0-9]+)(-|\/)(?P<day>[0-9]+)(-|\/)(?P<year>[0-9]+)")


del df['month'], df['day']  

I then used concat to combine the result with the original DataFrame.
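For illustration, here is a minimal sketch of the regex route on a few of the sample values. The three-row DataFrame and the names `pattern` and `parts` are assumptions made for this example, and the pattern is a simplified variant of the one above that uses a character class for the separator so that only the named groups are captured. Rows that do not match the pattern come back as NaN, and 00/00/0000 still matches, so it yields year 0.

```python
import pandas as pd

# Hypothetical sample taken from the column shown in the question.
df = pd.DataFrame({"datestart": ["5/5/2013", "00/00/0000", "Feb 1 2013"]})

# Named groups become the column names of the extracted DataFrame.
pattern = r"(?P<month>[0-9]+)[-/](?P<day>[0-9]+)[-/](?P<year>[0-9]+)"
parts = df["datestart"].str.extract(pattern, expand=True)

# Keep only the year; non-matching rows stay NaN, so the column is float.
df["year"] = pd.to_numeric(parts["year"])
print(df)
```

Note that this approach does no validation of the date itself, so a placeholder such as 00/00/0000 has to be filtered out separately.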

I also tried:

df['year']=pd.to_datetime(df['datestart'],coerce=True, errors ='ignore').astype('datetime64[Y]')

which gave this traceback:

Message File Name   Line    Position    
Traceback               
    <module>    C:\Users\0\Desktop\python\Example.py    23      
    astype  C:\Python33\lib\site-packages\pandas\core\generic.py    2062        
    astype  C:\Python33\lib\site-packages\pandas\core\internals.py  2491        
    apply   C:\Python33\lib\site-packages\pandas\core\internals.py  3728        
    astype  C:\Python33\lib\site-packages\pandas\core\internals.py  1746        
    _astype C:\Python33\lib\site-packages\pandas\core\internals.py  470     
    _astype_nansafe C:\Python33\lib\site-packages\pandas\core\common.py 2222        
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [datetime64[Y]]        

First, convert the column to datetime values with to_datetime():

df['datestart'] = pd.to_datetime(df['datestart'], coerce=True)

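(coerce=True converts values that cannot be parsed into NaT; in newer pandas versions this option is spelled errors='coerce'.)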
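As a rough sketch of how coercion handles the mixed entries from the question, the example below parses the dominant MM/DD/YYYY format first and then retries the leftovers with a second format. The sample values, the two explicit format strings, and the names `primary` and `fallback` are assumptions for illustration; it uses the newer errors='coerce' spelling and assumes English month abbreviations.

```python
import pandas as pd

# Hypothetical sample taken from the column shown in the question.
df = pd.DataFrame(
    {"datestart": ["5/5/2013", "10/16/2011", "00/00/0000", "12/21/2009", "Feb 1 2013"]}
)

# First pass: the dominant MM/DD/YYYY format; anything else becomes NaT.
primary = pd.to_datetime(df["datestart"], format="%m/%d/%Y", errors="coerce")

# Second pass: retry with the "Feb 1 2013" style, again coercing failures to NaT.
fallback = pd.to_datetime(df["datestart"], format="%b %d %Y", errors="coerce")

# Fill the gaps left by the first pass; 00/00/0000 matches neither and stays NaT.
parsed = primary.fillna(fallback)
print(parsed)
```

Giving explicit formats keeps the behaviour predictable across pandas versions, at the cost of listing every format that appears in the data.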

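Then, to convert to years, you can use astype as you tried (but, since pandas always converts back to datetime64[ns], use values to get a numpy array):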

df['datestart'].values.astype('datetime64[Y]')

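However, this will not work if there are NaT values (you can remove them first, for example with df = df.dropna()). Furthermore, when the result is assigned back to a column, pandas converts it to datetime64[ns] again. So, to get a column with the years, I would personally do the following: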

df['year'] =  pd.DatetimeIndex(df['datestart']).year

This returns the year for each row.
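The two routes can be compared side by side in a short sketch. The sample data is an assumption, and the second option swaps in the .dt accessor (available in newer pandas versions) together with the nullable Int64 dtype instead of pd.DatetimeIndex, so that rows with NaT can stay in the frame without turning the years into floats.

```python
import pandas as pd

# Hypothetical sample; 00/00/0000 is coerced to NaT.
df = pd.DataFrame({"datestart": ["5/5/2013", "00/00/0000", "12/21/2009"]})
df["datestart"] = pd.to_datetime(df["datestart"], format="%m/%d/%Y", errors="coerce")

# Option 1: drop NaT rows, then truncate to year precision on the numpy array.
clean = df["datestart"].dropna()
years_np = clean.values.astype("datetime64[Y]")

# Option 2: keep every row and pull out the year as a nullable integer.
df["year"] = df["datestart"].dt.year.astype("Int64")
print(years_np)
print(df)
```

Option 2 keeps the year column aligned with the original rows, which is usually easier to work with than a separate numpy array.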


Source: https://habr.com/ru/post/1544983/

