Creating new columns in a dataset based on column values using Regex

This is my data frame.

index     duration 
1           7 year   
2           2day
3           4 week
4           8 month

I need to separate the numbers from the time units and put them in two new columns. The desired output is as follows:

index     duration         number     time
1           7 year          7         year
2           2day            2         day
3           4 week          4         week
4           8 month         8         month

This is my code:

df ['numer'] = df.duration.replace(r'\d.*' , r'\d', regex=True, inplace = True)
df [ 'time']= df.duration.replace (r'\.w.+',r'\w.+', regex=True, inplace = True )

But that does not work. Any suggestions?

I also need to create another column based on the values of the time column, so that the new data set looks like this:

index     duration         number     time      time_days
1           7 year          7         year       365
2           2day            2         day        1
3           4 week          4         week       7
4           8 month         8         month      30

df['time_day']= df.time.replace(r'(year|month|week|day)', r'(365|30|7|1)', regex=True, inplace=True)

Any suggestions?

2 answers

We can use Series.str.extract here:

In [67]: df[['number','time']] = df.duration.str.extract(r'(\d+)\s*(.*)', expand=True)

In [68]: df
Out[68]:
   index duration number    time
0      1   7 year      7    year
1      2     2day      2     day
2      3   4 week      4    week
3      4  8 month      8   month

For an explanation of the RegEx, see regex101.com, which is IMO one of the best online RegEx parsers, testers and interpreters.
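As a rough illustration of how the pattern splits each string, here is a minimal sketch that uses Python's standard re module instead of pandas. The sample list and the variable names are assumptions made for this example. str.extract returns the capture groups of the first match in each string, much like re.search does for a single string.

```python
import re

# Same pattern as in the answer above:
#   (\d+)  group 1: one or more digits
#   \s*    optional whitespace between the number and the unit
#   (.*)   group 2: the rest of the string
pattern = re.compile(r'(\d+)\s*(.*)')

# Hypothetical sample values mirroring the duration column
durations = ['7 year', '2day', '4 week', '8 month']

pairs = []
for value in durations:
    match = pattern.search(value)
    # groups() returns the text captured by each parenthesised group
    pairs.append(match.groups())

print(pairs)
```

Because \s* also matches zero characters, '2day' is split the same way as '7 year'.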

To convert the number column to an integer dtype:

In [69]: df['number'] = df['number'].astype(int)

In [70]: df.dtypes
Out[70]:
index        int64
duration    object
number       int32
time        object
dtype: object
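One caveat: astype(int) raises an error if any row failed to match the pattern, because str.extract leaves NaN there. A possible workaround, sketched here on a hypothetical frame with one non-matching row (not part of the data in the question), is pd.to_numeric combined with the nullable Int64 dtype:

```python
import pandas as pd

# Hypothetical frame; the last row has no leading number, so extract yields NaN
df = pd.DataFrame({'duration': ['7 year', '2day', 'unknown']})
df[['number', 'time']] = df['duration'].str.extract(r'(\d+)\s*(.*)', expand=True)

# to_numeric converts the digit strings and turns unparsable values into NaN;
# 'Int64' (capital I) is pandas' nullable integer dtype, which can hold missing values
df['number'] = pd.to_numeric(df['number'], errors='coerce').astype('Int64')

print(df)
```

If every row is known to match, the plain astype(int) shown above is simpler.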

UPDATE:

In [167]: df['time_day'] = df['time'].replace(['year','month','week','day'], [365, 30, 7, 1], regex=True)

In [168]: df
Out[168]:
   index duration number    time  time_day
0      1   7 year      7    year       365
1      2     2day      2     day         1
2      3   4 week      4    week         7
3      4  8 month      8   month        30
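An alternative to the regex-based replace is a plain dictionary lookup with Series.map. This is only a sketch: the days_per_unit name is made up for the example, and the conversion factors are the ones given in the question. Unlike replace with regex=True, map needs an exact match, so stray whitespace is stripped first.

```python
import pandas as pd

# Hypothetical frame holding only the extracted time column
df = pd.DataFrame({'time': ['year', 'day', 'week', 'month']})

# Conversion factors taken from the question (assumed: 365, 30, 7, 1 days)
days_per_unit = {'year': 365, 'month': 30, 'week': 7, 'day': 1}

# strip() removes surrounding whitespace; map looks each unit up in the dict,
# and any unit missing from the dict becomes NaN
df['time_day'] = df['time'].str.strip().map(days_per_unit)

print(df)
```

Units that are not in the dictionary show up as NaN, which makes unexpected values easy to spot.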

Use str.extract with named groups, then astype:

# store the result under a new name so the original df is not overwritten
df1 = df['duration'].str.extract(r'(?P<number>\d+)\s*(?P<time>\w+)', expand=True)
#convert to int
df1['number'] = df1['number'].astype(int)
print (df1)
   number   time
0       7   year
1       2    day
2       4   week
3       8  month

To append the new columns to the original DataFrame:

df = df.join(df['duration'].str.extract(r'(?P<number>\d+)\s*(?P<time>\w+)', expand=True))
#convert to int
df['number'] = df['number'].astype(int)
print (df)
   index duration  number   time
0      1   7 year       7   year
1      2     2day       2    day
2      3   4 week       4   week
3      4  8 month       8  month

Or assign the two new columns directly:

df[['number','time']] = df['duration'].str.extract(r'(\d+)\s*(\w+)', expand=True)
#convert to int
df['number'] = df['number'].astype(int)
print (df)
   index duration  number   time
0      1   7 year       7   year
1      2     2day       2    day
2      3   4 week       4   week
3      4  8 month       8  month

Source: https://habr.com/ru/post/1680317/

