`re.sub()` in pandas

Let's say I have:

s = 'white male, 2 white females'

And I want to expand it into this:

'white male, white female, white female'

A more complete list of cases:

  • 'two Hispanic men, two Hispanic women'
    • → 'Hispanic man, Hispanic man, Hispanic woman, Hispanic woman'
  • '2 black males, white male'
    • → 'black man, black man, white man'

It looks like I'm close with:

import re

# Do I need boundaries here?
mult = re.compile('two|2 (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')

# This works:
s = 'white male, 2 white females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# 'white male, white female, white female'

# This fails:
s = 'two hispanic males, 2 hispanic females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# ' ,   hispanic males, hispanic female, hispanic female'

What causes the failure in the second case?

Bonus question: is there a pandas Series method that does this directly, instead of going through Series.apply()? Thanks for taking the time to look at my question.

For example, on:

s = pd.Series(
    ['white male',
     'white male, white female',
     'hispanic male, 2 hispanic females',
     'black male, 2 white females'])

Is there a faster route than:

s.apply(lambda x: mult.sub(..., x))
+4
3 answers

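IIUC, you need parentheses around two|2, i.e. (two|2). Without them, the alternation splits the entire pattern, so it matches either a bare 'two' or '2 <race> <gender>s'.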

import re

mult = re.compile('(two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
s = 'two hispanic males, 2 hispanic females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# 'hispanic male, hispanic male, hispanic female, hispanic female'
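
To see why the ungrouped alternation misbehaves, here is a minimal sketch that lists what each pattern actually matches on the failing string. The names `ungrouped`, `grouped`, and `describe` are only illustrative and are not part of the original code.

```python
import re

# Ungrouped: the alternation splits the WHOLE pattern into two branches,
# 'two'  OR  '2 <race> <gender>s'.
ungrouped = re.compile(r'two|2 (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
# Grouped: the alternation only covers the count.
grouped = re.compile(r'(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')

s = 'two hispanic males, 2 hispanic females'


def describe(pattern, text):
    """Return (matched text, race group, gender group) for every match."""
    return [(m.group(0), m.group('race'), m.group('gender'))
            for m in pattern.finditer(text)]


ungrouped_matches = describe(ungrouped, s)
grouped_matches = describe(grouped, s)

print(ungrouped_matches)
# [('two', None, None), ('2 hispanic females', 'hispanic', 'female')]
print(grouped_matches)
# [('two hispanic males', 'hispanic', 'male'), ('2 hispanic females', 'hispanic', 'female')]
```

In the ungrouped case the first match is just 'two', with both named groups unset, so the replacement template expands to little more than spaces and a comma, and ' hispanic males' is left untouched.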
+1
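
As for the "bonus" question: yes, there is pandas.Series.str.replace, which, like most of the string methods under pandas.Series.str, mirrors its plain-Python counterpart and accepts a compiled regex: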

In [10]: import re

In [11]: import pandas as pd

In [12]: s = pd.Series(
    ...:     ['white male',
    ...:      'white male, white female',
    ...:      'hispanic male, 2 hispanic females',
    ...:      'black male, 2 white females'])

In [13]: mult = re.compile(r'(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')

In [14]: s.str.replace(mult, r'\g<race> \g<gender>, \g<race> \g<gender>', regex=True)  # regex=True is required in pandas >= 2.0
Out[14]:
0                                         white male
1                           white male, white female
2    hispanic male, hispanic female, hispanic female
3             black male, white female, white female
dtype: object

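Note that this is unlikely to be much faster than .apply: the pandas string methods still work element by element on object dtypes.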


+1

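Alternatively, if all you need is to repeat the matched phrase, you can pass a function as the replacement. Note that this version keeps the plural and does not insert a comma between the copies: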

In [14]: mult = re.compile('(?:two|2) ([^,]+)')

In [15]: s = 'two hispanic males, 2 hispanic females'

In [16]: mult.sub(lambda x: x.group(1) + ' ' + x.group(1), s)
Out[16]: 'hispanic males hispanic males, hispanic females hispanic females'
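
Building on the replacement-function idea, here is a hedged sketch that also drops the plural, joins the copies with commas, and adds the word boundaries the question asked about. The `COUNTS` mapping and the `expand` helper are hypothetical names introduced for illustration, and supporting "three"/"3" is an assumption beyond the original examples.

```python
import re

# Hypothetical lookup from the count token to the number of copies to emit.
COUNTS = {'two': 2, '2': 2, 'three': 3, '3': 3}

# \b keeps the count from matching inside a longer token such as '12',
# and keeps the trailing 's' from matching inside a longer word.
mult = re.compile(
    r'\b(?P<n>two|three|2|3) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s\b'
)


def expand(match):
    """Repeat the singular 'race gender' phrase as many times as the count says."""
    single = f"{match.group('race')} {match.group('gender')}"
    return ', '.join([single] * COUNTS[match.group('n')])


result = mult.sub(expand, 'white male, 3 white females, two hispanic males')
print(result)
# white male, white female, white female, white female, hispanic male, hispanic male

# Thanks to the leading \b, the '2' inside '12' is not treated as a count.
untouched = mult.sub(expand, '12 white males')
print(untouched)
# 12 white males
```

The same `expand` function can be handed to `Series.apply`, or used in a list comprehension over the Series as in the timings in this thread.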

Timings on a pandas Series:

In [29]: s = pd.Series(
    ...:     ['white male',
    ...:      'white male, white female',
    ...:      'hispanic male, 2 hispanic females',
    ...:      'black male, 2 white females'])

In [30]: %timeit s.str.replace('(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s', r'\g<race> \g<gender>, \g<race> \g<gender>', regex=True)
1000 loops, best of 3: 205 µs per loop

In [31]: %timeit s.apply(lambda x: mult.sub(lambda x: x.group(1) + ' ' + x.group(1), x))
10000 loops, best of 3: 148 µs per loop

In [32]: %timeit [mult.sub(lambda x: x.group(1) + ' ' + x.group(1), i) for i in s]
100000 loops, best of 3: 14.6 µs per loop
+1

Source: https://habr.com/ru/post/1692434/

