Regular expression to filter desired rows from pandas dataframe

I work with pretty dirty data: a tariff table with the following form:

import pandas as pd
import numpy as np

data1 = np.array([u'Free (A, B, KR, FR), 5% (JP)', u'Free (A, B, FR), 5% (JP, KR))'])
data2 = np.array(['10101010', '10101020'])
data = {'hscode': data2, 'tariff' : data1}

df = pd.DataFrame(data, columns=['hscode', 'tariff'])

The first line shows that for countries A, B, KR, and FR the tariff is zero ("Free") and for JP it is 5%; the second line shows that for A, B, and FR it is zero, and for JP and KR it is 5%.

I want to find the tariff rate of the country “KR” for each row so that I can have the following table:

hscode      tariff
10101010    0%
10101020    5%

So, I want to find the tariff rate for the country code "KR" in each cell.

2 answers

You can use apply with regex:


In [133]: import re

In [134]: df
Out[134]: 
     hscode                         tariff
0  10101010   Free (A, B, KR, FR), 5% (JP)
1  10101020  Free (A, B, FR), 5% (JP, KR))

In [135]: df['tariff'].apply(lambda x: ''.join(re.findall(r'.*(Free|\d+%).*\bKR\b', x)))
Out[135]: 
0    Free
1      5%
Name: tariff, dtype: object

The captured group is either "Free" or a rate like "x%": because the leading `.*` is greedy, `re.findall` returns the last rate token that appears before a standalone "KR" in the string.

The `\b` word boundaries around "KR" ensure that a similar code such as "FR" is not matched by accident.
import pandas as pd
import numpy as np

data1 = np.array([u'Free (A, B, KR, FR), 5% (JP)', u'Free (A, B, FR), 5% (JP, KR))'])
data2 = np.array(['10101010', '10101020'])

# Build one row per (hscode, tariff, country) triple.
rows = []
for i, element in enumerate(data1):
    # Note: lstrip/rstrip strip *sets of characters*, not literal prefixes;
    # this works for the exact format above but is fragile in general.
    free, five = element.lstrip('Free (').rstrip(')').split('), 5% (')
    for country in free.split(', '):
        rows.append([data2[i], 'Free', country])
    for country in five.split(', '):
        rows.append([data2[i], '5%', country])
df = pd.DataFrame(rows, columns=['hscode', 'tariff', 'country'])
print(df.query('country == "KR"'))

     hscode tariff country
2  10101010   Free      KR
9  10101020     5%      KR

Source: https://habr.com/ru/post/1611941/

