A vectorized way to check dataframe values (like key, value tuple) for a dictionary?

Question

A vectorized way to check dataframe values (like key, value tuple) for a dictionary?

I would like to create a column in my data framework that checks if the values in one column are dictionary values of another column that contains the dictionary keys, for example:

In [3]:
df = pd.DataFrame({'Model': ['Corolla', 'Civic', 'Accord', 'F-150'],
                   'Make': ['Toyota', 'Honda', 'Toyota', 'Ford']})
dic = {'Prius':'Toyota', 'Corolla':'Toyota', 'Civic':'Honda', 
       'Accord':'Honda', 'Odyssey':'Honda', 'F-150':'Ford', 
       'F-250':'Ford', 'F-350':'Ford'}
df

Out [3]:
     Model    Make
0  Corolla  Toyota
1    Civic   Honda
2   Accord  Toyota
3    F-150    Ford

And after applying the function or whatever, I would like to see:

Out [10]:
     Model    Make   match
0  Corolla  Toyota    TRUE
1    Civic   Honda    TRUE
2   Accord  Toyota   FALSE
3    F-150    Ford    TRUE

Thanks in advance!

Edit: I tried to make a function that is passed in by a tuple that will be two columns, but I don't think I'm passing arguments correctly:

def is_match(make, model):
  try:
    has_item = dic[make] == model
  except KeyError:
    has_item = False
  return(has_item)

df[['Model', 'Make']].apply(is_match)

results in:
TypeError: ("is_match() missing 1 required positional 
argument: 'model'", 'occurred at index Model')

+4

python dictionary pandas boolean dataframe

Excel help Mar 15 '18 at 18:09

source share

3 answers

Comprehension

df.assign(match=[dic.get(md, '') == mk for mk, md in df.values])

     Make    Model  match
0  Toyota  Corolla   True
1   Honda    Civic   True
2  Toyota   Accord  False
3    Ford    F-150   True

`dict.items` and `in`

items = dic.items()
df.assign(match=[t[::-1] in items for t in map(tuple, df.values)])

     Make    Model  match
0  Toyota  Corolla   True
1   Honda    Civic   True
2  Toyota   Accord  False
3    Ford    F-150   True

`isin`

df.assign(match=pd.Series(list(map(tuple, df.values[:, ::-1]))).isin(dic.items()))

     Make    Model  match
0  Toyota  Corolla   True
1   Honda    Civic   True
2  Toyota   Accord  False
3    Ford    F-150   True

Numpy Structured Arrays

dtype = [('Make', '<U6'), ('Model', '<U7')]
a = np.array([tuple(r) for r in df.values], dtype)
b = np.array(list(dic.items()), dtype[::-1])

df.assign(match=np.in1d(a, b))

     Make    Model  match
0  Toyota  Corolla   True
1   Honda    Civic   True
2  Toyota   Accord  False
3    Ford    F-150   True

Time comparison

Conlcusions

The @wen method is much better!

Functions

def wen(df, dic):
    return df.assign(match=df.Model.map(dic).eq(df.Make))

def maxu(df, dic):
    return df.assign(match=df[['Make', 'Model']].sum(axis=1).isin(set([v+k for k, v in dic.items()])))

def pir1(df, dic):
    return df.assign(match=[dic.get(md, '') == mk for mk, md in df.values])

def pir2(df, dic):
    items = dic.items()
    return df.assign(match=[t[::-1] in items for t in map(tuple, df.values)])

def pir3(df, dic):
    return df.assign(match=pd.Series(list(map(tuple, df.values[:, ::-1]))).isin(dic.items()))

def pir4(df, dic):
    dtype = [('Make', '<U6'), ('Model', '<U7')]
    a = np.array([tuple(r) for r in df.values], dtype)
    b = np.array(list(dic.items()), dtype[::-1])

    return df.assign(match=np.in1d(a, b))

Reverse test

res = pd.DataFrame(
    np.nan, [10, 30, 100, 300, 1000, 3000, 10000, 30000],
    'wen maxu pir1 pir2 pir3 pir4'.split()
)

for i in res.index:
    m = dict(dic.items())
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = f'{j}(d, m)'
        setp = f'from __main__ import {j}, m, d'
        res.at[i, j] = timeit(stmt, setp, number=200)

results

res.plot(loglog=True)

res.div(res.min(1), 0)

            wen       maxu       pir1       pir2       pir3       pir4
10     2.041111   2.799885   1.000000   1.032221   1.432887   1.174196
30     1.544264   2.417550   1.000000   1.043218   1.336503   1.003284
100    1.037501   1.843029   1.000000   1.066310   1.319942   1.191763
300    1.000000   2.373917   1.726667   2.009198   2.193276   2.424844
1000   1.000000   3.962928   3.764808   3.932539   4.099261   4.971527
3000   1.000000   6.250289   6.311701   6.740862   6.258989   7.791234
10000  1.000000   9.014925  10.110949  10.964482  10.347168  13.407998
30000  1.000000  10.410604  11.682759  13.113974  11.877862  16.000993

+3

piRSquared Mar 15 '18 at 18:27

source share

another option:

In [38]: df['match'] = df[['Make','Model']] \
                         .sum(axis=1) \
                         .isin(set([v+k for k,v in dic.items()]))

In [39]: df
Out[39]:
     Make    Model  match
0  Toyota  Corolla   True
1   Honda    Civic   True
2  Toyota   Accord  False
3    Ford    F-150   True

+2

Maxu Mar 15 '18 at 18:52

source share

Wen · Accepted Answer · 2018-03-15T18:13:34+0000

you can use map

df.assign(match=df.Model.map(dic).eq(df.Make))
Out[129]: 
     Make    Model  match
0  Toyota  Corolla   True
1   Honda    Civic   True
2  Toyota   Accord  False
3    Ford    F-150   True

A vectorized way to check dataframe values ​​(like key, value tuple) for a dictionary?

Comprehension

dict.items and in

isin

Numpy Structured Arrays

Time comparison

Conlcusions

Functions

Reverse test

results

More articles:

A vectorized way to check dataframe values (like key, value tuple) for a dictionary?

`dict.items` and `in`

`isin`