Odd problem with .isin() and strings (Python / Pandas)

I am having a strange problem with the pandas .isin() method. I am working on a project in which I need to identify bad passwords by length, by checking them against lists of common words and passwords, etc. (Don't worry, the data is from a public source.) One check is whether someone is using their own name as a password. I use .isin() to determine this, but it gives me weird results. To illustrate:

    # Extracting first and last names into their own columns
    users['first_name'] = users.user_name.str.extract(r'(^.+)(\.)', expand = False)[0]
    users['last_name'] = users.user_name.str.extract(r'\.(.+)', expand = False)

    # Flagging the users with passwords that match their names
    users['uses_name'] = (users['password'].isin(users.first_name)) | (users['password'].isin(users.last_name))

    # Looking at the new data
    print(users[users['uses_name']][['password','user_name','first_name','last_name','uses_name']].head())

The result of this:

       password            user_name first_name  last_name  uses_name
    7    murphy          noreen.hale     noreen       hale       True
    11  hubbard      milford.hubbard    milford    hubbard       True
    22  woodard        jenny.woodard      jenny    woodard       True
    30     reid         rosanna.reid    rosanna       reid       True
    58   golden  rosalinda.rodriquez  rosalinda  rodriquez       True

This is mostly good; milford.hubbard uses hubbard as a password, etc. But then there are a few rows like the first one. Noreen Hale is flagged even though her password is "murphy", which has almost nothing in common with her name.

I can't for the life of me figure out what causes this. Does anyone know why this is happening and how to fix it?

+5
2 answers

Since you need to compare columns within the same row, there is not much to vectorize. So you can use what is (possibly) the fastest alternative at your disposal, a list comprehension:

    df['uses_name'] = [pwd in name for name, pwd in zip(df.user_name, df.password)]

Or, if you don't like loops, you can hide them with np.vectorize:

    import numpy as np

    # True when the password occurs as a substring of the user name
    def f(name, pwd):
        return pwd in name

    v = np.vectorize(f)
    df['uses_name'] = v(df.user_name, df.password)

    df

       password            user_name  uses_name
    7    murphy          noreen.hale      False
    11  hubbard      milford.hubbard       True
    22  woodard        jenny.woodard       True
    30     reid         rosanna.reid       True
    58   golden  rosalinda.rodriquez      False

Given that first_name and last_name are extracted from user_name, I don't think you need those two columns here.
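If an exact match against the extracted names is what is wanted, rather than the substring test above, a plain element-wise == comparison may also work, since it compares values row by row. The following is only a sketch: the toy data is made up for illustration, and lowercasing the password first is an assumption about how matches should be treated, not something stated in the question.

```python
import pandas as pd

# Hypothetical toy data, loosely modeled on the rows shown in the question.
users = pd.DataFrame({
    "user_name": ["noreen.hale", "milford.hubbard", "jenny.woodard"],
    "password": ["murphy", "hubbard", "Jenny"],
    "first_name": ["noreen", "milford", "jenny"],
    "last_name": ["hale", "hubbard", "woodard"],
})

# Assumption: matching should ignore case, so lowercase the password first.
pwd = users["password"].str.lower()

# == compares each password with the name in the SAME row only.
users["uses_name"] = (pwd == users["first_name"]) | (pwd == users["last_name"])

print(users["uses_name"].tolist())  # [False, True, True]
```

Unlike pwd in name, this does not flag a password that is merely a fragment of the name, so which of the two is appropriate depends on what should count as "using the name".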

+4

Regarding the cause of this behavior:

When you execute users['password'].isin(users.first_name), you are asking, for each row of users['password'], whether that value equals ANY value in the first_name column, not just the value in the same row. So I assume that murphy appears somewhere in one of the name columns, as some other user's first or last name.
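The difference can be seen on a small example. This is a minimal sketch with made-up data (the third user, murphy.stone, is hypothetical and exists only so that "murphy" appears in the first_name column); it contrasts isin with a row-wise == comparison.

```python
import pandas as pd

# Hypothetical toy data: "murphy" is one user's password and ANOTHER user's first name.
users = pd.DataFrame({
    "user_name": ["noreen.hale", "milford.hubbard", "murphy.stone"],
    "password": ["murphy", "hubbard", "secret"],
    "first_name": ["noreen", "milford", "murphy"],
    "last_name": ["hale", "hubbard", "stone"],
})

# isin: does the password equal ANY value in the first_name column, in any row?
any_row = users["password"].isin(users["first_name"])

# ==: does the password equal the first_name in the SAME row?
same_row = users["password"] == users["first_name"]

print(any_row.tolist())   # [True, False, False]
print(same_row.tolist())  # [False, False, False]
```

Here noreen.hale is flagged by isin only because someone else in the table is named murphy, which matches the kind of false positive described in the question.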

+1

Source: https://habr.com/ru/post/1275719/

