Python: check if any word in the word list matches any pattern in the regular expression pattern list

I have a long list of words and regular expression patterns in a .txt file that I read like this:

    with open(fileName, "r") as f1:
        pattern_list = f1.read().split('\n')

To illustrate, the first seven are as follows:

    print pattern_list[:7]
    # ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']

I want to know when a word from an input line matches any of the words/patterns in pattern_list. The code below kind of works, but I see two problems:

  • Firstly, it seems inefficient to re.compile() every element of pattern_list every time I check a new string_input. But when I tried storing the re.compile(raw_str) objects in a list (so that I could reuse the already compiled regexes for something more like if w in regex_compile_list:), it didn't work correctly.
  • Secondly, it sometimes doesn't work as I expect. Note how
    • abuse* matches abusive
    • abusi* matches abused and abuse
    • ache* matches aching

What am I doing wrong, and how can I be more efficient? Thanks in advance for your patience with a noob, and thanks for any insight!

    import re

    string_input = "People who have been abandoned or abused will often be afraid of adversarial, abusive, or aggressive behavior. They are aching to abandon the abuse and aggression."
    for raw_str in pattern_list:
        pat = re.compile(raw_str)
        for w in string_input.split():
            if pat.match(w):
                print "matched:", raw_str, "with:", w

    #matched: abandon* with: abandoned
    #matched: abandon* with: abandon
    #matched: abuse* with: abused
    #matched: abuse* with: abusive,
    #matched: abuse* with: abuse
    #matched: abusi* with: abused
    #matched: abusi* with: abusive,
    #matched: abusi* with: abuse
    #matched: ache* with: aching
    #matched: aching with: aching
    #matched: advers* with: adversarial,
    #matched: afraid with: afraid
    #matched: aggress* with: aggressive
    #matched: aggress* with: aggression.
+4
4 answers

To match shell-style wildcards, you can (ab)use the fnmatch module.

Since fnmatch is designed primarily for matching file names, whether the test is case-sensitive depends on your operating system. So you have to normalize both the text and the pattern (here I use lower() for this purpose).

    >>> import fnmatch
    >>> pattern_list = ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']
    >>> string_input = "People who have been abandoned or abused will often be afraid of adversarial, abusive, or aggressive behavior. They are aching to abandon the abuse and aggression."
    >>> words = string_input.lower().split()
    >>> for pattern in pattern_list:
    ...     l = fnmatch.filter(words, pattern.lower())
    ...     if l:
    ...         print pattern, "match", l

Output:

    abandon* match ['abandoned', 'abandon']
    abuse* match ['abused', 'abuse']
    abusi* match ['abusive,']
    aching match ['aching']
    advers* match ['adversarial,']
    afraid match ['afraid']
    aggress* match ['aggressive', 'aggression.']
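
If the same patterns are checked against many input lines, one option is to translate each wildcard into a regex once with fnmatch.translate and keep the compiled objects around. The following is only a sketch for Python 3: the helper names compile_wildcards and find_matches are made up for illustration, and stripping punctuation from each word is an assumption about how the input should be cleaned.

```python
import fnmatch
import re
import string

def compile_wildcards(patterns):
    """Compile each shell-style wildcard once; returns (pattern, regex) pairs."""
    # Hypothetical helper: lowercases the pattern before translating it.
    return [(p, re.compile(fnmatch.translate(p.lower()))) for p in patterns]

def find_matches(compiled, text):
    """Return (pattern, word) pairs for every word that matches a wildcard."""
    # Assumption: lowercase each word and strip surrounding punctuation,
    # so that 'abusive,' is compared as 'abusive'.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    return [(p, w) for p, rx in compiled for w in words if rx.match(w)]

compiled = compile_wildcards(['abandon*', 'abuse*', 'abusi*', 'aching'])
hits = find_matches(compiled, "They are aching to abandon the abuse, not abusive.")
print(hits)
```

The compiled list can be built once at startup and passed to find_matches for every new line of input.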
+8

abandon* will match abandonnnnnnnnnnnnnnnnnnnnnnn, not abandonasfdsafdasf: in regex syntax, * repeats the preceding character rather than standing for arbitrary text. You want

 abandon.* 

instead.
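
A short Python 3 illustration of why the original loop reports abuse* as matching abusive: re.match only anchors at the start of the string, so a match on a prefix is enough. The conversion at the end assumes that * is the only special character in the pattern list; that is an assumption, not something the question states.

```python
import re

# As a regex, 'abuse*' means 'abus' followed by zero or more 'e'.
prefix_hit = re.match(r'abuse*', 'abusive')
print(prefix_hit.group())  # shows which part of the word actually matched

# 'abuse.*' requires the literal 'abuse' before any trailing characters.
print(re.match(r'abuse.*', 'abusive'))
print(re.match(r'abuse.*', 'abused').group())

# Convert the wildcard list once (assumes '*' is the only special character).
pattern_list = ['abandon*', 'abuse*', 'abusi*', 'aching']
compiled = [re.compile(p.replace('*', '.*')) for p in pattern_list]

# fullmatch insists that the whole word matches, not just a prefix.
matched = [p for p, rx in zip(pattern_list, compiled) if rx.fullmatch('abusive')]
print(matched)
```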

+2

If the * is always at the end of the pattern, you can do something like this:

    for pat in pattern_list:
        for w in words:
            if pat[-1] == '*' and w.startswith(pat[:-1]) or w == pat:
                pass  # Do stuff
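
The same idea can be arranged so that the pattern list is processed only once. This is a sketch for Python 3 with made-up helper names (split_patterns, matching_words); it relies on str.startswith accepting a tuple of prefixes.

```python
def split_patterns(patterns):
    """Separate trailing-'*' prefixes from exact words (done once)."""
    prefixes = tuple(p[:-1] for p in patterns if p.endswith('*'))
    exact = {p for p in patterns if not p.endswith('*')}
    return prefixes, exact

def matching_words(words, prefixes, exact):
    """Return the words that equal an exact entry or start with a prefix."""
    # str.startswith checks every prefix in the tuple in a single call.
    return [w for w in words if w in exact or w.startswith(prefixes)]

prefixes, exact = split_patterns(['abandon*', 'abuse*', 'abusi*', 'aching', 'afraid'])
words = "they are aching to abandon the abuse and abusive talk".split()
print(matching_words(words, prefixes, exact))
```

Using endswith('*') instead of pat[-1] also avoids an IndexError on an empty pattern, which read().split('\n') produces when the file ends with a newline.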
+2

If the patterns use regex syntax:

    import re

    m = re.search(r"\b({})\b".format("|".join(patterns)), input_string)
    if m:
        pass  # found match

Use (?:\s+|^) and (?:\s+|$) instead of \b if the words are separated by a space.
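
The patterns in the question are wildcards rather than regexes, so they would need converting before being joined this way. One possible conversion, sketched for Python 3: build_regex is a hypothetical helper, and replacing * with \w* is an assumption that a wildcard should stay within a single word.

```python
import re

def build_regex(patterns):
    """Join wildcard patterns into one compiled alternation."""
    # Escape each pattern, then turn the escaped '*' into '\w*' so that a
    # wildcard cannot run past the end of the current word.
    parts = [re.escape(p).replace(r'\*', r'\w*') for p in patterns]
    return re.compile(r'\b(?:{})\b'.format('|'.join(parts)), re.IGNORECASE)

word_re = build_regex(['abandon*', 'abuse*', 'abusi*', 'aching', 'afraid'])
text = "They are aching to abandon the abuse and aggression."
found = word_re.findall(text)
print(found)
```

Because the group is non-capturing, findall returns the full matched words, and the \b anchors keep trailing punctuation out of the results.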

+1

Source: https://habr.com/ru/post/1485825/

