Match text with multiple regexes in Python

I have a text corpus of 11 files, each with about 190,000 lines. I also have 10 strings, one or more of which may appear in any line of that corpus.

When I come across any of the 10 strings, I need to record, separately, which string appeared in the line. The brute-force way of looping over the regexes for every line and marking the matches takes a long time. Is there an efficient way to do this?

I found a post (Match string with multiple regexes using Python) that gives a True or False result. But how can I record which regular expression matched the line:

any(regex.match(line) for regex in [regex1, regex2, regex3]) 

Edit: adding an example

    regex = ['quick', 'brown', 'fox']
    line1 = "quick brown fox jumps on the lazy dog"    # I need to be able to record all of quick, brown and fox
    line2 = "quick dog and brown rabbit ran together"  # I should record quick and brown
    line3 = "fox was quick an rabit was slow"          # I should be able to record quick and fox

Looping through the regexes and recording the ones that match is one solution, but given the scale (11 * 190,000 * 10), my script has been running for a while now. I need to repeat this quite a few times in my work, so I am looking for a more efficient way.

+4
2 answers

The approach below applies when what you want is the matched text. If you need to know which regular expression in the list triggered the match, you're out of luck and will probably need a loop.

Based on the link you provided:

    import re

    regexes = 'quick', 'brown', 'fox'
    # Join the patterns into one alternation; (?:...) keeps each pattern self-contained.
    combinedRegex = re.compile('|'.join('(?:{0})'.format(x) for x in regexes))

    lines = 'The quick brown fox jumps over the lazy dog', 'Lorem ipsum dolor sit amet', 'The lazy dog jumps over the fox'

    for line in lines:
        print(combinedRegex.findall(line))

outputs:

    ['quick', 'brown', 'fox']
    []
    ['fox']

The point here is that you do not loop over the regular expressions but combine them into one. The difference from the looping approach is that re.findall will not find overlapping matches. For example, if your regular expressions were regexes = 'bro', 'own', the output for the lines above would be:

    ['bro']
    []
    []

whereas the looping method will result in:

    ['bro', 'own']
    []
    []
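If the overlapping hits matter, one possible workaround (not part of the answer above, just a sketch) is to wrap the combined alternation in a zero-width lookahead with a capturing group. The name `overlapping` is made up for this illustration.

```python
import re

regexes = ('bro', 'own')

# A lookahead consumes no characters, so the scan moves forward one
# position at a time and a hit starting inside an earlier hit is still
# reported. The capturing group inside the lookahead holds the matched text.
overlapping = re.compile(
    '(?=({0}))'.format('|'.join('(?:{0})'.format(x) for x in regexes))
)

print(overlapping.findall('The quick brown fox jumps over the lazy dog'))
# ['bro', 'own']
```

One caveat: at any single start position only the first alternative that matches is reported, so two patterns that both match at the same position (say 'bro' and 'brown') would still yield one entry.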
+6

If you are just trying to match literal strings, this is probably simpler:

    strings = 'foo', 'bar', 'baz', 'qux'
    regex = re.compile('|'.join(re.escape(x) for x in strings))

and then you can test against all of them at once:

 match = regex.match(line) 

Of course, you can get the string that matched from the resulting MatchObject:

    if match:
        matching_string = match.group(0)

In action:

    import re

    strings = 'foo', 'bar', 'baz', 'qux'
    regex = re.compile('|'.join(re.escape(x) for x in strings))

    lines = 'foo is a word I know', 'baz is a word I know', 'buz is unfamiliar to me'

    for line in lines:
        match = regex.match(line)
        if match:
            print(match.group(0))

It seems that what you really want is to search the string for your regular expressions. In that case you need to use re.search (or some variant) rather than re.match, no matter what you do, because re.match only matches at the start of the string. As long as none of your regular expressions overlap, you can use my solution above with re.findall:

    matches = regex.findall(line)
    for word in matches:
        print("found {word} in line".format(word=word))
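When the patterns are real regular expressions rather than literal strings, the matched text alone does not say which pattern produced it. A minimal sketch of one way to record that in a single pass, assuming each pattern is wrapped in its own named group: the labels (`p_quick` and so on) and the helper `matched_pattern_names` are hypothetical names chosen for this example, and the patterns are assumed not to define named groups of their own.

```python
import re

# Hypothetical labels; any valid Python identifier works as a group name.
patterns = {'p_quick': r'quick', 'p_brown': r'brown', 'p_fox': r'fox'}

# One combined regex; each alternative sits in its own named group.
combined = re.compile(
    '|'.join('(?P<{0}>{1})'.format(name, pat) for name, pat in patterns.items())
)

def matched_pattern_names(line):
    """Return the sorted labels of the patterns that matched anywhere in line."""
    # m.lastgroup is the name of the named group that took part in this match.
    return sorted({m.lastgroup for m in combined.finditer(line)})

print(matched_pattern_names('quick dog and brown rabbit ran together'))
# ['p_brown', 'p_quick']
```

Like re.findall, re.finditer scans the whole line and skips overlapping matches, so the same caveat about overlapping patterns applies here.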

+1

Source: https://habr.com/ru/post/1441543/

