Regular expression to match a specific word hyphenated at an arbitrary position and split across two lines

I want to search a text file for a given word that may be hyphenated at an unknown position inside the word and split across consecutive lines.

E.g. the match for "hyphenated" in:

This sentence contains a hyphena-
ted word.

The closest (ugly) solution I have:

 "h\(-\s*\n\s*\)\?y\(-\s*\n\s*\)\?p\(-\s*\n\s*\)\?h\(-\s*\n\s*\)\?e\(-\s*\n\s*\)\?n\(-\s*\n\s*\)\?a\(-\s*\n\s*\)\?t\(-\s*\n\s*\)\?e\(-\s*\n\s*\)\?d" 

I hope that someone with stronger regex-fu than mine can come up with a regular expression that explicitly contains the search word, i.e. I would like to see "hyphenated" in the pattern. I did not find a way to encode something like the following (which would be wrong anyway, since it would also match "hy-ted"):

 "{prefix-of:hyphenated}{hyphen/linebreak}{suffix-of:hyphenated}" 

I understand that preprocessing a document to collapse such words will simplify the search, but I am looking for a regular expression that I can use in a context where this is not possible due to the tools involved.

3 answers

Given that hy-phen-ated should also match, I think this is a case where a regular expression alone is not the right tool.

I would do this (without knowing your language, I used pseudocode):

  • remove hyphens and newlines from the input
  • match the cleaned input against .*hyphenated.*

Any language can do step 1 trivially, and the code will be much more readable.
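The two steps above can be sketched in Python as follows. This is only an illustration under some assumptions: the function name contains_word is made up here, and step 1 is interpreted narrowly, removing a hyphen only when it is followed by a line break, so that ordinary hyphens and unrelated line breaks are left alone.

```python
import re

# A hyphen, optional trailing spaces, a line break, then any leading
# whitespace on the next line (assumed shape of a line-break hyphenation).
LINE_BREAK_HYPHEN = re.compile(r"-\s*\n\s*")


def contains_word(text, word):
    """Return True if `word` occurs in `text` once line-break hyphens are removed."""
    cleaned = LINE_BREAK_HYPHEN.sub("", text)          # step 1: clean the input
    # step 2: plain whole-word search on the cleaned text
    return re.search(r"\b" + re.escape(word) + r"\b", cleaned) is not None


sample = "This sentence contains a hyphena-\nted word."
print(contains_word(sample, "hyphenated"))
```

Because the search word appears literally in the call, this also keeps the pattern readable, which is what the question asked for.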


I think this will work. If you have a lot of words to search for, you will probably want to write a script that generates the search pattern for you.

 [h\-]+\s*[y\-\s]+[p\-\s]+[h\-\s]+[e\-\s]+[n\-\s]+[a\-\s]+[t\-\s]+[e\-\s]+d\b 

I don't think you mentioned which language you use, but I tested it with .NET.

Here is a simple Python script that will generate search patterns:

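 # patterngen.py
 # Usage: python patterngen.py <word>
 # Example: python patterngen.py hyphenated
 import sys

 word = sys.argv[1]
 # One or more of the first letter or a hyphen, then optional whitespace.
 pattern = '[' + word[0] + r'\-]+\s*'
 # Each middle letter may be mixed with hyphens and whitespace (including newlines).
 for i in range(1, len(word) - 1):
     pattern = pattern + r'[' + word[i]
     pattern = pattern + r'\-\s]+'
 # The last letter must end at a word boundary.
 pattern = pattern + word[-1] + r'\b'
 print(pattern)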
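To get a feel for what the generated pattern accepts, here is a small check. It is a sketch that assumes Python's re module rather than .NET, and loose_pattern is a hypothetical helper that builds the same string as the script above.

```python
import re


def loose_pattern(word):
    # Hypothetical helper: same construction as patterngen.py, as a function.
    middle = "".join("[" + c + r"\-\s]+" for c in word[1:-1])
    return "[" + word[0] + r"\-]+\s*" + middle + word[-1] + r"\b"


pattern = re.compile(loose_pattern("hyphenated"))

split_word = "This sentence contains a hyphena-\nted word."
print(bool(pattern.search(split_word)))      # word split across two lines
print(bool(pattern.search("hy-ted")))        # letters missing
print(bool(pattern.search("hhyyphenated")))  # letters doubled
```

The first and third searches succeed and the second fails: the pattern rejects "hy-ted", but because every character class is repeated with +, it also accepts doubled letters such as "hhyyphenated".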

Another way to approach this, off the top of my head, is to "shift" the hyphenation through the word like this:

 hyphenated|h(-\s*\n\s*)yphenated|hy(-\s*\n\s*)phenated|hyp(-\s*\n\s*)henated|hyph(-\s*\n\s*)enated|hyphe(-\s*\n\s*)nated|hyphen(-\s*\n\s*)ated|hyphena(-\s*\n\s*)ted|hyphenat(-\s*\n\s*)ed|hyphenate(-\s*\n\s*)d 

It reads better, but I really don't know how its performance compares with that of your original pattern.
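Writing that alternation by hand is error-prone, so it could be generated. The following is a minimal sketch in Python; the name sliding_pattern is made up for illustration, and it uses non-capturing groups where the pattern above uses capturing ones.

```python
import re

# Hyphen, optional spaces, a line break, optional leading whitespace.
HYPHEN_BREAK = r"(?:-\s*\n\s*)"


def sliding_pattern(word):
    """Build word|w(break)ord|wo(break)rd|... for every split position."""
    alternatives = [re.escape(word)]                   # the unbroken word
    for i in range(1, len(word)):
        # Exactly one hyphenated line break between word[:i] and word[i:].
        alternatives.append(re.escape(word[:i]) + HYPHEN_BREAK + re.escape(word[i:]))
    return "|".join(alternatives)


pattern = re.compile(sliding_pattern("hyphenated"))
match = pattern.search("This sentence contains a hyphena-\nted word.")
print(repr(match.group(0)) if match else None)
```

Each alternative allows only a single break, so a word split twice (hy-phen-ated over three lines) would not be found by this variant.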


Another idea is to narrow the search first using a pattern along these lines:

 h[hypenatd]{0,9}(-\s*\n*\s)?[hypenatd]{0,9} 

and then run the exact match on its results.

In fact, if I'm not mistaken, if you match with groups like these:

 (h[hypenatd]{0,9})(?:-\s*\n*\s)?([hypenatd]{0,9}) 

then the occurrences of the word hyphenated are exactly those matches for which, in pseudocode:

 (match.group1 + match.group2) == "hyphenated" 
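A runnable version of that filter might look like this in Python. It is a sketch under assumptions: find_split_word is a hypothetical name, and the candidate pattern is rebuilt from the word (first letter, set of letters, length minus one) so that it reproduces the pattern above for "hyphenated".

```python
import re


def find_split_word(text, word):
    """Return (start, end) spans where the two captured halves join to `word`."""
    first = re.escape(word[0])
    letters = "".join(re.escape(c) for c in sorted(set(word)))
    n = len(word) - 1
    # Broad candidate pattern: two runs of the word's letters with an
    # optional hyphen-plus-whitespace separator between them.
    candidate = re.compile(
        r"(%s[%s]{0,%d})(?:-\s*\n*\s)?([%s]{0,%d})" % (first, letters, n, letters, n)
    )
    return [
        m.span()
        for m in candidate.finditer(text)
        if m.group(1) + m.group(2) == word             # the exact check
    ]


text = "This sentence contains a hyphena-\nted word."
for start, end in find_split_word(text, "hyphenated"):
    print(repr(text[start:end]))
```

The regular expression only proposes candidates; the string comparison does the real matching, which is why a candidate such as "hy" followed by "ted" is discarded.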

Source: https://habr.com/ru/post/1442147/

