Regular expression to match specific words hyphenated at arbitrary positions and split across two lines

Question

I want to search a text file for a given word that may be hyphenated at an unknown position inside the word and split across two consecutive lines.

E.g. match "hyphenated" inside:

 This sentence contains a hyphena-
 ted word.

The closest (unattractive) solution I have:

 "h\(-\s*\n\s*\)\?y\(-\s*\n\s*\)\?p\(-\s*\n\s*\)\?h\(-\s*\n\s*\)\?e\(-\s*\n\s*\)\?n\(-\s*\n\s*\)\?a\(-\s*\n\s*\)\?t\(-\s*\n\s*\)\?e\(-\s*\n\s*\)\?d"

I am hoping that someone with stronger regex-fu than mine can come up with a regular expression that explicitly includes the search word, i.e. I would like to see "hyphenated" in it. I did not find a way to encode something like the following (which would be wrong anyway, since it would match "hy-ted"):

 "{prefix-of:hyphenated}{hyphen/linebreak}{suffix-of:hyphenated}"

I understand that preprocessing the document to collapse such words would simplify the search, but I am looking for a regular expression that I can use in a context where preprocessing is not possible because of the tools involved.

regex multiline hyphen

user1775138 Oct 25 '12 at 18:55

3 answers

Bohemian · Answer 1 · 2012-10-25T19:22:38+0000

Given that hy-phen-ated should presumably also match, I think this is one of those cases where a single regular expression is not the right approach.

I would do this (not knowing your language, I have used pseudocode):

remove hyphens and newlines from input
match cleaned input with .*hyphenated.*

Every language can do step 1 trivially, and the code will be much more readable.
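A minimal Python sketch of the two steps above, in case it helps. It makes one assumption that the answer does not: instead of deleting every hyphen and newline, it only removes a hyphen that is followed by a line break (plus surrounding spaces or tabs), so ordinary line breaks between words are left alone. The function name `contains_word` is purely illustrative.

```python
import re

# Assumption: a split word is marked by a hyphen at the end of a line,
# optionally followed by trailing blanks and leading indentation.
_SPLIT = re.compile(r"-[ \t]*\r?\n[ \t]*")

def contains_word(text, word):
    """Return True if `word` occurs in `text`, ignoring end-of-line hyphenation."""
    cleaned = _SPLIT.sub("", text)  # step 1: rejoin words split across lines
    return word in cleaned          # step 2: plain substring search

print(contains_word("This sentence contains a hyphena-\nted word.", "hyphenated"))
```

Because each split is rejoined independently, a word broken more than once (hy-, phen-, ated on three lines) is also found, while "hy-ted" is not.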

David · Answer 2 · 2012-10-25T19:17:16+0000

I think this will work. If you have a lot of words to search for, you will probably want a script that generates the search pattern for you.

 [h\-]+\s*[y\-\s]+[p\-\s]+[h\-\s]+[e\-\s]+[n\-\s]+[a\-\s]+[t\-\s]+[e\-\s]+d\b

I don't think you mentioned which language you are using, but I tested it with .NET.

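Here is a simple Python script that will generate search patterns:

 # patterngen.py
 # Usage: python patterngen.py <word>
 # Example: python patterngen.py hyphenated
 import sys

 word = sys.argv[1]
 # First letter: may be followed by hyphens, then optional whitespace.
 pattern = '[' + word[0] + r'\-]+\s*'
 # Middle letters: each may be mixed with hyphens and whitespace.
 for i in range(1, len(word) - 1):
     pattern = pattern + r'[' + word[i]
     pattern = pattern + r'\-\s]+'
 # Last letter, followed by a word boundary.
 pattern = pattern + word[-1] + r'\b'
 print(pattern)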

famousgarkin · Answer 3 · 2012-10-25T19:39:46+0000

Another way to approach this, off the top of my head, is to "slide" the hyphenation point through the word like this:

 hyphenated|h(-\s*\n\s*)yphenated|hy(-\s*\n\s*)phenated|hyp(-\s*\n\s*)henated|hyph(-\s*\n\s*)enated|hyphe(-\s*\n\s*)nated|hyphen(-\s*\n\s*)ated|hyphena(-\s*\n\s*)ted|hyphenat(-\s*\n\s*)ed|hyphenate(-\s*\n\s*)d

It reads better, but I don't really know how its performance compares with your original pattern.
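For more than one search word, the alternation above can be generated rather than typed out. The following is only a sketch: the helper name `split_alternation` is made up for illustration, and it uses non-capturing groups `(?:...)` where the hand-written pattern uses capturing ones.

```python
import re

SEP = r"-\s*\n\s*"  # hyphen, optional whitespace, line break, optional whitespace

def split_alternation(word):
    """Build word|w(?:SEP)ord|wo(?:SEP)rd|... allowing one split at any position."""
    parts = [re.escape(word)]  # the unsplit word comes first
    for i in range(1, len(word)):
        # Prefix of length i, the hyphen/line-break separator, then the rest.
        parts.append(re.escape(word[:i]) + "(?:" + SEP + ")" + re.escape(word[i:]))
    return "|".join(parts)

pattern = re.compile(split_alternation("hyphenated"))
print(bool(pattern.search("a hyphena-\nted word")))  # True
print(bool(pattern.search("a hy-\nted word")))       # False
```

Like the hand-written version, this allows at most one split per word, so "hy-ted" is rejected because no alternative omits letters.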

Another idea is to narrow the search first using a pattern along these lines:

 h[hypenatd]{0,9}(-\s*\n*\s)?[hypenatd]{0,9}

and then match within the results of that.

In fact, if I'm not mistaken, if you capture with groups like this:

 (h[hypenatd]{0,9})(?:-\s*\n*\s)?([hypenatd]{0,9})

then the occurrences of the word hyphenated are exactly those matches where, in pseudocode:

 (match.group1 + match.group2) == "hyphenated"
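A rough Python sketch of that filter, using the grouped pattern exactly as written above (so it is specific to "hyphenated"). The function name `find_hyphenated` and the sample text are illustrative assumptions, and edge cases such as a candidate starting at an earlier "h" in the same run of letters are not handled.

```python
import re

# Candidate pattern copied from the answer: loose prefix, optional
# hyphen/line-break separator, loose suffix.
CANDIDATE = re.compile(r"(h[hypenatd]{0,9})(?:-\s*\n*\s)?([hypenatd]{0,9})")

def find_hyphenated(text, word="hyphenated"):
    """Return (start, end) spans of candidates whose two groups join to `word`."""
    return [
        m.span()
        for m in CANDIDATE.finditer(text)
        if m.group(1) + m.group(2) == word
    ]

text = "A hyphenated word, a hyphena-\nted word, and hy-\nted."
for start, end in find_hyphenated(text):
    print(repr(text[start:end]))
```

The loose pattern also picks up candidates such as "hy-" followed by "ted", and the string comparison on the two groups is what discards them.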
