Possible duplicate:
How to automatically generate a regular expression from a given list of strings?
I have two list of strings ListA and ListB . I need to create a regular expression that will match all rows in ListA and won't match any row in ListB .
- Lines can contain any combination of characters, numbers, and punctuation.
- If a string is displayed in ListA , it will not be in ListB .
- If the string is not in either of these two lists, I don't care what the result of the match should be.
Lists usually contain thousands of lines, and the lines are pretty much alike.
I know a trivial answer to this question, which simply generates a regular expression of the form (Str1)|(Str2)|(Str3) , where StrN is a string from ListA. But I'm looking for a more efficient way to do this.
An ideal solution would be a kind of tool that will take two lists and generate a Java regular expression for this.
Update 1 . Under “effective,” I want to generate an expression that is less than a trivial solution. An ideal algorithm generates the shortest possible expression. Here are some examples.
ListA = { C10 , C15, C195 } ListB = { Bob, Billy }
The perfect expression would be
/^C1.+$/
Another example, pay attention to the third element of ListB
ListA = { C10 , C15, C195 } ListB = { Bob, Billy, C25 }
Perfect expression
/^C[^2]{1}.+$/
Last example
ListA = {A, D, E, F, H} ListB = {B, C, G, I}
The ideal expression is the same as the trivial solution, which
/^(A|D|E|F|H)$/
In addition, I am not looking for the perfect solution, nothing is better than trivial. I thought about creating a list of trivial solutions, and then I tried to combine common substrings, observing that we were not wandering around the ListB territory.
** Update 2 *: I'm not particularly worried about the time it takes to create RegEx, something less than 10 minutes on a modern machine is acceptable