Create a regex to match strings from list A but not from list B

Possible duplicate:
How to automatically generate a regular expression from a given list of strings?

I have two list of strings ListA and ListB . I need to create a regular expression that will match all rows in ListA and won't match any row in ListB .

  • Lines can contain any combination of characters, numbers, and punctuation.
  • If a string is displayed in ListA , it will not be in ListB .
  • If the string is not in either of these two lists, I don't care what the result of the match should be.

Lists usually contain thousands of lines, and the lines are pretty much alike.

I know a trivial answer to this question, which simply generates a regular expression of the form (Str1)|(Str2)|(Str3) , where StrN is a string from ListA. But I'm looking for a more efficient way to do this.

An ideal solution would be a kind of tool that will take two lists and generate a Java regular expression for this.

Update 1 . Under “effective,” I want to generate an expression that is less than a trivial solution. An ideal algorithm generates the shortest possible expression. Here are some examples.

 ListA = { C10 , C15, C195 } ListB = { Bob, Billy } 

The perfect expression would be

 /^C1.+$/ 

Another example, pay attention to the third element of ListB

 ListA = { C10 , C15, C195 } ListB = { Bob, Billy, C25 } 

Perfect expression

 /^C[^2]{1}.+$/ 

Last example

ListA = {A, D, E, F, H} ListB = {B, C, G, I}

The ideal expression is the same as the trivial solution, which

 /^(A|D|E|F|H)$/ 

In addition, I am not looking for the perfect solution, nothing is better than trivial. I thought about creating a list of trivial solutions, and then I tried to combine common substrings, observing that we were not wandering around the ListB territory.

** Update 2 *: I'm not particularly worried about the time it takes to create RegEx, something less than 10 minutes on a modern machine is acceptable

+4
source share
1 answer

If it is guaranteed that there will not be a single line in both lists, and you do not need lines that are in neither of them, you just need to match the lines in ListA; you can completely ignore ListB.

The “trivial answer” you mentioned is a perfectly reasonable solution. When you say you want a "more efficient" way, do you mean an efficient way to generate the regular expression itself or a way to generate a regular expression that matches lines more efficiently?

  • If you want to efficiently generate a regular expression, most string objects in languages ​​provide a way to combine a list of strings with a separator string (such as a comma) to create a single string. You really cannot become more effective than that.
  • If you want your expression to be true to reality, make sure you “compile” it before using it. (Most regex libraries have a function for this.) Compiling a regex means creating a state machine that the regex library actually uses to match operations. Any decent regex library should do a decent job of optimizing FSM, for example, matching common substrings if possible in the same FSM state.

Alternatively, you can completely dispense with regular expressions and simply iterate over ListA and compare each line of its own with the candidate line. In this case, individual comparisons can be faster, since finding exact matches of strings can compare strings in 4 or 8-byte fragments, while the regular expression should look at each character individually. But if you have many lines for comparison, you will repeatedly move the candidate line in memory. A regular expression, in contrast, can cross a candidate string once to see if it matches.

Try both approaches. See which is faster.

+1
source

Source: https://habr.com/ru/post/1439702/


All Articles