Apache Pig - MATCHES with Multiple Matching Criteria

I am trying to take logical matching criteria, for example:

(("Foo" OR "Foo Bar" OR FooBar) AND ("test" OR "testA" OR "TestB")) OR TestZ 

and apply them as a filter on a file in Pig using

 result = filter inputfields by text matches '(some regex expression here)';

The problem is that I have no idea how to translate the boolean expression above into a regular expression for the matches operator.

I looked at several approaches, and the closest I came was something like this:

 ((?=.*?\bFoo\b | \bFoo Bar\b))(?=.*?\bTestZ\b) 

Any ideas? I also need to do this conversion programmatically, if possible.

Some examples:

a - A quick brown Foo jumped over a lazy test (this should pass because it contains Foo and test)

b - something happens in TestZ (this also passes because it contains TestZ)

c - a quick brown Foo jumped over a lazy dog (this should fail because it contains Foo, but not test, testA or TestB)

thanks

2 answers

Since you are using Pig, you don't need to pack everything into a single complicated regular expression. You can combine Pig's own logical operators (and, or) with a few simple regular expressions, for example:

 T = load 'matches.txt' as (str:chararray);
 F = filter T by ((str matches '.*(Foo|Foo Bar|FooBar).*' and str matches '.*(test|testA|TestB).*') or str matches '.*TestZ.*');
 dump F;
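Since the question also asks about doing the conversion programmatically, here is a minimal Java sketch of how a filter clause of this shape could be generated from a nested AND/OR structure. The class and method names (`PigFilterBuilder`, `anyOf`, `and`, `or`) are hypothetical and not part of Pig. It assumes the terms contain only letters, digits and spaces, so no regex escaping is attempted.

```java
import java.util.List;

public class PigFilterBuilder {

    // One OR-group of literal terms -> "field matches '.*(a|b|c).*'".
    // Assumption: terms are plain letters, digits and spaces only.
    static String anyOf(String field, List<String> terms) {
        for (String t : terms) {
            if (!t.matches("[A-Za-z0-9 ]+")) {
                throw new IllegalArgumentException("unsupported term: " + t);
            }
        }
        return field + " matches '.*(" + String.join("|", terms) + ").*'";
    }

    // Combine two sub-clauses with Pig's boolean "and".
    static String and(String left, String right) {
        return "(" + left + " and " + right + ")";
    }

    // Combine two sub-clauses with Pig's boolean "or".
    static String or(String left, String right) {
        return "(" + left + " or " + right + ")";
    }

    public static void main(String[] args) {
        String clause = or(
                and(anyOf("str", List.of("Foo", "Foo Bar", "FooBar")),
                    anyOf("str", List.of("test", "testA", "TestB"))),
                anyOf("str", List.of("TestZ")));
        System.out.println("F = filter T by " + clause + ";");
    }
}
```

Running `main` prints a filter statement equivalent to the one above; the only difference is that the single-term group is rendered as `'.*(TestZ).*'`, which matches the same strings.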

You can use this regular expression for the matches method.

 ^((?=.*\\bTestZ\\b)|(?=.*\\b(FooBar|Foo Bar|Foo)\\b)(?=.*\\b(testA|testB|test)\\b)).* 
  • Note that "Foo" OR "Foo Bar" OR "FooBar" should be written as FooBar|Foo Bar|Foo rather than Foo|Foo Bar|FooBar, so that the shorter alternative Foo is not tried first in a line containing FooBar or Foo Bar.
  • Also, since a look-ahead is zero-width and consumes no characters, you need .* at the end of the regex so that matches can match the entire string.

Demo

 String[] data = {
     "The quick brown Foo jumped over the lazy test",
     "the was something going on in TestZ",
     "the quick brown Foo jumped over the lazy dog"
 };
 String regex = "^((?=.*\\bTestZ\\b)|(?=.*\\b(FooBar|Foo Bar|Foo)\\b)(?=.*\\b(testA|testB|test)\\b)).*";
 for (String s : data) {
     System.out.println(s.matches(regex) + " : " + s);
 }

Output:

 true : The quick brown Foo jumped over the lazy test
 true : the was something going on in TestZ
 false : the quick brown Foo jumped over the lazy dog
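To do the conversion programmatically, a look-ahead regex of this shape can be assembled from the boolean structure: each OR-group of terms becomes one look-ahead, AND becomes concatenation of look-aheads, and OR between sub-expressions becomes alternation. The following is only a sketch under those assumptions; `LookaheadRegexBuilder` and its methods are hypothetical names, and the terms are taken from the question (including TestB as spelled there). It uses `Pattern.quote` so that terms are treated as literals.

```java
import java.util.List;
import java.util.regex.Pattern;

public class LookaheadRegexBuilder {

    // OR-group of literal terms -> (?=.*\b(?:a|b|c)\b)
    static String anyOf(List<String> terms) {
        StringBuilder alternation = new StringBuilder();
        for (String t : terms) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(t)); // treat each term literally
        }
        return "(?=.*\\b(?:" + alternation + ")\\b)";
    }

    // AND: consecutive look-aheads must all succeed at the same position.
    static String allOf(String... parts) {
        return String.join("", parts);
    }

    // OR between sub-expressions: alternation inside a non-capturing group.
    static String either(String... parts) {
        return "(?:" + String.join("|", parts) + ")";
    }

    // Look-aheads consume nothing, so append .* to let matches() cover the whole string.
    static String toRegex(String expr) {
        return "^" + expr + ".*";
    }

    public static void main(String[] args) {
        String regex = toRegex(either(
                anyOf(List.of("TestZ")),
                allOf(anyOf(List.of("FooBar", "Foo Bar", "Foo")),
                      anyOf(List.of("testA", "TestB", "test")))));
        String[] data = {
            "The quick brown Foo jumped over the lazy test",
            "the was something going on in TestZ",
            "the quick brown Foo jumped over the lazy dog"
        };
        for (String s : data) {
            System.out.println(s.matches(regex) + " : " + s);
        }
    }
}
```

For the three sample lines this gives the same true/true/false results as the hand-written expression. The generated string is meant for Java's `String.matches`; if it is pasted into a Pig script, the backslashes would probably need doubling, as in the expression above.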

Source: https://habr.com/ru/post/1500012/

