Regex on a two-dimensional matrix represented as a string

I am trying to find patterns in a 2D matrix represented as a string. Imagine the following:

// horizontal line String pat1 = "............." + "............." + "............." + "....XXXX....." + "............." + "............."; // vertical line String pat2 = "............." + "......X......" + "......X......" + "......X......" + "......X......" + "............."; 

Finding the first pattern will be trivial, the regular expression will look something like this:

 X+ 

In the second case, this is a little more complicated, but doable, since I know the number of columns and rows of the matrix:

 (X.{`WIDTH - 1`})+ 

When I ran into problems with the correct regular expression, we tried to figure out a way to recognize the following patterns:

 // fixed but unknown number of columns String pat3 = "............." + ".....XXX....." + ".....XXX....." + ".....XXX....." + ".....XXX....." + "............."; // variable number of columns String pat4 = "............." + ".....XXX....." + "....XXXXX...." + "...XXXXXXX..." + ".....XXX....." + "............."; 

I am looking for a way to create a regex pattern equivalent to:

 (X.{`WIDTH - PREVCOUNT`})+ 

Where PREVCOUNT is the length of the last matched pattern (I know that I missed the first X of the fourth line in pat4, but I can live with it). I know there are regular expressions in a regular expression, but I wonder if what I am trying to achieve is possible. Even if it was possible, I also worry about the performance associated with using lookaheads, as I do not quite understand how they work internally.

Is there a way to do this with a single regex check, or do I need to search line by line and then try to check if all X are adjacent?

Edit: As an explanation, I'm trying to find the “drops” of X. As long as there is a continuous X in columns / rows, it can be considered to belong to blob. A few examples:

 String blob1 = "............." + "......XX....." + "....XXXX....." + "...XXXXX....." + ".....XXX....." + "............."; String blob2 = "............." + ".....XXX....." + "....XXXXX....." + "...XXXXXXX..." + "....XXXXX...." + ".....XXX....."; String blob3 = "............." + ".....XXX....." + ".....XXX......" + ".....XXX....." + "............." + "............."; String notblob = "............." + "..XXX........" + "......XXX....." + "..XXX........." + ".............." + "............."; 

My solution doesn't have to be accurate, so I'm trying to use a lousy regex approach.

+6
source share
3 answers

This is not solvable using regular expressions.

Basically, you define a matrix as such:

 0^k1 X^l1 0^m1 0^k2 X^l2 0^m2 0^k3 X^l3 0^m3 000XX000 ^ ^ ^ klm 

Where, 0 ^ a means "character" 0 "is repeated once,"
k means repetitions 0 to X l means repetitions X m means repetitions 0 after X ki + li + mi = row_width, for any i

Now your blob criterion is as follows:

 mi + k(i+1) < row_width ki + m(i+1) < row_width these two conditions should meet for any i 

Regular languages ​​cannot match this pattern, they have no memory, so there is no regular expression for your problem.


The correct solution would include counting connected components for the number of individual components.

+2
source

One of the elegant solutions that, I think, would be to first suppress all single X-sequences, both horizontally and vertically, for example:

 String blob = "....."; blob.replaceAll("([^X])X([^X])", "$1.$2") .replaceAll("([^X].....)X(.....[^X])","$1.$2"); 

Then all other sequences of at least 2 Xes are blocks. Note that in order to overcome the same problem mentioned by sdanzig, you must first deploy the blob with the non-Xes border.

+1
source

I think I understand what you are trying to do here. The “prediction” you define is not enough information to match the pattern. You must take into account the “next width” to determine the number of points to check. However, I'm not sure if you really check even a trivial pattern. X + will match 5 X per line. And in your second pattern, the first or last line can be two Xs, and you would not notice this.

So here you can provide a similar check with pat3:

 (X{3}.{`WIDTH-3`})+ 

I probably broke another tab by repeating pattern X, but you need to do this to keep the pattern repeating according to the “X-block” that starts and stops.

pat4 is even harder. There is no real way to keep your checks in order by checking one line at a time. You can do it:

 (X{3}.{`WIDTH-4`}|X{5}.{`WIDTH-6`}|X{5}.{`WIDTH-6`}|X{3}.{`WIDTH-5`})+ 

But then you will be vulnerable to check the matrix with the rows included, and the points will be changed on each side of the X-blocks for placement. However, you can try checking all the lines at once:

 (X{3}.{`WIDTH-4`}X{5}.{`WIDTH-6`}X{5}.{`WIDTH-6`}X{3}.{`WIDTH-5`}) 

And it will not affect performance. Perhaps this would be more efficient because you only impose the overhead of starting compilation of the regex pattern + once.

Trivial note: If you use the width of the matrix for a multi-line row, it will not work. You need to add one to accommodate the newline character. Then you need to make sure your "." captures a newline character. In Java, you can use Pattern.DOTALL for this.

0
source

Source: https://habr.com/ru/post/957272/


All Articles