Java: .* will not match

I have the following line in the file

00241386002|5296060|0|1|ClaimNote|29DEC2005:10:20:13.557194|JAR007| 

I'm trying to match it with

 line.matches("^\\d+\\|\\d+\\|\\d+\\|\\d+.+$") 

This pattern matches the previous ~10k or so lines in the file, including the line immediately before this one. However, it does not match this line. Even

 line.matches(".*") 

returns false.

Any help would be appreciated.

edits:

  • The lines are read with a BufferedReader, so \r and \n are stripped.
  • Already tried a clean and build, no dice.

Answer:

  • Thanks to Pshemo for the answer in the first comment. (?d).* (Unix lines mode) also works. The line ended with "\u0085", which the BufferedReader did not strip but which the pattern treats as a line terminator.
1 answer

Problem

The \d+\|\d+\|\d+\|\d+ part of your regular expression seems to work fine, which suggests that the problem is in the .* part.

Let's check which characters . cannot match by default, since those could prevent matches from returning true.
(I will test only characters in the range 0 - FFFF, but Unicode has more characters than that, so I am not saying these are the only characters that cannot be matched. Even if they are today, we cannot be sure about the future.)

 // Print every char in the range 0 - FFFF that '.' does not match by default
 for (int ch = 0; ch <= 0xFFFF; ch++) {
     if (!Character.toString((char) ch).matches(".*")) {
         System.out.format("%-4d hex: \\u%04x %n", ch, ch);
     }
 }

As a result we get (I added some comments):

10 hex: \u000a - line feed (\n)
13 hex: \u000d - carriage return (\r)
133 hex: \u0085 - next line (NEL)
8232 hex: \u2028 - line separator
8233 hex: \u2029 - paragraph separator

Therefore, I suspect that your string contains one of these characters. Not all tools treat these characters as line separators the way the regex engine does. For example, let's test BufferedReader:

 String data = "AAA\nBBB\rCCC\u0085DDD\u2028EEE\u2029FFF";
 BufferedReader br = new BufferedReader(new StringReader(data));
 String line = null;
 while ((line = br.readLine()) != null) {
     System.out.println(line);
 }

We get the result:

 AAA
 BBB
 CCCDDD EEE FFF
    ⬑ here we have `\u0085` (NEL)

As you can see, tools that are not based on the regex engine can return what they consider a single line, even though it still contains characters that the regex engine treats as line separators.
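To find out which of these characters is hiding in a given line, one option is to scan the string and report every character from the list above. This is only a rough sketch: the class `LineTerminatorFinder` and its methods are hypothetical names made up for illustration, and the sample input is a shortened tail of the line from the question with a NEL appended.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of the original answer.
class LineTerminatorFinder {

    // True for the five characters that '.' does not match by default.
    static boolean isRegexLineTerminator(char c) {
        return c == '\n' || c == '\r'
                || c == '\u0085' || c == '\u2028' || c == '\u2029';
    }

    // Returns one entry per offending character: its index and its code point in hex.
    static List<String> findTerminators(String line) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (isRegexLineTerminator(c)) {
                found.add(String.format("index %d: \\u%04x", i, (int) c));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed sample: end of the problem line followed by NEL (U+0085).
        String line = "JAR007|\u0085";
        System.out.println(findTerminators(line)); // prints [index 7: \u0085]
    }
}
```

Running this kind of check on the line returned by readLine() should show whether a stray NEL, line separator, or paragraph separator survived the read.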

Possible solutions

We can let . match any character. To do this, we can use the Pattern.DOTALL flag (it can also be enabled by adding (?s) to the regex, for example (?s).* ).
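A minimal sketch of the DOTALL option, assuming the problem line ends with a NEL as described in the question. The class name `DotAllDemo` and its field names are made up for illustration; the pattern is the one from the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class DotAllDemo {

    // The pattern from the question, unchanged.
    static final Pattern WITHOUT_DOTALL =
            Pattern.compile("^\\d+\\|\\d+\\|\\d+\\|\\d+.+$");

    // The same pattern with DOTALL enabled through the inline (?s) flag.
    static final Pattern WITH_DOTALL =
            Pattern.compile("(?s)^\\d+\\|\\d+\\|\\d+\\|\\d+.+$");

    // Assumed input: the line from the question followed by NEL (U+0085).
    static final String LINE =
            "00241386002|5296060|0|1|ClaimNote|29DEC2005:10:20:13.557194|JAR007|\u0085";

    public static void main(String[] args) {
        System.out.println(WITHOUT_DOTALL.matcher(LINE).matches()); // false
        System.out.println(WITH_DOTALL.matcher(LINE).matches());    // true
    }
}
```

Passing Pattern.DOTALL as the second argument of Pattern.compile should behave the same as the inline (?s) flag used here.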

In addition, as you already mentioned in your question, we can put the regex engine into Pattern.UNIX_LINES mode (the (?d) flag), which makes it treat only \n as a line separator (other characters such as \r are no longer considered line separators).
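The UNIX_LINES variant can be sketched the same way. The class `UnixLinesDemo` and its method are hypothetical names; the point is only that with (?d) the dot matches \r and NEL, while \n remains a line separator.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class UnixLinesDemo {

    // ".*" compiled in UNIX_LINES mode; equivalent to the regex "(?d).*".
    static final Pattern ANY_UNIX_LINE = Pattern.compile(".*", Pattern.UNIX_LINES);

    static boolean matchesWholeInput(String s) {
        return ANY_UNIX_LINE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWholeInput("JAR007|\u0085")); // true: NEL is matched by '.'
        System.out.println(matchesWholeInput("AAA\rBBB"));      // true: \r is matched by '.'
        System.out.println(matchesWholeInput("AAA\nBBB"));      // false: \n is still a line separator
    }
}
```

Since BufferedReader.readLine() already strips \n, \r, and \r\n, this mode is usually enough for lines obtained that way; DOTALL is the broader option when the input may still contain \n.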


Source: https://habr.com/ru/post/1201669/
