Java Scanner newline parsing with regex (bug?)

I am writing a parser by hand in Java, and I would like to use regular expressions to parse the various token types. The catch is that I also want to be able to accurately report the current line number when the input does not conform to the syntax.

In short, I have run into a problem when trying to match a newline with the Scanner class. To be specific, when I try to match a newline against a pattern using Scanner, it fails. Almost always. But when I perform the same match using a Matcher and the same source string, it finds the newlines exactly as you would expect. Is there a reason for this that I cannot see, or is it a bug, as I suspect?

FYI: I could not find a bug in the Sun bug database that describes this problem, so if it is a bug, it has not been reported.

Code example:

    Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
    String sourceString = "\r\n\n\r\r\n\n";

    Scanner scan = new Scanner(sourceString);
    scan.useDelimiter("");
    int count = 0;
    while (scan.hasNext(newLinePattern)) {
        scan.next(newLinePattern);
        count++;
    }
    System.out.println("found "+count+" newlines"); // finds 7 newlines

    Matcher match = newLinePattern.matcher(sourceString);
    count = 0;
    while (match.find()) {
        count++;
    }
    System.out.println("found "+count+" newlines"); // finds 5 newlines
+4
4 answers

Your combination of useDelimiter() and next() is at fault. useDelimiter("") makes next() return substrings of length 1, because an empty string does indeed sit between every two characters.

That is, because "\r\n".equals("\r" + "" + "\n"), "\r\n" is actually two tokens, "\r" and "\n", delimited by "".
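The tokenization described above can be observed directly. The following is a minimal sketch, not part of the original answer; the class name EmptyDelimiterDemo and the helpers tokens and escape are made up for illustration. It collects every token that a Scanner with an empty delimiter produces from the question's source string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class: shows what tokens an empty delimiter produces.
public class EmptyDelimiterDemo {

    // Collects every token the Scanner yields when the delimiter is "".
    static List<String> tokens(String source) {
        List<String> result = new ArrayList<>();
        try (Scanner scan = new Scanner(source)) {
            scan.useDelimiter("");
            while (scan.hasNext()) {
                result.add(scan.next());
            }
        }
        return result;
    }

    // Makes control characters visible when printing.
    static String escape(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // Same source string as in the question.
        for (String token : tokens("\r\n\n\r\r\n\n")) {
            System.out.println("[" + escape(token) + "]");
        }
    }
}
```

Each printed token is a single character, so "\r\n" never reaches hasNext(Pattern) as one unit.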

To get the Matcher behavior, you need findWithinHorizon , which ignores the delimiters.

    Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
    String sourceString = "\r\n\n\r\r\n\n";

    Scanner scan = new Scanner(sourceString);
    int count = 0;
    while (scan.findWithinHorizon(newLinePattern, 0) != null) {
        count++;
    }
    System.out.println("found "+count+" newlines"); // finds 5 newlines

API Links

  • findWithinHorizon(Pattern pattern, int horizon)

    Attempts to find the next occurrence of the specified pattern [...] ignoring delimiters [...] If no such pattern is detected then the null is returned [...] If horizon is 0, then [...] this method continues to search through the input looking for the specified pattern without bound.


+6

This is essentially the expected behavior of both. The Scanner is primarily concerned with splitting the input into tokens using your delimiter. So it (lazily) takes your sourceString and treats it as the following sequence of tokens: \r , \n , \n , \r , \r , \n and \n . When you call hasNext, it checks whether the next token matches your pattern (which they all do, trivially, thanks to the ? in \r\n? ). The while loop therefore iterates over each of the 7 tokens.

The Matcher, on the other hand, matches the regular expression greedily, so it groups \r\n together as you expect.
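Since the Matcher treats \r\n as a single terminator, it can also serve the question's goal of reporting line numbers. The following is a rough sketch under that assumption, not something taken from the answers; the class LineNumberSketch and the method lineOf are hypothetical names. It converts a character offset into a 1-based line number by counting the terminators that end at or before that offset.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps a character offset to a 1-based line number.
public class LineNumberSketch {

    // Same alternation as in the question: \r\n, a lone \r, or a lone \n.
    private static final Pattern NEWLINE = Pattern.compile("\\r\\n?|\\n");

    static int lineOf(String source, int offset) {
        Matcher m = NEWLINE.matcher(source);
        int line = 1;
        // Count each terminator that ends at or before the offset.
        while (m.find() && m.end() <= offset) {
            line++;
        }
        return line;
    }

    public static void main(String[] args) {
        String source = "a\r\nb\nc";
        System.out.println(lineOf(source, 0)); // offset of 'a'
        System.out.println(lineOf(source, 3)); // offset of 'b'
        System.out.println(lineOf(source, 5)); // offset of 'c'
    }
}
```

Rescanning from the start on every call is wasteful for large inputs; a hand-written parser would more likely keep a running count as it consumes input.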

One way to highlight the Scanner's behavior is to change your regular expression to (\\r\\n|\\n) . This results in a count of 0, because the scanner reads the first token as \r (not \r\n ), notices that it does not match your pattern, and so returns false from hasNext .

(Short version: the scanner tokenizes using your delimiter before applying your token pattern; the matcher does not do any form of tokenization.)
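The experiment with the two patterns can be reproduced with a small sketch like the one below. The class name StrictPatternDemo and the method countTokens are invented for this illustration; the loop itself mirrors the one in the question.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical demo class: runs the question's Scanner loop with a given regex.
public class StrictPatternDemo {

    // Counts how many leading single-character tokens match the pattern.
    static int countTokens(String source, String regex) {
        Pattern pattern = Pattern.compile(regex);
        int count = 0;
        try (Scanner scan = new Scanner(source)) {
            scan.useDelimiter("");
            while (scan.hasNext(pattern)) {
                scan.next(pattern);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String source = "\r\n\n\r\r\n\n";
        // Optional \n: every single-character token matches.
        System.out.println(countTokens(source, "(\\r\\n?|\\n)"));
        // Mandatory \n after \r: the first token, a lone \r, does not match.
        System.out.println(countTokens(source, "(\\r\\n|\\n)"));
    }
}
```

The second call stops immediately because hasNext(Pattern) tests the whole next token, and a lone "\r" cannot satisfy either alternative.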

+3

It may be worth mentioning that your example is ambiguous. It could be read as:

 \r \n \n \r \r \n \n 

(seven lines)

or

 \r\n \n \r \r\n \n 

(five lines)

The ? quantifier you used is a greedy quantifier, which would probably make five the correct answer. But because the Scanner iterates over tokens (in your case individual characters, because of the delimiter pattern you chose), it ends up matching one character at a time, arriving at the incorrect answer of seven.

+2

When you use a Scanner with the delimiter "" , it produces tokens that each contain exactly one character. This happens before your newline regular expression is applied. It then matches each of those characters against the newline regular expression; every one of them matches, so it yields 7 tokens. However, because it splits the string into 1-character tokens, it never groups adjacent \r\n characters into a single token.

0

Source: https://habr.com/ru/post/1310311/

