What is the difference between [\s\S]*? and .*? in Java regular expressions?

I wrote a regex to identify an XML block inside a text file. The expression is as follows (I removed the extra backslashes that Java string escaping requires, so that it is easier to read):

<\?xml\s+version="[\d\.]+"\s*\?>\s*<\s*rdf:RDF[^>]*>[\s\S]*?<\s*\/\s*rdf:RDF\s*> 

Then I optimized it by replacing [\s\S]*? with .*? and it suddenly stopped recognizing the XML.

As far as I know, \s matches all whitespace characters and \S matches all non-whitespace characters (that is, [^\s]), so [\s\S] should logically be equivalent to the dot. I have not used greedy quantifiers, so what could the difference be?

+5
2 answers

The regular expressions . and [\s\S] are not equivalent, because by default . does not match line terminators (for example, a newline).

According to the Oracle documentation, . matches:

Any character (may or may not match line terminators)

where a line terminator is any of the following:

  • A newline (line feed) character ( '\n' ),
  • A carriage-return character followed immediately by a newline character ( "\r\n" ),
  • A standalone carriage-return character ( '\r' ),
  • A next-line character ( '\u0085' ),
  • A line-separator character ( '\u2028' ), or
  • A paragraph-separator character ( '\u2029' ).
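
To make the list above concrete, here is a minimal sketch that tries each of these line terminators against both forms. The class name, the method names, and the sample strings ("a" + terminator + "b") are made up for illustration; only java.util.regex.Pattern comes from the JDK.

```java
import java.util.regex.Pattern;

public class DotVsLineTerminators {
    // The line terminators from the list above, as Java string literals.
    static final String[] TERMINATORS = {"\n", "\r\n", "\r", "\u0085", "\u2028", "\u2029"};

    // Does "a.*b" match "a" + terminator + "b" in the default mode (no flags)?
    static boolean dotMatches(String terminator) {
        return Pattern.matches("a.*b", "a" + terminator + "b");
    }

    // Same check, but with the [\s\S] character class instead of the dot.
    static boolean classMatches(String terminator) {
        return Pattern.matches("a[\\s\\S]*b", "a" + terminator + "b");
    }

    public static void main(String[] args) {
        for (String t : TERMINATORS) {
            // Print the code points so invisible characters stay readable.
            StringBuilder codePoints = new StringBuilder();
            for (char c : t.toCharArray()) {
                codePoints.append(String.format("U+%04X ", (int) c));
            }
            System.out.println(codePoints + "dot=" + dotMatches(t)
                    + " class=" + classMatches(t));
        }
    }
}
```

With no flags set, the dot version should fail for every terminator, while the [\s\S] version should succeed for all of them.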

The two expressions are not equivalent unless the necessary flags are set. Again quoting the Oracle documentation:

If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.

The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.
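
A small sketch of how those two flags change the behaviour of the dot. The class name, the helper method, and the test strings are hypothetical; Pattern.DOTALL, Pattern.UNIX_LINES, and the embedded (?s) flag are part of java.util.regex.

```java
import java.util.regex.Pattern;

public class DotFlagsDemo {
    // Compile the regex with the given flags and test it against the whole input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String withLf = "a\nb"; // line feed between a and b
        String withCr = "a\rb"; // carriage return between a and b

        // Default mode: the dot refuses the line feed.
        System.out.println(matches("a.b", 0, withLf));                  // false
        // DOTALL: the dot matches any character, including line terminators.
        System.out.println(matches("a.b", Pattern.DOTALL, withLf));     // true
        // (?s) is the embedded-flag spelling of DOTALL.
        System.out.println(matches("(?s)a.b", 0, withLf));              // true
        // UNIX_LINES: only the line feed counts as a line terminator,
        // so the dot now accepts a carriage return...
        System.out.println(matches("a.b", Pattern.UNIX_LINES, withCr)); // true
        // ...but still refuses the line feed itself.
        System.out.println(matches("a.b", Pattern.UNIX_LINES, withLf)); // false
    }
}
```

Note that UNIX_LINES alone would not fix the pattern from the question, since the XML block presumably contains '\n'; DOTALL (or keeping [\s\S]) is what lets the match cross lines.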

+6

Here is a cheat sheet explaining all the regex constructs.

Basically, [\s\S] matches every character, including newlines, whereas . does not match line terminators by default (you have to set a flag for that).
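
Applied to the expression from the question, the difference can be sketched as follows. The sample text, the class name, and the method names are invented for illustration; the regex pieces are the ones from the question, with Java string escaping restored.

```java
import java.util.regex.Pattern;

public class RdfBlockFinder {
    // Head and tail of the regex from the question, escaped for a Java string.
    static final String HEAD =
            "<\\?xml\\s+version=\"[\\d\\.]+\"\\s*\\?>\\s*<\\s*rdf:RDF[^>]*>";
    static final String TAIL = "<\\s*\\/\\s*rdf:RDF\\s*>";

    // Hypothetical multi-line input with an RDF block in the middle.
    static final String SAMPLE =
            "some log line\n"
            + "<?xml version=\"1.0\"?>\n"
            + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
            + "  <rdf:Description/>\n"
            + "</rdf:RDF>\n"
            + "trailing text";

    // Search SAMPLE using the given middle part and compile flags.
    static boolean found(String middle, int flags) {
        return Pattern.compile(HEAD + middle + TAIL, flags).matcher(SAMPLE).find();
    }

    public static void main(String[] args) {
        System.out.println(found("[\\s\\S]*?", 0));        // original form: crosses newlines
        System.out.println(found(".*?", 0));               // "optimized" form: stops at a newline
        System.out.println(found(".*?", Pattern.DOTALL));  // dot with DOTALL: crosses newlines again
    }
}
```

On this sample the first and third searches should find the block and the second should not, because the body between the opening and closing rdf:RDF tags spans several lines.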

+2

Source: https://habr.com/ru/post/1242464/

