How can I match whitespace that is not inside single or double quotes?

I am trying to create a Java regex that will replace every run of whitespace in a string with a single space, except where that whitespace occurs between quotation marks (single or double).

If I only had to handle double quotes, I could use a lookahead:

text.replaceAll("\\s+ (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", " "); 

And if I only had to handle single quotes, I could use a similar pattern.

The trick is handling both.

I had the bright idea of running the double-quote pattern followed by the single-quote pattern, but of course that ended up replacing all the whitespace regardless of the quotes.
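
To make the failure concrete, here is a minimal sketch of the chained approach. It assumes the lookahead follows \s+ directly and that the single-quote pattern is simply the double-quote pattern with the quote character swapped; the class and method names are made up for illustration. Each pass only knows about one kind of quote, so the first pass collapses whitespace inside single-quoted text and the second pass collapses whitespace inside double-quoted text.

```java
final class ChainedLookaheadDemo {
    // Whitespace followed by an even number of double quotes (outside "...").
    static final String OUTSIDE_DOUBLE = "\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
    // Assumed single-quote counterpart: the same pattern with ' in place of ".
    static final String OUTSIDE_SINGLE = "\\s+(?=(?:[^']*'[^']*')*[^']*$)";

    // Runs the two passes one after the other, as described above.
    static String chained(String text) {
        return text.replaceAll(OUTSIDE_DOUBLE, " ")
                   .replaceAll(OUTSIDE_SINGLE, " ");
    }

    public static void main(String[] args) {
        // prints: a "b c" 'd e'   (whitespace inside both quoted parts was collapsed)
        System.out.println(chained("a  \"b  c\"  'd  e'"));
    }
}
```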

So, here are some tests and the expected results:

 abcde --> abcde
 ab "cd" e --> ab "cd" e
 ab 'cd' e --> ab 'cd' e
 ab "cd' e --> ab "cd' e (Can't mix and match quotes)

Is there a way to accomplish this in Java regex?

Assume that invalid input has already been checked for separately, so none of the following will occur:

 a "bc ' d
 a 'b " c' d
 a 'bcd
+5
4 answers

EDIT - Note - This answer has a defect

It requires whitespace between a closing quotation mark ( " or ' ) and the characters that follow it in order to match quoted strings correctly. So " "some-text will not be processed correctly by this answer.

This may have more flaws - but this is one.
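
The defect can be reproduced with a short check. The sketch below applies the regex from this answer and, for comparison, the regex from the alternative answer to an input where a character follows the closing quote immediately. The class name, method names, and sample input are assumptions chosen for illustration.

```java
final class QuoteDefectDemo {
    // Regex from this answer (used with the replacement "$1$2 ").
    static final String FLAWED =
        "([^\\s\"'\\\\]+)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*(\\s+)";
    // Regex from the alternative answer (used with the replacement "$1 ").
    static final String FIXED =
        "\\G((?:[^\\s\"']+| (?!\\s)|\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*+)(\\s+)";

    static String flawed(String s) { return s.replaceAll(FLAWED, "$1$2 "); }
    static String fixed(String s)  { return s.replaceAll(FIXED, "$1 "); }

    public static void main(String[] args) {
        String input = "\"a  b\"c";          // no whitespace after the closing quote
        System.out.println(flawed(input));  // prints: "a b"c   (quoted whitespace collapsed)
        System.out.println(fixed(input));   // prints: "a  b"c  (quoted whitespace preserved)
    }
}
```

When whitespace does follow the closing quote, as in "a  b" c, the regex from this answer matches the quoted string as a whole and leaves its contents alone.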

EDIT - Alternative Answer

I have added another, more optimized answer that does not have this defect.

I am leaving this one here for posterity.

Support

This one supports escaping quotes through \" and \' and multi-line quotes.

Regular expression

 ([^\s"'\\]+)*("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*(\s+) 

https://regex101.com/r/wT6tU2/1

Replacement

$1$2 (yes, there is a space at the end)


Code

 // requires: import java.util.regex.PatternSyntaxException;
 try {
     String resultString = subjectString.replaceAll("([^\\s\"'\\\\]+)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*(\\s+)", "$1$2 ");
 } catch (PatternSyntaxException ex) {
     // Syntax error in the regular expression
 } catch (IllegalArgumentException ex) {
     // Syntax error in the replacement text (unescaped $ signs?)
 } catch (IndexOutOfBoundsException ex) {
     // Non-existent backreference used in the replacement text
 }

Human readable

 // ([^\s"'\\]+)*("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*(\s+)
 //
 // Options: Case sensitive; Exact spacing; Dot doesn't match line breaks; ^$ don't match at line breaks; Default line breaks; Regex syntax only
 //
 // Match the regex below and capture its match into backreference number 1 «([^\s"'\\]+)*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
 // Or, if you don't want to capture anything, replace the capturing group with a non-capturing group to make your regex more efficient.
 // Match any single character NOT present in the list below «[^\s"'\\]+»
 // Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
 // A "whitespace character" (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
 // A single character from the list ""'" «"'»
 // The backslash character «\\»
 // Match the regex below and capture its match into backreference number 2 «("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
 // Or, if you don't want to capture anything, replace the capturing group with a non-capturing group to make your regex more efficient.
 // Match this alternative (attempting the next alternative only if this one fails) «"[^"\\]*(?:\\.[^"\\]*)*"»
 // Match the character """ literally «"»
 // Match any single character NOT present in the list below «[^"\\]*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // The literal character """ «"»
 // The backslash character «\\»
 // Match the regular expression below «(?:\\.[^"\\]*)*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // Match the backslash character «\\»
 // Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
 // Match any single character NOT present in the list below «[^"\\]*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // The literal character """ «"»
 // The backslash character «\\»
 // Match the character """ literally «"»
 // Or match this alternative (the entire group fails if this one fails to match) «'[^'\\]*(?:\\.[^'\\]*)*'»
 // Match the character "'" literally «'»
 // Match any single character NOT present in the list below «[^'\\]*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // The literal character "'" «'»
 // The backslash character «\\»
 // Match the regular expression below «(?:\\.[^'\\]*)*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // Match the backslash character «\\»
 // Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
 // Match any single character NOT present in the list below «[^'\\]*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // The literal character "'" «'»
 // The backslash character «\\»
 // Match the character "'" literally «'»
 // Match the regex below and capture its match into backreference number 3 «(\s+)»
 // Match a single character that is a "whitespace character" (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s+»
 // Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
+3

I would recommend standardizing the string encapsulation: use a regular expression to replace the alternative quote character with the standard one. Say you settle on double quotes. You can then split the string on " so that (counting from zero) all the odd elements are quoted contents and all the even elements are unquoted. Run the whitespace replacement on the even elements only, and rebuild the string from the modified array.
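
A minimal sketch of the split-and-rebuild step, under stated assumptions: the input has already been standardized so that the double quote is the only delimiter, the quotes are balanced, and there are no escaped quotes inside quoted text. The class and method names are hypothetical.

```java
final class SplitOnQuotesDemo {
    // Collapses whitespace runs in the unquoted parts of the input only.
    static String collapseOutsideQuotes(String input) {
        // A limit of -1 keeps trailing empty strings, so the parts rejoin exactly.
        String[] parts = input.split("\"", -1);
        // Even indexes (0, 2, 4, ...) are outside quotes; odd indexes are quoted contents.
        for (int i = 0; i < parts.length; i += 2) {
            parts[i] = parts[i].replaceAll("\\s+", " ");
        }
        return String.join("\"", parts);
    }

    public static void main(String[] args) {
        // prints: a b "c   d" e
        System.out.println(collapseOutsideQuotes("a   b \"c   d\"  e"));
    }
}
```

This keeps the regex trivial, at the cost of a separate normalization pass and no support for escaped quotes.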

+1

Edit: Since @DeanTaylor fixed his regex, I will fix (change) mine as well,
in case someone decides to use it on unbalanced quotes.

The original test for balanced quotes used an atomic group,
but I never added it to the parsing regex. It has now been added, and that is the only change.


You can either match quoted strings and whitespace in an alternation, and then check which group matched to decide what to replace.
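
A minimal sketch of that alternation approach, assuming group 1 captures a complete quoted string and group 2 captures a run of whitespace outside quotes. The class name, method name, and group layout are illustrative choices, not something given in this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationDemo {
    // Group 1: a whole "..." or '...' string (backslash escapes allowed).
    // Group 2: a run of whitespace that was not consumed as part of a quoted string.
    private static final Pattern TOKEN = Pattern.compile(
        "(\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*')|(\\s+)");

    static String collapse(String input) {
        Matcher m = TOKEN.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Quoted strings are copied back unchanged; whitespace runs become one space.
            String replacement = (m.group(1) != null) ? m.group(1) : " ";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: ab "c   d" e
        System.out.println(collapse("ab  \"c   d\"  e"));
    }
}
```

Note that this sketch does not stop at an unbalanced quote: a lone quote character is skipped and the whitespace after it is still collapsed.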

Or use this regular expression, which handles both at once and avoids the decision.

Find: \G((?>"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[^"'\s]+)*)\s+

"\\G((?>\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[^\"'\\s]+)*)\\s+"

Replace: $1<space>

Formatted and tested:

  \G                      # Must match where last match left off
                          # (This will stop the match if there is a quote unbalance)
  (                       # (1 start), quotes or non-whitespace
     (?>                     # Atomic cluster to stop backtracking if quote unbalance
        "
        (?: \\ [\S\s] | [^"\\] )*   # Double quoted text
        "
      |                       # or,
        '
        (?: \\ [\S\s] | [^'\\] )*   # Single quoted text
        '
      |                       # or,
        [^"'\s]+                # Not quotes nor whitespace
     )*                      # End Atomic cluster, do 0 to many times
  )                       # (1 end)
  \s+                     # The whitespaces outside of quotes
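
A minimal sketch of how the find/replace pair above might be called from Java. The class and method names are hypothetical; the pattern is the Java string literal given above.

```java
final class AnchoredCollapseDemo {
    // The "Find" pattern from this answer, as a Java string literal.
    static final String FIND =
        "\\G((?>\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[^\"'\\s]+)*)\\s+";

    // Replaces each whitespace run outside quotes with a single space.
    static String collapse(String input) {
        return input.replaceAll(FIND, "$1 ");
    }

    public static void main(String[] args) {
        // prints: ab "c   d" e
        System.out.println(collapse("ab  \"c   d\"  e"));
        // prints: a "b   c   (matching stops at the unbalanced quote)
        System.out.println(collapse("a  \"b   c"));
    }
}
```

Because of \G, each match has to begin where the previous one ended, so nothing after an unbalanced quote is touched.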

Note: you can check the string for balanced quotes before applying the regular expression above.
The following pattern tests the whole string; if it matches, the quotes are balanced.

^(?>(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*')|[^"']+)+$

"^(?>(?:\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*')|[^\"']+)+$"

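A minimal sketch of running the balance check above before doing any replacement. The class and method names are hypothetical; the pattern is the Java string literal given above.

```java
import java.util.regex.Pattern;

final class BalancedQuotesCheck {
    // The validation pattern from this answer, as a Java string literal.
    private static final Pattern BALANCED = Pattern.compile(
        "^(?>(?:\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*')|[^\"']+)+$");

    // True when every quoted string in the input is properly closed.
    static boolean hasBalancedQuotes(String input) {
        return BALANCED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasBalancedQuotes("a \"b c\" d"));  // prints: true
        System.out.println(hasBalancedQuotes("a \"b ' d"));    // prints: false
        System.out.println(hasBalancedQuotes("a 'b c d"));     // prints: false
    }
}
```
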

Update: tests against @DeanTaylor's new answer.

Example 1 - for the string Word1  Word2 (two spaces between words)

  • This version takes ~ 27 steps
  • The @DeanTaylor version takes ~ 29 steps

Example 2 - for the string 'example'  another_word (two spaces between words)

  • This version takes ~ 51 steps
  • The @DeanTaylor version takes ~ 36 steps (presumably due to the unrolled loop)

Example 3 - for a WordPress file

  • This version takes ~ 315,647 steps
  • The @DeanTaylor version takes 122,701 steps (Dean's version does not process single spaces)

For Example 3, regex101.com would not produce a permanent link for either test.
The page becomes unresponsive, which shows what junk it is.

+1

Support

  • escaping quotes through \" and \' and multi-line quotes.
  • mismatched quotes, where the quoted text runs to the end of a line.
  • additional optimizations for large files

Optimization

Several optimizations to reduce the number of steps:

Example 1 - for the string Word1  Word2 (two spaces between words)

Example 2 - for the string 'example'  another_word (two spaces between words)

Example 3 - for WordPress /wp-includes/media.php file

Regular expression

 \G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+) 

https://regex101.com/r/wT6tU2/4

Replacement

$1 (yes, there is a space at the end)


Code

 // requires: import java.util.regex.PatternSyntaxException;
 try {
     String resultString = subjectString.replaceAll("\\G((?:[^\\s\"']+| (?!\\s)|\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*')*+)(\\s+)", "$1 ");
 } catch (PatternSyntaxException ex) {
     // Syntax error in the regular expression
 } catch (IllegalArgumentException ex) {
     // Syntax error in the replacement text (unescaped $ signs?)
 } catch (IndexOutOfBoundsException ex) {
     // Non-existent backreference used in the replacement text
 }

Human readable

 // \G((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)(\s+)
 //
 // Options: Case sensitive; Exact spacing; Dot doesn't match line breaks; ^$ don't match at line breaks; Default line breaks; Regex syntax only
 //
 // Assert position at the end of the previous match (the start of the string for the first attempt) «\G»
 // Match the regex below and capture its match into backreference number 1 «((?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+)»
 // Match the regular expression below «(?:[^\s"']+| (?!\s)|"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')*+»
 // Between zero and unlimited times, as many times as possible, without giving back (possessive) «*+»
 // Match this alternative (attempting the next alternative only if this one fails) «[^\s"']+»
 // Match any single character NOT present in the list below «[^\s"']+»
 // Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
 // A "whitespace character" (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
 // A single character from the list ""'" «"'»
 // Or match this alternative (attempting the next alternative only if this one fails) « (?!\s)»
 // Match the character " " literally « »
 // Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\s)»
 // Match a single character that is a "whitespace character" (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s»
 // Or match this alternative (attempting the next alternative only if this one fails) «"[^"\\]*(?:\\.[^"\\]*)*"»
 // Match the character """ literally «"»
 // Match any single character NOT present in the list below «[^"\\]*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // The literal character """ «"»
 // The backslash character «\\»
 // Match the regular expression below «(?:\\.[^"\\]*)*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // Match the backslash character «\\»
 // Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
 // Match any single character NOT present in the list below «[^"\\]*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // The literal character """ «"»
 // The backslash character «\\»
 // Match the character """ literally «"»
 // Or match this alternative (the entire group fails if this one fails to match) «'[^'\\]*(?:\\.[^'\\]*)*'»
 // Match the character "'" literally «'»
 // Match any single character NOT present in the list below «[^'\\]*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // The literal character "'" «'»
 // The backslash character «\\»
 // Match the regular expression below «(?:\\.[^'\\]*)*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // Match the backslash character «\\»
 // Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.»
 // Match any single character NOT present in the list below «[^'\\]*»
 // Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 // The literal character "'" «'»
 // The backslash character «\\»
 // Match the character "'" literally «'»
 // Match the regex below and capture its match into backreference number 2 «(\s+)»
 // Match a single character that is a "whitespace character" (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s+»
 // Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
+1

Source: https://habr.com/ru/post/1238571/

