How can I use a regular expression to efficiently match strings between double quotes that contain embedded double quotes?

Question

We have text in which we want to match every string enclosed in double quotes, but such a string can itself contain escaped double quotes. Example:

"He said \"Hello\" to me for the first time"

How can this be matched with a regular expression?

regex

fge Jun 11 '13 at 11:54

1 answer

fge · Answer 1 · 2013-06-11T12:01:34+0000

A very effective solution for matching such inputs is the pattern normal* (special normal*)* ; the name comes from Jeffrey Friedl's excellent book, Mastering Regular Expressions .

This pattern is useful, in general, for matching inputs consisting of regular entries (normal part) with delimiters between them (special part).

Note that, as with anything involving regular expressions, it should be used only when there is no better choice: while you could use this pattern to parse CSV data, for example, if you use Java you are better off using OpenCSV instead.

Also note that although the quantifiers in the pattern name are stars (that is, zero or more), you can vary them to suit your needs.
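As a rough illustration of how the pieces fit together, here is a minimal Python sketch. The helper name unrolled and the comma-separated digits example are hypothetical, made up only for this illustration; the non-capturing group (?:...) is used because only the overall match matters here.

```python
import re

def unrolled(normal: str, special: str,
             first: str = "*", inner: str = "*", outer: str = "*") -> str:
    """Assemble normal<first>(?:special normal<inner>)<outer> from regex fragments.

    Hypothetical helper: the three quantifiers default to the stars of the
    pattern name, but each one can be changed independently.
    """
    return f"{normal}{first}(?:{special}{normal}{inner}){outer}"

# Made-up example: runs of digits (normal) separated by single commas (special),
# with at least one digit required in every run.
DIGIT_LIST = re.compile(unrolled(r"\d", ",", first="+", inner="+"))

print(DIGIT_LIST.pattern)                           # \d+(?:,\d+)*
print(DIGIT_LIST.fullmatch("1,22,333") is not None)  # True
print(DIGIT_LIST.fullmatch("1,,2") is not None)      # False
```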

Strings with embedded double quotes

Take the example above, and note that this sample text can appear anywhere in your input:

 "He said \"Hello\" to me for the first time"

No matter how hard you try, no "dot plus greedy / lazy quantifiers" magic can help you solve this problem. Instead, classify the input between quotation marks as regular and special:

normal is anything but a backslash or a double quote: [^\\"] ;
special is a backslash followed by a double quote: \\" .

Substituting these into the normal* (special normal*)* pattern gives the following regular expression:

 [^\\"]*(\\"[^\\"]*)*

Adding double quotes around it to match the full text gives the final regex:

 "[^\\"]*(\\"[^\\"]*)*"

You will notice that this will also match empty quotation marks.
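To see the final regex at work, here is a small Python sketch. The surrounding sample text and the variable names are assumptions made for illustration; the regex itself is the one given above. A lazy dot pattern is included only for contrast.

```python
import re

# The final regex from above, written as a Python raw string.
QUOTED = re.compile(r'"[^\\"]*(\\"[^\\"]*)*"')

# Illustrative input: the sample string embedded in other text, followed by
# an empty quoted string and a plain one.
text = r'first "He said \"Hello\" to me for the first time" then "" and "plain"'

# finditer + group(0) yields each full match regardless of the capture group.
matches = [m.group(0) for m in QUOTED.finditer(text)]
for s in matches:
    print(s)
# "He said \"Hello\" to me for the first time"
# ""
# "plain"

# For contrast: a lazy dot stops at the first escaped quote.
lazy = re.search(r'".*?"', text).group(0)
print(lazy)
# "He said \"
```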

Dash Separated Words

Here we need to vary the quantifiers, since:

we do not want empty words;
we do not want words to start with a dash;
when a dash appears, at least one letter must follow it before the next dash, if any.

For simplicity, we also assume that only lowercase ASCII characters are allowed.

Input Example:

 the-word-to-match

We decompose again into normal and special:

normal: a lowercase ASCII letter: [a-z] ;
special: a dash: -

The canonical form of the pattern would be:

 [a-z]*(-[a-z]*)*

But, as we said:

we do not want words starting with a dash: the first * should become + ;
when a dash is found, at least one letter must follow it: the second * must become + .

As a result, we get:

 [a-z]+(-[a-z]+)*

Adding word anchors around it gives the final result:

 \b[a-z]+(-[a-z]+)*\b
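A short Python sketch of this regex, using the character class [a-z] for lowercase ASCII letters. The sample sentence and the candidate strings are made up for illustration. Searching finds the words inside running text; fullmatch is one way to validate a whole string, an approach assumed here rather than taken from the answer.

```python
import re

# Dash-separated lowercase words, with word anchors on both sides.
WORD = re.compile(r"\b[a-z]+(-[a-z]+)*\b")

# Illustrative input containing the sample word.
text = "match the-word-to-match here"
found = [m.group(0) for m in WORD.finditer(text)]
print(found)  # ['match', 'the-word-to-match', 'here']

# Validating whole strings: empty words, leading dashes, doubled dashes
# and trailing dashes are all rejected.
candidates = ["the-word-to-match", "", "-word", "two--dashes", "trailing-"]
valid = [c for c in candidates if WORD.fullmatch(c)]
print(valid)  # ['the-word-to-match']
```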

Other operator variations

The examples above only replace * with + , but of course you can vary the quantifiers as much as you like. One ultra-classic example is an IP address:

normal is up to three digits ( \d{1,3} ),
special is the dot ( \. ),
the first normal appears exactly once, so it takes no quantifier,
the normal inside (special normal*) also appears exactly once, so it takes no quantifier either,
finally, the (special normal*) group appears exactly three times, so its quantifier is {3} .

Which gives the following expression (decorated with word anchors):

 \b\d{1,3}(\.\d{1,3}){3}\b
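The following Python sketch tries this regex on a made-up sample line. Note that the regex only checks the shape of the address (four groups of one to three digits), so a group such as 256 still matches. The range check in is_ipv4 is a hypothetical addition for illustration, not part of the answer.

```python
import re

# Four dot-separated groups of one to three digits, with word anchors.
IPV4_SHAPE = re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b")

# Illustrative input.
text = "hosts 192.168.0.1 and 10.0.0.256, but not 1.2.3"
found = [m.group(0) for m in IPV4_SHAPE.finditer(text)]
print(found)  # ['192.168.0.1', '10.0.0.256']

def is_ipv4(candidate: str) -> bool:
    """Hypothetical helper: shape check by regex, range check by arithmetic."""
    if IPV4_SHAPE.fullmatch(candidate) is None:
        return False
    return all(int(part) <= 255 for part in candidate.split("."))

print(is_ipv4("192.168.0.1"))  # True
print(is_ipv4("10.0.0.256"))   # False
```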

Conclusion

The flexibility of this pattern makes it one of the most useful tools in your regex toolbox. Although there are many problems for which you should not use regular expressions when a library exists, in some situations you have no choice but to use them. And this pattern will become one of your best friends once you have practiced with it a little!

Tips

It is more than likely that you do not need (or want) to capture the repeated part (the (special normal*) ); therefore, it is recommended that you use a non-capturing group. For example, use "[^\\"]*(?:\\"[^\\"]*)*" for quoted strings. In fact, even if you wanted to capture it, capturing would almost never produce the desired result in this case, because a repeated capturing group only ever gives you its last match (all previous repetitions are overwritten), unless you use this pattern in .NET. (thanks @ohaal)
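A brief Python sketch of this behaviour; the sample string and variable names are chosen for illustration. It also shows a Python-specific consequence: re.findall returns the group contents, not the full match, as soon as the pattern contains a capturing group.

```python
import re

capturing = re.compile(r'"[^\\"]*(\\"[^\\"]*)*"')
non_capturing = re.compile(r'"[^\\"]*(?:\\"[^\\"]*)*"')

# Illustrative input: the group repeats twice (\"Hello and \" to me).
text = r'"He said \"Hello\" to me"'

m = capturing.search(text)
whole = m.group(0)
last_repetition = m.group(1)
print(whole)            # "He said \"Hello\" to me"
print(last_repetition)  # \" to me

# findall returns group 1 when a capturing group is present,
# and the full match when there is none.
with_group = capturing.findall(text)
without_group = non_capturing.findall(text)
print(with_group[0])     # \" to me
print(without_group[0])  # "He said \"Hello\" to me"
```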
