A very effective solution for matching such inputs is to use the pattern normal* (special normal*)* (the name is taken from Jeffrey Friedl's excellent book, Mastering Regular Expressions).
This pattern is useful, in general, for matching inputs consisting of regular entries (normal part) with delimiters between them (special part).
Note that, like all regular expressions, it should be used only when there is no better choice; while you can use this pattern to parse CSV data, for example, if you use Java you are better off using OpenCSV instead.
Also note that although the quantifiers in the pattern name are stars (that is, zero or more), you can vary them to suit your needs.
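As a rough illustration of how the three quantifiers can be varied, here is a minimal Python sketch. The helper name build_pattern and its parameters are hypothetical, introduced only for this example, and the comma-separated digits are an assumed input rather than one of the cases discussed in this article. The sketch also uses a non-capturing group, (?:...), for the repeated part.

```python
import re

def build_pattern(normal, special, first="*", inner="*", repeat="*"):
    """Assemble normal* (special normal*)* with adjustable quantifiers.

    Hypothetical helper for illustration: `normal` and `special` are regex
    fragments, the other arguments are the quantifiers applied to them.
    """
    return f"{normal}{first}(?:{special}{normal}{inner}){repeat}"

# Canonical form: every quantifier is a star.
canonical = build_pattern(r"\d", ",")

# Variation: require at least one digit in each entry.
strict = build_pattern(r"\d", ",", first="+", inner="+")

print(canonical)                               # \d*(?:,\d*)*
print(strict)                                  # \d+(?:,\d+)*
print(bool(re.fullmatch(strict, "1,22,333")))  # True
print(bool(re.fullmatch(strict, "1,,2")))      # False
print(bool(re.fullmatch(canonical, "1,,2")))   # True
```

The canonical form accepts empty entries, which is why the later examples tighten some of the stars into pluses.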
Strings with embedded double quotes
Take the following example; note that this sample text can appear anywhere in your input:
"He said \"Hello\" to me for the first time"
No matter how hard you try, no "dot plus greedy / lazy quantifiers" magic can help you solve this problem. Instead, classify the input between the quotation marks as normal and special:
- normal is anything but a backslash or a double quote: [^\\"]
- special is a backslash followed by a double quote: \\"
Substituting these into the normal* (special normal*)* pattern gives the following regular expression:
[^\\"]*(\\"[^\\"]*)*
Adding double quotes around it to match the full text gives the final regex:
"[^\\"]*(\\"[^\\"]*)*"
You will notice that this will also match empty quotation marks.
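To see the final regex at work, here is a small Python sketch. The sample text and the variable names are assumptions made for the example; the regex itself is the one derived above, written as a Python raw string.

```python
import re

# normal = [^\\"], special = \\" (a backslash followed by a double quote)
QUOTED = re.compile(r'"[^\\"]*(\\"[^\\"]*)*"')

# Assumed sample: the quoted text sits in the middle of other input,
# followed by a pair of empty quotation marks.
sample = r'before "He said \"Hello\" to me for the first time" after ""'

# group(0) is the whole match, quotes included.
quoted = [m.group(0) for m in QUOTED.finditer(sample)]

for q in quoted:
    print(q)
# "He said \"Hello\" to me for the first time"
# ""
```

The second match shows the point made above: the empty pair of quotation marks is matched too.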
Dash Separated Words
Here we will need to vary the quantifiers, since:
- we don't want empty words,
- we don't want words to start with a dash,
- when a dash appears, at least one letter must follow it before the next dash, if any.
For simplicity, we also assume that only lowercase ASCII characters are allowed.
Input Example:
the-word-to-match
We decompose again into normal and special:
- normal: a lowercase ASCII letter: [a-z]
- special: a dash: -
The canonical form of the pattern would be:
[a-z]*(-[a-z]*)*
But, as we said:
- we do not want words starting with a dash: the first * should become +;
- when a dash is found, at least one letter must follow it: the second * must become +.
As a result, we get:
[a-z]+(-[a-z]+)*
Adding word anchors around it gives the final result:
\b[a-z]+(-[a-z]+)*\b
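The effect of switching the two stars to pluses can be checked with a short Python sketch. The test strings below are assumptions chosen for the example.

```python
import re

CANONICAL = re.compile(r"[a-z]*(-[a-z]*)*")      # all stars
DASHED = re.compile(r"\b[a-z]+(-[a-z]+)*\b")     # tightened and anchored

# The all-star form accepts leading, trailing and doubled dashes.
print(bool(CANONICAL.fullmatch("-a--")))         # True
print(bool(DASHED.fullmatch("-a--")))            # False
print(bool(DASHED.fullmatch("the-word-to-match")))  # True

# In running text, a doubled dash splits the word in two.
words = [m.group(0) for m in DASHED.finditer("the-word-to-match and a--b")]
print(words)  # ['the-word-to-match', 'and', 'a', 'b']
```

Note that plain words without any dash also match, since the repeated group may occur zero times.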
Other operator variations
The above examples only replace * with +, but of course you can have as many variations as you like. One ultra-classic example is an IP address:
- normal is up to three digits (\d{1,3}),
- special is the dot (\.),
- the first normal appears exactly once, so it takes no quantifier,
- the normal inside (special normal*) also appears exactly once, so it takes no quantifier either,
- finally, (special normal*) is repeated exactly three times, so its quantifier is {3}.
This gives the expression (decorated with word anchors):
\b\d{1,3}(\.\d{1,3}){3}\b
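A brief Python sketch of this expression in use follows. The sample text is an assumption, and so is the range check at the end: the regex only checks the shape (four groups of one to three digits), so values above 255 still match, and filtering them out is an extra step added here for illustration.

```python
import re

IP_LIKE = re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b")

text = "hosts: 192.168.0.1, 10.0.0.256 and 1.2.3"

# Everything shaped like four dot-separated groups of 1 to 3 digits.
candidates = [m.group(0) for m in IP_LIKE.finditer(text)]
print(candidates)  # ['192.168.0.1', '10.0.0.256']

# Assumed extra step: keep only candidates whose parts are all <= 255.
valid = [ip for ip in candidates if all(int(p) <= 255 for p in ip.split("."))]
print(valid)  # ['192.168.0.1']
```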
Conclusion
The flexibility of this pattern makes it one of the most useful tools in your regex toolbox. Although regular expressions should be avoided for many problems when a library exists, some situations do call for them, and this pattern will become one of your best friends once you have practiced with it a little!
Tips
- It is more than likely that you do not need (or want) to capture the repeated part, (special normal*); it is therefore recommended to use a non-capturing group. For example, use "[^\\"]*(?:\\"[^\\"]*)*" for quoted strings. In fact, even if you wanted to, capturing would almost never produce the desired result here, because a repeated capturing group only ever keeps its last match (all previous repetitions are overwritten), unless you use this pattern in .NET. (thanks @ohaal)
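The difference between the two groupings can be observed with a minimal Python sketch. The input string is an assumption chosen so that the repeated part occurs twice.

```python
import re

CAPTURING = re.compile(r'"[^\\"]*(\\"[^\\"]*)*"')
NON_CAPTURING = re.compile(r'"[^\\"]*(?:\\"[^\\"]*)*"')

text = r'"a\"b\"c"'  # the repeated part matches \"b and then \"c

# Only the last repetition survives in the capturing group.
m = CAPTURING.fullmatch(text)
print(m.group(1))                    # \"c

# findall returns the group's content when the pattern has a group,
# and the whole match when it has none.
print(CAPTURING.findall(text))       # ['\\"c']
print(NON_CAPTURING.findall(text))   # ['"a\\"b\\"c"']
```

With the non-capturing version, findall returns the full quoted string directly, which is usually what is wanted.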