Regex to match values not surrounded by another char?

This is one of the hardest things I've ever tried to do. I've searched over the years, but I just can't find a way to do this: match a string not surrounded by a given char, for example quotation marks or greater-than/less-than characters.

Such a regular expression could match URLs that are not inside HTML links, SQL table.column values that are not inside quotation marks, and more.

Example with quotes: Match [THIS] and "something with [NOT THIS] followed by" or even [THIS].

Example with <, >, & ": Match [URL] and <a href="[NOT URL]">or [NOT URL]</a>

Example with single quotes: WHERE [THIS] LIKE '%[NOT THIS]'

Basically, how do you match a string (THIS) when it is not surrounded by a given char?

 \b(?:[^"'])([^"']+)(?:[^"'])\b 

Here is a test pattern: a regex like the one I have in mind would match only the first "quote".

To quote: "quote me so that I do not quote you!"

+7
regex
Jul 28 '09 at 0:48
6 answers

The best solution will depend on what you know about the input. For example, if you are looking for things that are not enclosed in double quotes, does that mean the double quotes will always be correctly balanced? Can they be escaped with backslashes, or by enclosing them in single quotes?

Assuming the simplest case, with no nesting and no escaping, you can use a lookahead like this:

 preg_match('/THIS(?=(?:(?:[^"]*+"){2})*+[^"]*+\z)/') 

After finding the target (THIS), the lookahead basically counts the double quotes from that point to the end of the string. If there is an odd number of them, the match must have occurred inside a pair of double quotes, so it is not valid (the lookahead fails).
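A minimal sketch of the same quote-counting lookahead in Python, assuming balanced double quotes and no escaping. The names `OUTSIDE_QUOTES` and `sample` and the sample text are made up for illustration, and plain greedy quantifiers stand in for the possessive ones, which should only affect performance for this pattern.

```python
import re

# THIS followed by optional digits, then a lookahead that requires an even
# number of double quotes between the match and the end of the string.
OUTSIDE_QUOTES = re.compile(r'THIS\d*(?=(?:(?:[^"]*"){2})*[^"]*\Z)')

# Hypothetical sample input with balanced, unescaped quotes.
sample = 'Match THIS1 and "NOT THIS2" but THIS3'

print(OUTSIDE_QUOTES.findall(sample))  # ['THIS1', 'THIS3']
```

THIS2 is rejected because exactly one double quote follows it, so the lookahead cannot pair up the remaining quotes.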

As you have discovered, this problem is not a good fit for regular expressions; that is why all the proposed solutions depend on features that are not found in real regular expressions, such as capturing groups, lookarounds, and reluctant and possessive quantifiers. I would not even try this without possessive quantifiers or atomic groups.

EDIT: To extend this solution to allow for double quotes that can be escaped with backslashes, you just need to replace the parts of the regex that match "anything that is not a double quote":

 [^"] 

with “anything that is not a quote or backslash, or backslash followed by anything”:

 (?:[^"\\]|\\.) 

Since backslash escape sequences are relatively rare, it is worth matching as many unescaped characters as possible while you are in this part of the regular expression:

 (?:[^"\\]++|\\.) 

Putting it all together, the regex becomes:

 '/THIS\d+(?=(?:(?:(?:[^"\\]++|\\.)*+"){2})*+(?:[^"\\]++|\\.)*+$)/' 

Applied to your test string:

 'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" ' + 'but \"THIS6\" is good and \\\\"NOT THIS7\\\\".' 

... it should match 'THIS1', 'THIS3', 'THIS4' and 'THIS6'.
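To check that claim outside PHP, here is a hedged Python sketch of the escape-aware pattern run against the same test string. It assumes the string is the one produced by the single-quoted literal above (so `\"` is a backslash plus a quote and `\\\\` is two backslashes), and it drops the possessive quantifiers because older Python versions do not support them; the alternatives inside the group are mutually exclusive, so the matches should be the same. The names `ESCAPE_AWARE`, `test_string` and `matches` are illustrative.

```python
import re

# "Anything that is not a quote or backslash, or a backslash followed by
# anything", repeated, with real (unescaped) quotes counted in pairs.
ESCAPE_AWARE = re.compile(
    r'THIS\d+(?=(?:(?:(?:[^"\\]|\\.)*"){2})*(?:[^"\\]|\\.)*$)'
)

# Raw strings keep every backslash exactly as typed.
test_string = (
    r'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" '
    r'but \"THIS6\" is good and \\"NOT THIS7\\".'
)

matches = ESCAPE_AWARE.findall(test_string)
print(matches)  # ['THIS1', 'THIS3', 'THIS4', 'THIS6']
```

THIS6 matches because the quotes around it are escaped and so are not counted, while THIS7 does not because the backslashes before its quotes are themselves escaped, which leaves the quotes real.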

+14
Jul 28 '09 at 2:05

This is a bit tricky. There are ways, as long as you do not need to keep track of nesting. For example, to skip over quoted material:

 ^((?:[^"\\]|\\.|"(?:[^"\\]|\\.)*")*?)THIS 

Or, explained piece by piece:

 ^                  Match from the beginning
 (                  Store everything from the beginning in group 1, if I want to do replace
   (?:              Non-grouping aggregation, just so I can repeat it
     [^"\\]         Anything but quote or escape character
     |              or...
     \\.            Any escaped character (ie, \", for example)
     |              or...
     "              A quote, followed by...
     (?:            ...another non-grouping aggregation, of...
       [^"\\]       Anything but quote or escape character
       |            or...
       \\.          Any escaped character
     )*             ...as many times as possible, followed by...
     "              A (closing) quote
   )*?              As many as necessary, but as few as possible
 )                  And this is the end of group 1
 THIS               Followed by THIS
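As a rough illustration of how group 1 can be used, here is a Python sketch that applies this pattern to find and replace the first THIS outside double quotes. The sample text and the names `SKIP_QUOTED`, `text`, `position` and `replaced` are assumptions for the example, not part of the answer.

```python
import re

# Same pattern as above: group 1 lazily consumes plain characters, escaped
# characters and whole quoted strings until an unquoted THIS is reached.
SKIP_QUOTED = re.compile(r'^((?:[^"\\]|\\.|"(?:[^"\\]|\\.)*")*?)THIS')

text = 'say "THIS is quoted" then THIS here'

m = SKIP_QUOTED.match(text)
# The end of group 1 is the offset of the first THIS outside quotes.
position = m.end(1) if m else -1

# Put group 1 back unchanged and replace only the THIS that follows it.
replaced = SKIP_QUOTED.sub(r'\g<1>THAT', text, count=1)
print(position, replaced)
```

Because the pattern is anchored at the start, it finds one occurrence per call; finding every occurrence would mean re-running it on the remainder of the string.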

There are other ways to do this, though perhaps not as flexible. For example, if you want to find THIS only when it is not preceded by "//" or "#" (in other words, when THIS is outside a comment), you can do it like this:

 (?<!(?:#|//).*)THIS 

Here (?<!...) is a negative lookbehind. It does not match these characters, but it checks that they do not appear before THIS.
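Python's built-in `re` module only accepts fixed-width lookbehind, so the pattern above cannot be used there as written. The sketch below swaps in a different technique to get a similar result: it cuts each line at the first comment marker and searches only the part before it. The function name `find_outside_comments` and the sample text are hypothetical, and the sketch assumes "#" and "//" never appear inside string literals.

```python
import re

COMMENT_START = re.compile(r'#|//')

def find_outside_comments(text, target='THIS'):
    """Return offsets of `target` that are not preceded by # or // on the same line."""
    hits = []
    offset = 0
    for line in text.splitlines(keepends=True):
        marker = COMMENT_START.search(line)
        # Keep only the part of the line before the comment marker, if any.
        code_part = line if marker is None else line[:marker.start()]
        for hit in re.finditer(re.escape(target), code_part):
            hits.append(offset + hit.start())
        offset += len(line)
    return hits

sample = 'THIS = 1  # THIS is a comment\nx = THIS // THIS too\n'
print(find_outside_comments(sample))
```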

As for arbitrarily nested structures, for example n ( closed by n ), they cannot be represented by regular expressions. Perl can do it, but that is no longer a regular expression.

+3
Jul 28 '09 at 1:19

Well, regular expressions are just the wrong tool for this, so it’s natural that it’s difficult.

Things "surrounded" by other things are not valid rules for regular grammars. Most (one might say all serious) markup and programming languages are not regular. As long as no nesting is involved, you may be able to simulate a parser with a regular expression, but make sure you understand what you are doing.

For HTML/XML, just use an HTML or XML parser; they exist for almost any language or web framework, and using them usually takes only a few lines of code. For tables, you can use a CSV parser or, in a pinch, roll your own parser that extracts the parts inside/outside the quotes. After extracting the parts you are interested in, you can use simple string comparisons or regular expressions to get your results.
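As one possible illustration of the parser route, here is a short Python sketch that uses the standard library's `html.parser` to collect URLs that appear in text outside `<a>` elements. The class name `BareUrlFinder`, the deliberately simple `URL` pattern and the sample markup are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Deliberately simple URL pattern, for illustration only.
URL = re.compile(r'https?://\S+')

class BareUrlFinder(HTMLParser):
    """Collect URLs found in text that is not inside an <a> element."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == 'a' and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        # Attribute values such as href never reach handle_data, and text
        # inside a link is skipped by the depth check.
        if self.link_depth == 0:
            self.urls.extend(URL.findall(data))

finder = BareUrlFinder()
finder.feed('See http://a.example/x and '
            '<a href="http://b.example/">or http://c.example/</a> done')
finder.close()
print(finder.urls)  # ['http://a.example/x']
```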

+1
Jul 28 '09 at 1:38

See Text::Balanced for Perl, and the Perl FAQ.

+1
Jul 29 '09 at 2:50

Thinking about nested elements ("this and "this"") and backslashed items (\"THIS\"), it does seem true that this is not a job for a regular expression. The only thing I can come up with to solve this would be a regex-like char-by-char parser that would mark $quote_level = ###; whenever it finds and enters a valid quote or sub-quote. That way, at any point in the string you would know whether you are inside a given character, even if it is escaped with a slash or something else.

I think with a char-by-char parser like this, you could mark the string positions where quotes begin and end, so that you can split the string into segments and process only the ones that are outside quotes.
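A minimal Python sketch of such a char-by-char scanner, under the assumption that there is a single quote level (no nesting) and that a backslash escapes the next character. The function name `split_outside_quotes` and the sample string are made up for illustration.

```python
def split_outside_quotes(text, quote='"', escape='\\'):
    """Return the non-empty segments of text that lie outside quoted spans."""
    outside = []
    current = []
    in_quote = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # An escaped character never opens or closes a quote; keep the
            # pair verbatim when we are outside a quoted span.
            if not in_quote:
                current.append(text[i:i + 2])
            i += 2
            continue
        if ch == quote:
            if not in_quote and current:
                # Opening quote: close off the current outside segment.
                outside.append(''.join(current))
                current = []
            in_quote = not in_quote
        elif not in_quote:
            current.append(ch)
        i += 1
    if current:
        outside.append(''.join(current))
    return outside

sample = r'Match THIS1 and "NOT THIS2" but \"THIS6\" is good'
print(split_outside_quotes(sample))
```

Each returned segment can then be searched with a simple string comparison or an ordinary regex, since anything inside quotes has already been dropped.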

Here is an example of how smart this parser would need to be to handle nested levels.

 Match THIS and "NOT THIS" but THIS and "NOT "THIS" or NOT THIS" but \"THIS\" is good.

 //Parser "greedy" looking for nested levels
 Match THIS and " NOT THIS" but THIS and " NOT " THIS" or NOT THIS" but \"THIS\" is good

 //Parser "ungreedy" trying to close nested levels
 Match THIS and " " but THIS and " " THIS " " but \"THIS\" is good.
 NOT THIS NOT or NOT THIS

 //Parser closing levels correctly.
 Match THIS and " " but THIS and " " but \"THIS\" is good.
 NOT THIS NOT " " or NOT THIS
 THIS
0
Jul 29 '09 at 2:45

As Alan M pointed out, you can use a regular expression to check for an odd number of quotes, which tells you whether your position is inside or outside a quoted string. Taking the quotation marks example, we seem to be very close to solving this problem. All that remains is to handle escaped quotes. (I'm sure nested quotes are almost impossible.)

 $string = 'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" but \"THIS6\" is good and \\\\"NOT THIS7\\\\".';
 preg_match_all('/[^"]+(?=(?:(?:(?:[^"\\\]++|\\\.)*+"){2})*+(?:[^"\\\]++|\\\.)*+$)/', $string, $matches);

 Array
 (
     [0] => Match THIS1 and
     [1] => but THIS3 and
     [2] => THIS4
     [3] => but
     [4] => THIS6
     [5] => is good and \\
     [6] => NOT THIS7\
     [7] => .
 )
0
Jul 29 '09 at 22:44


