Matching regular expression spaces, but not in "strings"

Question


I am looking for a regular expression that matches spaces only if those spaces are not enclosed in double quotes ("). For example, in

Mary had "a little lamb"

it should match the first and the second space, but not the others.

I want to split the string only at the spaces that are not inside double quotes, and not at the quotes themselves.

I use C++ with the Qt toolkit and wanted to use QString::split(QRegExp). QString is very similar to std::string, and QRegExp is basically a POSIX regex encapsulated in a class. If such a regex exists, the split would be trivial.

Examples:

 Mary had "a little lamb" => Mary,had,"a little lamb"
 1" 2 "3 => 1" 2 "3 (no splitting at ")
 abc def="ghi" "jk" = 12 => abc,def="ghi","jk",=,12

Sorry for the edits, I was very imprecise when I first asked the question. Hopefully it is clearer now.

+4

c++ c regex qt

Gunther Piez Aug 21 '09 at 7:27


5 answers

What should happen to "a" b "c"?

Note that in the substring " b " the spaces are between quotation marks.

- edit -

I assume that a space is “between quotation marks” if it is preceded by an odd number of standard quotation marks (i.e. U+0022; I will ignore those funny Unicode “quotes”).

This means that you need the following regular expression: ^[^"]*("[^"]*"[^"]*)*"[^"]* [^"]*"[^"]*("[^"]*"[^"]*)*$

("[^"]*"[^"]*) represents a pair of quotes. ("[^"]*"[^"]*)* represents an even number of quotes, ("[^"]*"[^"]*)*" an odd number. Then comes the actual quoted part containing the space, followed by another odd number of quotes. The ^ and $ anchors are needed because you have to count every quote from the beginning of the string. This handles the substring " b " problem above, because it never looks at substrings in isolation. The price is that every character of your input must be matched against the whole string, which turns this into an O(N*N) split operation.

The reason you can do this with a regex is that only a finite amount of memory is needed. Effectively it is just one bit: "have I seen an odd or an even number of quotes so far?" You do not actually need to match up the individual pairs of quotes.

This is not the only possible interpretation, though. If you do include the “funny Unicode quotes”, which must be paired, you also need to deal with strings like ““double quoted””. This in turn means that you need a count of open “, which means you need unbounded storage, which in turn means that it is no longer a regular language, which means that you cannot use a regex. Q.E.D.

In any case, even if it is possible, you would still want a proper parser. The O(N*N) behavior of counting the quotes that precede each character is simply not funny. If you already know that there are X quotes preceding Str[N], it should be an O(1) operation to determine how many quotes precede Str[N+1], not O(N). The possible answers are just X or X+1!
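To make the "one bit of state" point concrete, here is a minimal sketch of a single-pass splitter in plain C++. The function name splitOutsideQuotes is hypothetical, it uses std::string rather than QString, and it assumes that runs of spaces should not produce empty tokens (QString::split keeps empty parts by default, so this is a deliberate simplification).

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Qt or the standard library):
// splits `input` at spaces outside double quotes in a single O(N) pass.
std::vector<std::string> splitOutsideQuotes(const std::string& input)
{
    std::vector<std::string> parts;
    std::string current;
    bool insideQuotes = false;  // the one bit: odd number of quotes seen so far?

    for (char c : input) {
        if (c == '"') {
            insideQuotes = !insideQuotes;  // X quotes becomes X + 1
            current += c;                  // quotes stay part of the token
        } else if (c == ' ' && !insideQuotes) {
            // Assumption: skip empty tokens produced by consecutive spaces.
            if (!current.empty()) {
                parts.push_back(current);
                current.clear();
            }
        } else {
            current += c;
        }
    }
    if (!current.empty())
        parts.push_back(current);
    return parts;
}
```

Each character is inspected exactly once, and the quote parity is updated in O(1) instead of being recounted from the start of the string.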

+4

MSalters Aug 21 '09 at 7:49


MSalters put me on the right track. The problem with his answer is that the regex he gives always matches the entire string and is therefore unsuitable for split(), but this can be remedied by using a lookahead. Assuming that the quotes are always paired (they really are), I can split at every space that is followed by an even number of quotes.

The regex, without C escapes and in single quotes, looks like

 ' (?=[^"]*("[^"]*"[^"]*)*$)'

In the source code, it finally looks like this (using Qt and C++)

 QString buf("Mary had \"a little lamb\""); // string we want to split
 QStringList splitted = buf.split( QRegExp(" (?=[^\"]*(\"[^\"]*\"[^\"]*)*$)") );

Simple, eh?

As for performance, the strings are parsed once at the start of the program; there are a few dozen of them, each less than one hundred characters long. I will test the runtime with long strings, just to be sure nothing bad happens ;-)

+4

Gunther Piez Aug 21 '09 at 15:41


If the quoting in the strings is simple (as in your examples), you can use alternation. This regex first looks for a simple quoted string; failing that, it matches a run of spaces.

 /(\"[^\"]*\"| +)/

In Perl, if you use a capturing group in the regex when you call split(), the function returns not only the elements but also the captured groups (in this case, our delimiters). If you then filter out the empty and space-only pieces, you get the desired list of elements. I don't know whether a similar strategy will work in C++, but the following Perl code does work:

 use strict;
 use warnings;

 while (<DATA>){
     chomp;
     my @elements = split /(\"[^\"]*\"| +)/, $_;
     @elements = grep {length and /[^ ]/} @elements;
     # Do stuff with @elements
 }

 __DATA__
 Mary had "a little lamb"
 1" 2 "3
 abc def="ghi" "jk" = 12
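On the question of whether a similar strategy works in C++: a rough equivalent can be sketched with std::regex, where std::sregex_token_iterator with the submatch list {-1, 0} returns both the text between matches and the matches themselves, much like Perl's split with a capturing group. The function name splitKeepingQuoted is hypothetical, and this uses the standard library rather than QRegExp.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: C++ analogue of the Perl split-with-capture approach.
std::vector<std::string> splitKeepingQuoted(const std::string& input)
{
    // A simple quoted string, or a run of spaces (same pattern as the Perl code).
    static const std::regex separator("\"[^\"]*\"| +");

    std::vector<std::string> elements;
    // -1 selects the text between matches, 0 selects the match itself.
    std::sregex_token_iterator it(input.begin(), input.end(), separator, {-1, 0});
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        const std::string piece = it->str();
        // Same filter as grep {length and /[^ ]/}: drop empty and space-only pieces.
        if (piece.find_first_not_of(' ') != std::string::npos)
            elements.push_back(piece);
    }
    return elements;
}
```

Like the Perl version, this treats every quoted string as its own element, so def="ghi" comes back as the two elements def= and "ghi", and 1" 2 "3 is broken into three pieces. It only gives the results asked for in the question when quoted strings are already delimited by spaces.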

+1

Fmc Aug 21 '09 at 13:27


The simplest regex solution: match both whitespace and quoted strings, then filter out the quoted strings afterwards.

 "[^"]*"|\s

-2

maykeye Aug 21 '09 at 7:54


Alan Moore · Accepted Answer · 2009-08-21T15:52:26+0000

(I know you just posted almost exactly the same answer yourself, but I can't bring myself to throw all this away. :-/)

Insofar as it is possible to solve your problem with a regex split operation, the regex will have to match an even number of quotes, as MSalters said. However, a split regex should match only the spaces you are splitting on, so the rest of the work has to be done in a lookahead. Here is what I would use:

 " +(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

If the text is well formed, a lookahead for an even number of quotes is sufficient to determine that the space just matched is not inside a quoted sequence. That is, lookbehinds are not necessary, which is good, because QRegExp does not seem to support them. Escaped quotes can be accommodated too, but the regex becomes quite a bit larger and uglier. But if you cannot be sure that the text is well formed, it is extremely unlikely that you will be able to solve your problem with split().

By the way, QRegExp does not implement POSIX regular expressions; if it did, it would not support lookaheads or lookbehinds. Instead, it falls into the loosely defined category of Perl-compatible regex flavors.
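For readers without Qt at hand, the same lookahead split can be sketched with std::regex, whose default ECMAScript grammar supports lookaheads. The function name splitWithLookahead is hypothetical, and returning the input unsplit when the quotes are unbalanced is an assumption made for this sketch, not something the answers above prescribe.

```cpp
#include <algorithm>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split at runs of spaces that are followed by an
// even number of double quotes (i.e. spaces outside quoted sequences).
std::vector<std::string> splitWithLookahead(const std::string& input)
{
    // The lookahead trick is only valid for well-formed text, so check
    // the quote count first. Assumption: unbalanced input is returned unsplit.
    if (std::count(input.begin(), input.end(), '"') % 2 != 0)
        return {input};

    // Same pattern as above, written as a raw string to avoid C escapes.
    static const std::regex separator(R"re( +(?=(?:[^"]*"[^"]*")*[^"]*$))re");

    std::vector<std::string> parts;
    // -1 selects the text between separator matches.
    std::sregex_token_iterator it(input.begin(), input.end(), separator, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it)
        parts.push_back(it->str());
    return parts;
}
```

The lookahead rescans the rest of the string for every candidate space, so this keeps the quadratic behavior discussed earlier; for strings of under a hundred characters that is unlikely to matter.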
