Regex to split a string separated by a character | when not enclosed in double quotes

I need a regular expression to count the number of columns in a pipe-delimited row in Java. Column data will always be enclosed in double quotes, or the column will be empty.

eg:

"1234"|"Name"||"Some description with ||| in it"|"Last Column" 

The above should be considered 5 columns, including one empty column after the Name column.

thanks

+6
3 answers

Here is one way to do this:

    String input = "\"1234\"|\"Name\"||\"Some description with ||| in it\"|\"Last Column\"";
    //             \_______/ \______/\/\_________________________________/ \_____________/
    //                 1        2    3                  4                         5

    int cols = input.replaceAll("\"[^\"]*\"", "")  // remove "..."
                    .replaceAll("[^|]", "")        // remove anything else than |
                    .length() + 1;                 // Count the remaining |, add 1

    System.out.println(cols);  // 5

IMO this is not very robust, though. I would not recommend using regular expressions if you plan on handling escaped quotes, for example.
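For comparison, here is a rough sketch of what a non-regex version could look like: a single pass over the characters that tracks whether we are inside quotes. It assumes that a backslash inside a quoted column escapes the next character, which the question does not specify. The names `ColumnCounter` and `countColumns` are made up for this example.

```java
class ColumnCounter {
    // Counts pipe-separated columns; pipes inside double quotes are ignored.
    // Assumption (not stated in the question): inside quotes, a backslash
    // escapes the character that follows it.
    static int countColumns(String line) {
        int cols = 1;                 // a line with no delimiter has one column
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes && c == '\\') {
                i++;                  // skip the escaped character
            } else if (c == '"') {
                inQuotes = !inQuotes; // opening or closing quote
            } else if (c == '|' && !inQuotes) {
                cols++;               // a real delimiter
            }
        }
        return cols;
    }

    public static void main(String[] args) {
        String input = "\"1234\"|\"Name\"||\"Some description with ||| in it\"|\"Last Column\"";
        System.out.println(countColumns(input)); // 5
    }
}
```

It is longer than the regex, but changing the escaping rule later means touching one branch instead of rewriting a pattern.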

+8

A slightly improved version of the expression in aioobe's answer:

    int cols = input.replaceAll("\"(?:[^\"\\\\]+|\\\\.)*\"|[^|]+", "")
                    .length() + 1;

Handles escapes in quotation marks and uses a single expression to remove everything except delimiters.
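A minimal sketch to try this single-expression approach, assuming backslash is the escape character inside quotes. Note that in Java source each regex backslash has to be written twice, so a regex `\\` (one literal backslash) becomes `\\\\` in the string literal. The class, constant and method names here are only for illustration.

```java
import java.util.regex.Pattern;

class SingleRegexCount {
    // Regex as the engine sees it:  "(?:[^"\\]+|\\.)*"|[^|]+
    // i.e. either a quoted field (backslash escapes allowed inside)
    // or a run of non-pipe characters. Removing both leaves only delimiters.
    private static final Pattern NON_DELIMITERS =
            Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"|[^|]+");

    static int countColumns(String line) {
        return NON_DELIMITERS.matcher(line).replaceAll("").length() + 1;
    }

    public static void main(String[] args) {
        String plain = "\"1234\"|\"Name\"||\"Some description with ||| in it\"|\"Last Column\"";
        // Field text:  "a \"quoted\" | pipe"|"b"
        String escaped = "\"a \\\"quoted\\\" | pipe\"|\"b\"";
        System.out.println(countColumns(plain));   // 5
        System.out.println(countColumns(escaped)); // 2
    }
}
```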

+2

Here is a regex I used a while ago that also deals with escaped quotes AND escaped delimiters. It probably exceeds your requirements (column counting), but perhaps it will help you or someone else in the future with their parsing.

    (?<=^|(?<!\\)\|)(\".*?(?<=[^\\])\"|.*?(?<!\\(?=\|))(?=")?|)(?=\||$)

And broken down as:

    (?<=^|(?<!\\)\|)          // look behind to make sure the token starts with the start anchor (first token) or a delimiter (but not an escaped delimiter)
    (                         // start of capture group 1
      \".*?(?<=[^\\])\"       // a token bounded by quotes
      |                       // OR
      .*?(?<!\\(?=\|))(?=")?  // a token not bounded by quotes, any characters up to the delimiter (unless escaped)
      |                       // OR
                              // empty token
    )                         // end of capture group 1
    (?=\||$)                  // look ahead to make sure the token is followed by either a delimiter or the end anchor (last token)

When you actually use it, it'll have to be escaped as:

    (?<=^|(?<!\\\\)\\|)(\\\".*?(?<=[^\\\\])\\\"|.*?(?<!\\\\(?=\\|))(?=\")?|)(?=\\||$)

It is complicated, but there is a method to this madness: other regular expressions that I found by googling would fail if a column at the beginning or end of the line was empty, if the delimiting quotes were in odd places, if the line or column started or ended with an escaped delimiter, and in a bunch of other edge cases.

The fact that you use the pipe as a delimiter makes this regular expression even more difficult to read and understand. The tip is: where you see a bare pipe "|", it is the alternation (OR) operator of the regex, and where it is escaped as "\|", it is your delimiter.
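In case it helps, here is a small sketch of how the escaped form could be dropped into Java to pull the tokens out with `Matcher.find()`. The pattern string is the one given above, unchanged; the class and method names (`PipeTokenizer`, `tokens`) are just placeholders. Quoted tokens come back with their surrounding quotes still attached, so strip them afterwards if you need the bare values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeTokenizer {
    // The escaped pattern from above, copied as-is into a Java string literal.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<=^|(?<!\\\\)\\|)(\\\".*?(?<=[^\\\\])\\\"|.*?(?<!\\\\(?=\\|))(?=\")?|)(?=\\||$)");

    // Returns capture group 1 of every match, in order.
    // An empty column shows up as an empty string.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "\"1234\"|\"Name\"||\"Some description with ||| in it\"|\"Last Column\"";
        List<String> cols = tokens(input);
        System.out.println(cols.size()); // 5
        for (String col : cols) {
            System.out.println("[" + col + "]");
        }
    }
}
```

For plain column counting, `tokens(input).size()` is enough.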

+1

Source: https://habr.com/ru/post/917767/

