How to use awk to extract data within nested delimiters using non-greedy regular expressions

Question

This question comes up repeatedly in many forms with many different multi-character delimiters, so IMHO it deserves a canonical answer.

Given an input file such as:

<foo> .. 1 <foo> .. a<2 .. </foo> .. </foo> <foo> .. @{<>}@ <foo> .. 4 .. </foo> .. </foo> <foo> .. 5 .. </foo>

how do you extract the text between the nested start (<foo>) and end (</foo>) delimiters using non-greedy matching with awk?

Desired output (in any order):

<foo> .. a<2 .. </foo>
<foo> .. 1  .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@  .. </foo>
<foo> .. 5 .. </foo>

Note that the start and end delimiters can be any multi-character strings, and the text between them can be anything except those strings, including characters that are part of them, such as the < and > characters in this example.

awk

Ed Morton, Nov 09 '16 at 17:14

My (current) solution works through the input from the front, so the blocks come out in this order:

<foo> .. 1                   # second
  <foo> .. a<2 .. </foo> ..  # first in my approach
</foo> 
<foo> .. @{<>}@              # fourth
  <foo> .. 4 .. </foo> ..    # third
</foo> 
<foo> .. 5 .. </foo>         # fifth

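The program fills two arrays with a split() call: arr holds the text between the tags and seps holds the tags themselves.

This originally required GNU awk (for split() with the fourth, seps, argument).

EDIT: For systems without GNU awk, I added the function gsplit(), a basic replacement for GNU awk's split.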

$ cat program.awk
{ data=data $0 }                         # append all records to one var
END {
    n=gsplit(data, arr, "</?foo>", seps) # split by every tag
    for(i=1;i<n;i++) {                   # n pieces have n-1 tags; iterate from front to back
        if(seps[i]=="<foo>")             # if element is an opening tag
            stack[++j]=seps[i] arr[i+1]  # store tag and wait for closing tag
        else {
            stack[j]=stack[j] (seps[i]==prev ? arr[i] : "")
            print stack[j--] seps[i] 
        } 
        prev = seps[i]
    }
}

# basic replacement for GNU awk's 4-argument split()
function gsplit(str, arr, pat, seps,    i) {
    delete arr; delete seps; i=0
    while(match(str, pat)) {
        arr[++i]=substr(str,1,(RSTART-1))
        seps[i]=substr(str,RSTART,RLENGTH)
        str=substr(str,(RSTART+RLENGTH))
    }
    arr[++i]=substr(str,1)
    return i
}

Running it:

$ awk -f program.awk file
<foo> .. a<2 .. </foo>
<foo> .. 1  .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@  .. </foo>
<foo> .. 5 .. </foo>

James Brown 09 . '16 22:07

Ed Morton · Accepted Answer · 2016-11-09T17:26:03+0000

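The main challenge is that awk only supports greedy matching, so <foo>.*</foo> matches up to the last </foo> on the line instead of the first </foo>. A common workaround is to convert the start and end strings to single characters so that a negated bracket expression of the form x[^xy]*y can be used, where x and y are the start/end characters, but how do you choose characters that are safe to use? Here is a robust way to do it: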

$ cat nonGreedy.awk
{
    $0 = encode($0)
    while ( match($0,/({[^{}]*})/) ) {
        print decode(substr($0,RSTART,RLENGTH))
        $0 = substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
    }
}
function encode(str) {
    gsub(/@/,"@A",str)
    gsub(/{/,"@B",str); gsub(/}/,"@C",str)
    gsub(/<foo>/,"{",str); gsub(/<\/foo>/,"}",str)
    return str
}
function decode(str) {
    gsub(/}/,"</foo>",str); gsub(/{/,"<foo>",str)
    gsub(/@C/,"}",str); gsub(/@B/,"{",str)
    gsub(/@A/,"@",str)
    return str
}

$ awk -f nonGreedy.awk file
<foo> .. a<2 .. </foo>
<foo> .. 1  .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@  .. </foo>
<foo> .. 5 .. </foo>

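This works by picking a character that cannot be present JUST IN THE START/END STRINGS (note that it may appear elsewhere in the input, just not in those strings), in this case @, and appending an A after every occurrence of it in the input. At this point every @A represents an @, and no @ in the input is followed by anything else, such as B.

Next, pick 2 more characters to represent the start/end strings, in this case { and }, and convert any existing occurrences of them to @-prefixed strings, here @B and @C. At this point every @B represents a {, every @C represents a }, and the text contains no { or } characters.

Now replace every start string, <foo>, with the character chosen to represent it, {, and every </foo> with }. At this point every { represents <foo> and every } represents </foo>, so {[^{}]*} works as the equivalent of a non-greedy <foo>.*</foo>.

After each match is found, undo the substitutions in reverse order (decode), so that every { becomes <foo> again, every @B becomes {, every @A becomes @, and so on, which restores the original text.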
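
The encoding steps can be checked with a small sketch. It contrasts a greedy match on the raw line with the bracket-expression match on the encoded line, and confirms that decode() undoes encode(). The sample line, the demo wrapper function, and the use of [{] and [}] in place of bare braces (to sidestep awks that may treat braces as interval operators) are assumptions made for illustration, not part of the original script.

```shell
#!/bin/sh
# Sketch: greedy match on raw text vs. bracket-expression match on encoded text.
# The sample input line is an illustrative assumption.
demo() {
    printf '%s\n' '<foo> a <foo> b{1}@ </foo> c </foo> <foo> d </foo>' |
    awk '
    function encode(str) {
        gsub(/@/, "@A", str)                            # free up "@" as an escape prefix
        gsub(/[{]/, "@B", str); gsub(/[}]/, "@C", str)  # free up "{" and "}"
        gsub(/<foo>/, "{", str); gsub(/<\/foo>/, "}", str)
        return str
    }
    function decode(str) {
        gsub(/[}]/, "</foo>", str); gsub(/[{]/, "<foo>", str)
        gsub(/@C/, "}", str); gsub(/@B/, "{", str)
        gsub(/@A/, "@", str)
        return str
    }
    {
        orig = $0
        # Greedy: runs from the first <foo> to the LAST </foo> on the line.
        if (match(orig, /<foo>.*<\/foo>/))
            print "greedy:    " substr(orig, RSTART, RLENGTH)
        enc = encode(orig)
        print "encoded:   " enc
        # On the encoded text the negated bracket expression cannot cross a tag.
        if (match(enc, /[{][^{}]*[}]/))
            print "innermost: " decode(substr(enc, RSTART, RLENGTH))
        print "roundtrip: " (decode(enc) == orig ? "ok" : "MISMATCH")
    }'
}
demo
```

The greedy line comes back as the whole input, while the encoded match yields only the innermost block, <foo> b{1}@ </foo>, with the literal braces and @ restored by decode().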

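The above works in any awk. If your start/end strings contain RE metacharacters, you would have to escape them or use a while(index(substr())) loop instead of gsub() to replace them.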
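
The remark about RE metacharacters can be illustrated with a literal-string replacement helper built on index() and substr(). The function name replace_literal, the literal_demo wrapper, and the sample delimiters [[ and ]] are hypothetical, chosen only because [ is a regexp metacharacter; this is a sketch of the idea, not code from the original answer.

```shell
#!/bin/sh
# Sketch: replace every literal occurrence of one string with another,
# without treating the search string as a regexp.
literal_demo() {
    printf '%s\n' '[[ x [[ y ]] z ]]' |
    awk '
    # Hypothetical helper: index() finds "old" as a plain string, so regexp
    # metacharacters in "old" need no escaping. "old" must not be empty.
    function replace_literal(str, old, new,    out, pos) {
        out = ""
        while ((pos = index(str, old)) > 0) {
            out = out substr(str, 1, pos - 1) new   # keep text before the hit
            str = substr(str, pos + length(old))    # continue after the hit
        }
        return out str                              # append the unmatched tail
    }
    {
        s = replace_literal($0, "[[", "{")
        s = replace_literal(s, "]]", "}")
        print s
    }'
}
literal_demo
```

In encode() and decode(), calls of this kind would stand in for the gsub() calls that handle the start/end strings, leaving the single-character substitutions as they are.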

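Note that if you are using gawk and the delimiters are not nested, you can combine the 2 functions above with FPAT, and the main body of the script reduces to: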

BEGIN { FPAT="{[^{}]*}" }
{
    $0 = encode($0)
    for (i=1; i<=NF; i++) {
        print decode($i)
    }
}

See also fooobar.com/questions/1660332/....
