How to use awk to extract data within nested delimiters using non-greedy regular expressions

This question comes up repeatedly in many forms with many different multi-character delimiters, so IMHO it is worth a canonical answer.

Given an input file like:

<foo> .. 1 <foo> .. a<2 .. </foo> .. </foo> <foo> .. @{<>}@ <foo> .. 4 .. </foo> .. </foo> <foo> .. 5 .. </foo>

how do you extract the text between the nested start (<foo>) and end (</foo>) delimiters using non-greedy matching with awk?

Desired output (in any order):

<foo> .. a<2 .. </foo>
<foo> .. 1  .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@  .. </foo>
<foo> .. 5 .. </foo>

Note that the start or end can be any multi-character string, and the text between them can be anything other than those strings, including characters that are part of those strings, such as the < or > characters in this example.


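The main challenge is that awk only supports greedy matching, so a regexp such as <foo>.*</foo> runs to the last </foo> on the line instead of stopping at the first </foo>. If the start and end were single characters you could use a negated bracket expression such as x[^xy]*y, where x and y are the start/end characters. So how do you get the same effect with multi-character strings? Like this: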

$ cat nonGreedy.awk
{
    $0 = encode($0)
    while ( match($0,/({[^{}]*})/) ) {
        print decode(substr($0,RSTART,RLENGTH))
        $0 = substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
    }
}
function encode(str) {
    gsub(/@/,"@A",str)
    gsub(/{/,"@B",str); gsub(/}/,"@C",str)
    gsub(/<foo>/,"{",str); gsub(/<\/foo>/,"}",str)
    return str
}
function decode(str) {
    gsub(/}/,"</foo>",str); gsub(/{/,"<foo>",str)
    gsub(/@C/,"}",str); gsub(/@B/,"{",str)
    gsub(/@A/,"@",str)
    return str
}

$ awk -f nonGreedy.awk file
<foo> .. a<2 .. </foo>
<foo> .. 1  .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@  .. </foo>
<foo> .. 5 .. </foo>

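The above works by picking a character that cannot be present JUST IN THE START/END STRINGS (it does not need to be absent from the rest of the input), in this case @, and appending an A to every occurrence of it in the input. At that point every @A represents an @, and no @ followed by any other character, such as @B, can occur in the input.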

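Next pick 2 more characters to stand in for the start/end strings, in this case { and }, and convert any existing occurrences of them to @-prefixed strings, in this case @B and @C. At that point every @B represents a {, every @C represents a }, and the input contains no { or }.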

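Now that the input contains no { or }, convert every <foo> to { and every </foo> to }, so that the regexp {[^{}]*} can be used in place of a non-greedy <foo>.*</foo>.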
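To see the difference between the two regexps on a small case, here is a minimal sketch. The function name greedy_demo and the sample line are made up for illustration, and the sample deliberately contains no @, { or }, so the escaping steps are skipped here.

```sh
# Compare a greedy multi-character match with a single-character bracket match.
greedy_demo() {
    printf '%s\n' '<foo> a </foo> b <foo> c </foo>' |
    awk '{
        # Greedy: .* runs to the LAST </foo> on the line.
        if (match($0, /<foo>.*<\/foo>/))
            print "greedy:  " substr($0, RSTART, RLENGTH)

        # Stand-in delimiters: this sample has no { or }, so no escaping is needed.
        s = $0
        gsub(/<foo>/, "{", s)
        gsub(/<\/foo>/, "}", s)
        # The negated bracket expression cannot cross another { or }.
        if (match(s, /[{][^{}]*[}]/))
            print "bracket: " substr(s, RSTART, RLENGTH)
    }'
}
greedy_demo
# prints:
# greedy:  <foo> a </foo> b <foo> c </foo>
# bracket: { a }
```

The braces are written as [{] and [}] in this sketch only to avoid any ambiguity with interval expressions in awks that support them.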

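Once a block has been extracted, undo the substitutions in the reverse of the order in which they were applied (the order matters): { back to <foo> and } back to </foo>, then @B back to { and @C back to }, then @A back to @, which restores the original text.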
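One way to convince yourself that the substitutions are reversible is a round-trip check on input that already contains @, @B, { and }. This is only a sketch: round_trip is a hypothetical wrapper name, the sample line is invented, and encode() and decode() are the functions from the script above with the braces written as bracket expressions.

```sh
# Encode each input line, print the encoded form, then check decode() restores it.
round_trip() {
    awk '
    function encode(str) {
        gsub(/@/, "@A", str)                                   # escape the escape character first
        gsub(/[{]/, "@B", str); gsub(/[}]/, "@C", str)         # free up { and }
        gsub(/<foo>/, "{", str); gsub(/<\/foo>/, "}", str)     # tags become single characters
        return str
    }
    function decode(str) {
        gsub(/[}]/, "</foo>", str); gsub(/[{]/, "<foo>", str)  # undo the last step first
        gsub(/@C/, "}", str); gsub(/@B/, "{", str)
        gsub(/@A/, "@", str)                                   # undo the first step last
        return str
    }
    {
        e = encode($0)
        print e
        print (decode(e) == $0 ? "round trip ok" : "round trip FAILED")
    }'
}
printf '%s\n' '<foo> @B {x} @ </foo>' | round_trip
# prints:
# { @AB @Bx@C @A }
# round trip ok
```

Note how the literal @B in the input becomes @AB, which decode() can never confuse with the @B that stands for {.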

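The above will work in any awk. If your start/end strings contain RE metacharacters, you need to escape them in the regexps, or do the replacements with a while(index(substr())) loop instead of gsub().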
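A literal replacement along those lines could look roughly like the sketch below. The names replace_literal, replace_all, from and to are all made up for this example, and the delimiter <a.b> is chosen only because its . would be a metacharacter in a regexp.

```sh
# usage: replace_literal OLD NEW   (filters stdin; OLD is a plain string, not a regexp)
# Caveat: awk -v interprets backslash escapes in the assigned values.
replace_literal() {
    awk -v from="$1" -v to="$2" '
    # Replace every literal occurrence of old in str with new using index()/substr().
    function replace_all(str, old, new,    out, pos) {
        if (old == "") return str              # nothing to search for
        out = ""
        while ((pos = index(str, old)) > 0) {
            out = out substr(str, 1, pos - 1) new
            str = substr(str, pos + length(old))
        }
        return out str                         # append whatever follows the last match
    }
    { print replace_all($0, from, to) }'
}
printf '%s\n' 'x <a.b> y <axb>' | replace_literal '<a.b>' '{'
# prints: x { y <axb>
```

With gsub(/<a.b>/, "{") the second token <axb> would have been replaced too, which is the kind of surprise the literal loop avoids.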

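If you are using gawk and the start/end strings are not nested, FPAT lets you drop the while loop. With the same encode() and decode() functions as above, the main part of the script becomes the following (on nested input such as the sample it prints only the innermost blocks):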

BEGIN { FPAT="{[^{}]*}" }
{
    $0 = encode($0)
    for (i=1; i<=NF; i++) {
        print decode($i)
    }
}

See also fooobar.com/questions/1660332/....


My approach prints the blocks in this order:

<foo> .. 1                   # second
  <foo> .. a<2 .. </foo> ..  # first in my approach
</foo> 
<foo> .. @{<>}@              # fourth
  <foo> .. 4 .. </foo> ..    # third
</foo> 
<foo> .. 5 .. </foo>         # fifth

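The data is split into the arrays arr (the text between tags) and seps (the tags themselves) using split(), and the blocks are then assembled on a stack.

It requires GNU awk (for the 4-argument split).

EDIT: To remove the dependency on GNU awk, I added the function gsplit(), a basic replacement for the GNU awk split.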

$ cat program.awk
{ data=data $0 }                         # append all records to one var
END {
    n=gsplit(data, arr, "</?foo>", seps) # split by every tag
    for(i=1;i<n;i++) {                   # seps has n-1 entries; i<=n would print a stray empty line
        if(seps[i]=="<foo>")             # if element opening tag
            stack[++j]=seps[i] arr[i+1]  # store tag and wait for closing tag
        else {
            stack[j]=stack[j] (seps[i]==prev ? arr[i] : "")
            print stack[j--] seps[i] 
        } 
        prev = seps[i]
    }
}

# elementary gnu awk split compatible replacement
function gsplit(str, arr, pat, seps,    i) {
    delete arr; delete seps; i=0
    while(match(str, pat)) {
        arr[++i]=substr(str,1,(RSTART-1))
        seps[i]=substr(str,RSTART,RLENGTH)
        str=substr(str,(RSTART+RLENGTH))
    }
    arr[++i]=substr(str,1)
    return i
}

Running it:

$ awk -f program.awk file
<foo> .. a<2 .. </foo>
<foo> .. 1  .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@  .. </foo>
<foo> .. 5 .. </foo>

Source: https://habr.com/ru/post/1660329/

