Matching a specific substring to regular expressions using awk

I am dealing with specific file names and must extract information from them.

The file name structure is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"

with RANDOMSTR row with max. 22 characters and which may contain a substring (or not) with the format "-W [0-9]. [0-9] {2}. [0-9] {3}". This substring also has a unique launch feature with -W.

The information I need to extract is the RANDOMSTR substring without this optional substring.

I want to implement this in a bash script, and so far the best option I've found is to use gawk with a regex. My best attempt still fails:

gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING-W0.40+045

Expected results:

gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
SOME-STRING
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING

How can I get the desired effect.

Thank.

+3
4

look-arounds, , awk/gawk , grep -P .

$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'
$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
SOME-STRING
$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"
OTHER-STRING
+2

grep- , OP , -P Linux. awk.

$ awk -F_ '{sub(/(-W[0-9].[0-9]+.[0-9]+)?\.raw\.gz$/,"",$NF); print $NF}' <<EOT
> 20100613_M4_28007834.005_F_SOME-STRING.raw.gz
> 20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
> EOT
SOME-STRING
OTHER-STRING
$ 

, "20100613_M4_28007834.005_F_OTHER-STRING-W0_40 + 045.raw.gz". , -W , , - :

$ awk -F_ '{sub(/(-W[0-9.+]+)?\.raw\.gz$/,"",$NF); print $NF}'
+1

, (.*) (-W.*)? . - . , regex-fu .

, , , .raw.gz -W*.

str="20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
echo ${str%.raw.gz}  | # remove trailing .raw.gz
     sed 's/-W.*$//' | # remove trainling -W.*, if any
     sed -nr 's/[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)/\1/p'

sed, gawk/awk.

0

, :

sed -E -e 's/^.{27}(.*).raw.gz$/\1/' << FOO | sed -E -e 's/-W[0-9.]+\+[0-9.]+$//'
20100613_M4_28007834.005_F_SOME-STRING.raw.gz
20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
FOO
0

Source: https://habr.com/ru/post/1780423/


All Articles