How to find all words appearing between `\word{}` in Bash?

I have a file like this:

This \word{is} some text. This is some \word{more text}. \word{This} is \word{yet} some more \word{text}. 

I need to create a list of all the text that appears between \word{ and the corresponding closing brace }, for example:

 is more text This yet text 
  • The opening and closing curly braces are always on the same line; they never span several lines.
  • There are other curly braces in the document, but a \word{} is never nested inside another \word{}.

How can I print a list of all the text that appears inside \word{}?

+4
10 answers

grep with PCRE support will do the job:

 grep -Po '(?<=\\word\{)[^}]*(?=\})' file 

Live Demo: http://ideone.com/uzEzBF
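A related variant, offered only as a sketch: PCRE's `\K` discards the part of the match seen so far, which avoids the lookbehind. The sample text is the one from the question; the names `sample`, `extract_words` and `result` are illustrative, and `-P` assumes a grep built with PCRE support (such as GNU grep).

```shell
# Sample input from the question; the variable name is illustrative only.
sample='This \word{is} some text. This is some \word{more text}. \word{This} is \word{yet} some more \word{text}.'

# \K drops the literal "\word{" from the reported match, so -o prints
# only the text up to (but not including) the closing brace.
extract_words() {
  grep -Po '\\word\{\K[^}]*'
}

result=$(printf '%s\n' "$sample" | extract_words)
printf '%s\n' "$result"
```

The single quotes matter here: inside double quotes the shell would collapse `\\` to `\` before grep ever sees the pattern.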

+7

You seem to be processing a TeX file, so why not use TeX itself for this? Then you can be sure there will be no problems or side effects. For example, with

 \word {there is a space between \verb=\word= and the curly bracket} 

it will work anyway! It will also still work for multi-line content:

 \word{this is a multiline stuff \emph{and you can even add more groupings in it,} it'll still work fine!} 

In the (La)TeX preamble, simply add:

 \newwrite\file
 \immediate\openout\file=output.txt
 \def\word#1{\immediate\write\file{#1}}

or use \newcommand instead of \def if you are using LaTeX rather than plain TeX.

You can also put \immediate\write\file{#1} inside your own definition of the \word macro. If you cannot edit that definition (for example, because it lives in a class or style file), you can wrap it:

 \let\oldword\word
 \def\word#1{\immediate\write\file{#1}\oldword{#1}}

Hope this helps!

+9

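A pure bash solution that does not call any external utilities:

 re='\\word\{([^}]+)}'
 while IFS= read -r x; do
   # Print the first remaining match, then cut the line just past it.
   while [[ $x =~ $re ]]; do
     printf '%s\n' "${BASH_REMATCH[1]}"
     x=${x#*"${BASH_REMATCH[0]}"}
   done
 done <infile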

Input file:

 $ cat infile
 This \word{is} some text.
 {This \word{is}}some text.
 This is some \word{more text}.
 \word{This} is \word{yet} some more \word{text}.

Output:

 is
 is
 more text
 This
 yet
 text

The trick is the -r option of the read builtin: it stops read from treating \ as an escape character in the input line. The inner loop then repeats for as long as the line still contains a \word{...} pattern; on each pass the captured inner text is printed and the line is cut off just past that match.
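To see why -r matters, here is a small sketch that reads the same line with and without it. The sample line and the variable names `line`, `without_r` and `with_r` are made up for illustration.

```shell
# Illustrative sample line; not taken from the original input file.
line='a \word{b}'

# Without -r, read treats the backslash as an escape character and drops it.
without_r=$(printf '%s\n' "$line" | { read x; printf '%s' "$x"; })

# With -r, the backslash survives, so a \word{...} pattern can still match.
with_r=$(printf '%s\n' "$line" | { read -r x; printf '%s' "$x"; })

printf 'without -r: %s\n' "$without_r"
printf 'with -r:    %s\n' "$with_r"
```

Without -r the line comes back as `a word{b}`, which no longer contains the backslash the regular expression looks for.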

For small files (1-2 MB) I would use this version because it needs very few resources. For large files, though, I suggest anubhava's solution, since it reads the file much more efficiently!

+4

Since not all grep versions support PCRE, here is a solution that uses only extended regular expressions:

 grep -Eo '\\word\{[^}]+}' file_name | sed -e 's/\\word{//' -e 's/}//'

+3
 $ cat testfile
 This \word{is} some text.
 This is some \word{more text}.
 \word{This} is \word{yet} some more \word{text}.
 $ awk '$0 ~ /\\word\{[^}]*}/ {
     # arr[1] is the text before the first \word{, so start at 2
     nelts = split($0, arr, /\\word\{/)
     for (i = 2; i <= nelts; i++)
         if (arr[i] ~ /^[^}]*}/)
             print substr(arr[i], 1, index(arr[i], "}") - 1)
 }' testfile
 is
 more text
 This
 yet
 text

If the input were \word{\word{STRING}}, STRING would be printed. In other words, nested occurrences are resolved down to the innermost one. Sorry if this is not what you wanted.

+1

Mix grep and sed:

 grep -Eo '\\word\{[^{}]+\}' file | sed 's/\\word{//;s/}//' 

For fun, I also put together a pure bash version (it reads from standard input):

 while read -r l
 do
   n=${#l}
   ll="${l#*\\word{}"
   while [ $n -ne ${#ll} ]
   do
     echo "${ll%%\}*}"
     n=${#ll}
     ll="${ll#*\\word{}"
   done
 done

Not very clean, but it works on your example

+1

Code for GNU sed:

 sed -nr ':b;s/(\\word\{[^}]+\})/\1\n/;s/.*\\word\{([^}]+)\}\n/\1\n/;T;P;D;tb' file 

 $ cat file
 This \word{is} some text.
 This is some \word{more text}.
 \word{This} is \word{yet} some more \word{text}.
 {\word{This} is \word{yet} {some} more \word{text}.}

 $ sed -nr ':b;s/(\\word\{[^}]+\})/\1\n/;s/.*\\word\{([^}]+)\}\n/\1\n/;T;P;D;tb' file
 is
 more text
 This
 yet
 text
 This
 yet
 text
+1

awk was made for text processing:

 $ awk 'sub(/.*\\word{/,"")' RS='}' file
 is
 more text
 This
 yet
 text
 is
 $ cat file
 This \word{is} some text.
 This is some \word{more text}.
 \word{This} is \word{yet} some more \word{text}.
 { This \word{is} some text }
+1

perl may also help:

 perl -nlE 'say "$_" for (m/\\word\{(.*?)\}/g);' < tex.txt 

for this input:

 This{ \word{is}} some text. This is some \word{more text}. This is {some \word{aaa text}} This is {some \word{bbb text} This is some \word{ccc text}} This is some {\word{ddd text}} {\word{This} is \word{yet} some more \word{text}.} 

prints:

 is
 more text
 aaa text
 bbb text
 ccc text
 ddd text
 This
 yet
 text
+1

With sed:

 sed -n 's/.*\\word{\([^}]*\)}.*/\1/p' input.txt 

The expression above deletes everything except what is inside the braces. Because the leading .* is greedy, it keeps only the last \word{} on each line. If it later turns out that you need several matches per line, or matches that span multiple lines, awk might be simpler:

 awk 'BEGIN { FS = "\\\\word[{]"; RS = "}" } NF > 1 { print $2 }' input.txt 

This sets \word{ as the field separator and } as the record separator, so that $2 holds the text inside the braces; NF > 1 skips records that contain no \word{.
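A minimal sketch of the record-separator idea on multi-line input. The sample text, in which one \word{...} spans two lines, is hypothetical, as are the variable names; FS is set inside BEGIN with doubled backslashes so that the separator is the literal text \word{.

```shell
# Hypothetical input in which one \word{...} spans two lines.
input='This \word{is} some \word{more
text}.'

# RS = "}" ends a record at every closing brace, so a newline inside the
# braces is just an ordinary character. NF > 1 skips records that contain
# no \word{ at all.
result=$(printf '%s\n' "$input" |
  awk 'BEGIN { FS = "\\\\word[{]"; RS = "}" } NF > 1 { print $2 }')
printf '%s\n' "$result"
```

The second match is printed with its embedded newline intact, so it appears as two output lines.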

+1

Source: https://habr.com/ru/post/1490105/

