How to find all words appearing between `\word{}` in Bash?

Question

I have a file like this:

 This \word{is} some text.
 This is some \word{more text}.
 \word{This} is \word{yet} some more \word{text}.

I need to create a list of all the text that appears between \word{ and the corresponding closing brace }, for example:

 is
 more text
 This
 yet
 text

The opening and closing braces always appear on the same line; a \word{...} never spans several lines.
There are other curly braces in the document, but \word{} is never nested inside another \word{}.

How can I print a list of all the text that appears inside \word{}?

+4

bash

Village Jul 07 '13 at 6:47

10 answers

You seem to be processing a TeX file, so why not use TeX itself for this? Then you can be sure there will be no problems or side effects. For example, with

 \word {there's a space between \verb=\word= and the curly bracket}

it will still work! It will also work for multi-line content:

 \word{this is a multiline stuff
 \emph{and you can even add more groupings in it,}
 it'll still work fine!}

In the (La)TeX preamble, simply add:

 \newwrite\file
 \immediate\openout\file=output.txt
 \def\word#1{\immediate\write\file{#1}}

or use \newcommand if you are using LaTeX rather than plain TeX.

You can also put \immediate\write\file{#1} inside your existing definition of the \word macro. If you cannot edit that definition (for example, because it lives in a class or style file), you can wrap it instead:

 \let\oldword\word
 \def\word#1{\immediate\write\file{#1}\oldword{#1}}

Hope this helps!

+9

gniourf_gniourf Jul 11 '13 at 10:52

A pure bash solution that does not call any external utilities:

 while read -r x; do
     while [[ $x =~ \\word\{([^}]+)\} ]]; do
         echo "${BASH_REMATCH[1]}"
         # Drop everything up to and including the match just printed.
         x=${x#*"$BASH_REMATCH"}
     done
 done <infile

Input file:

 $ cat infile
 This \word{is} some text. {This \word{is}}some text.
 This is some \word{more text}.
 \word{This} is \word{yet} some more \word{text}.

Output:

 is
 is
 more text
 This
 yet
 text

The trick is the -r option of the bash builtin read: with it, \ is not treated as an escape character in the line that is read. The inner loop then repeats as long as the line still contains the \word{...} pattern. On each pass the captured inner text is printed, and the line is cut off up to and including the match.

For small files (1-2 MB) I would use this version because it needs very few resources. For large files, I suggest anubhava's grep solution with Perl-compatible regular expressions, since it reads the file much more efficiently!
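The loop described above can also be wrapped in a function. The following is only a sketch under a few assumptions: extract_words is a hypothetical name, the regex is kept in a variable (a common way to avoid quoting surprises with =~), and IFS= plus the extra test on $line are additions so that surrounding whitespace and a final line without a trailing newline are handled.

```bash
#!/usr/bin/env bash
# extract_words is a hypothetical helper name, not part of the answer above.
# It reads text on stdin and prints the contents of every \word{...}, one per line.
extract_words() {
    local line
    # ERE: a literal backslash, "word", "{", one or more non-"}" chars, "}".
    local re='\\word\{([^}]+)\}'
    # IFS= keeps leading/trailing whitespace; -r keeps backslashes literal.
    # "|| [[ -n $line ]]" also processes a last line that lacks a newline.
    while IFS= read -r line || [[ -n $line ]]; do
        while [[ $line =~ $re ]]; do
            printf '%s\n' "${BASH_REMATCH[1]}"
            # Drop everything up to and including the match just printed.
            line=${line#*"${BASH_REMATCH[0]}"}
        done
    done
}

extract_words <<'EOF'
This \word{is} some text.
This is some \word{more text}.
EOF
```

Run as written, this prints is and more text on separate lines. Quoting ${BASH_REMATCH[0]} inside the expansion makes bash treat the matched text literally rather than as a glob pattern.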

+4

Truey Jul 11 '13 at 10:39

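Since not all versions of grep support PCRE, here is a solution using only extended regular expressions. The pattern is single-quoted so that \\ reaches grep as a literal backslash, and [^}]+ stops each match at the first closing brace:

 grep -Eo '\\word\{[^}]+\}' file_name | sed -e 's/\\word{//' -e 's/}//'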
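A small sketch of how a greedy .+ and the bounded [^}]+ differ on a line with several \word{...} groups. The function names are hypothetical and exist only for this illustration; the sketch assumes a grep that supports -o (GNU and BSD grep do).

```bash
# Hypothetical helper names, used only for this illustration.
strip_wrapper() {
    # Remove the leading \word{ and the trailing } from each match.
    sed -e 's/^\\word{//' -e 's/}$//'
}

# Greedy: .+ runs on to the LAST } of the line.
greedy_words()  { grep -Eo '\\word\{.+\}'    | strip_wrapper; }

# Bounded: [^}]+ stops at the FIRST } after \word{.
bounded_words() { grep -Eo '\\word\{[^}]+\}' | strip_wrapper; }

line='\word{This} is \word{yet} some more \word{text}.'
printf '%s\n' "$line" | greedy_words
printf '%s\n' "$line" | bounded_words
```

The greedy variant returns one long match that still contains inner braces, while the bounded variant returns This, yet and text on separate lines.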

+3

user1613254 Jul 07 '13 at 7:13

 $ cat testfile
 This \word{is} some text.
 This is some \word{more text}.
 \word{This} is \word{yet} some more \word{text}.
 $ awk '$0 ~ /\\word{[^}]*}/ {
     nelts = split($0, arr, /\\word{/)
     # arr[1] is the text before the first \word{, so start at 2
     for (i = 2; i <= nelts; i++)
         if (arr[i] ~ /^[^}]*}/)
             print substr(arr[i], 1, index(arr[i], "}") - 1)
 }' testfile
 is
 more text
 This
 yet
 text

If the input were \word{\word{STRING}}, STRING would be printed. In other words, it also handles a nested \word{}. Sorry if this is not what you wanted.

+1

Chrono kitsune Jul 11 '13 at 5:17

Mix grep and sed:

 grep -Eo '\\word\{[^\{\}]+\}' | sed 's/\\word{//;s/}//'

For fun, I also wrote a pure bash version:

 while read -r l
 do
     n=${#l}
     ll="${l#*\\word{}"
     while [ $n -ne ${#ll} ]
     do
         echo "${ll%%\}*}"
         n=${#ll}
         ll="${ll#*\\word{}"
     done
 done

Not very clean, but it works on your example.

+1

Bentoy13 Jul 11 '13 at 9:57

Code for GNU sed :

 sed -nr ':b;s/(\\word\{[^}]+\})/\1\n/;s/.*\\word\{([^}]+)\}\n/\1\n/;T;P;D;tb' file

 $ cat file
 This \word{is} some text.
 This is some \word{more text}.
 \word{This} is \word{yet} some more \word{text}.
 {\word{This} is \word{yet} {some} more \word{text}.}

 $ sed -nr ':b;s/(\\word\{[^}]+\})/\1\n/;s/.*\\word\{([^}]+)\}\n/\1\n/;T;P;D;tb' file
 is
 more text
 This
 yet
 text
 This
 yet
 text

+1

captcha Jul 16 '13 at 8:15

awk was invented for text processing:

 $ awk 'sub(/.*\\word{/,"")' RS='}' file
 is
 more text
 This
 yet
 text
 is
 $ cat file
 This \word{is} some text.
 This is some \word{more text}.
 \word{This} is \word{yet} some more \word{text}.
 { This \word{is} some text }

+1

Ed morton Jul 17 '13 at 4:04

perl may also help:

 perl -nlE 'say "$_" for (m/\\word\{(.*?)\}/g);' < tex.txt

for this input:

 This{ \word{is}} some text.
 This is some \word{more text}.
 This is {some \word{aaa text}}
 This is {some \word{bbb text}
 This is some \word{ccc text}}
 This is some {\word{ddd text}}
 {\word{This} is \word{yet} some more \word{text}.}

prints:

 is
 more text
 aaa text
 bbb text
 ccc text
 ddd text
 This
 yet
 text

+1

kobame Jul 17 '13 at 18:00

With sed :

 sed 's/.*\\word{\([^}]*\)}.*/\1/g' input.txt

The expression above deletes everything except what is inside the braces. Because the leading .* is greedy, it keeps only the last \word{...} on each line, and lines without a match are printed unchanged. If in the future it turns out that you need to match across multiple lines, awk might be simpler:

 awk -F "\\word{" 'BEGIN { RS = "}" } { print $2 }' input.txt

This sets \word{ as the field separator and } as the record separator, so that $2 refers to what is inside the braces.
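To make the difference between one match per line and every match concrete, here is a hedged sketch. Both function names are hypothetical. The first reuses the sed expression from this answer with -n and the p flag so that lines without a match are suppressed. The second is an alternative, not the answer's method: a per-line awk loop built on match(), RSTART and RLENGTH.

```bash
# Hypothetical helper names, used only for this illustration.

# Prints only the LAST \word{...} of each matching line (the leading .* is greedy).
last_word_per_line() {
    sed -n 's/.*\\word{\([^}]*\)}.*/\1/p'
}

# Prints EVERY \word{...} of each line by repeatedly matching and cutting.
all_words() {
    awk '{
        while (match($0, /\\word\{[^}]*\}/)) {
            # Skip the 6 characters of the prefix and drop the closing brace.
            print substr($0, RSTART + 6, RLENGTH - 7)
            # Continue with the text that follows the match.
            $0 = substr($0, RSTART + RLENGTH)
        }
    }'
}

line='\word{This} is \word{yet} some more \word{text}.'
printf '%s\n' "$line" | last_word_per_line
printf '%s\n' "$line" | all_words
```

On this sample line the sed variant prints only text, while the awk loop prints This, yet and text on separate lines.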

+1

gbrener Jul 17 '13 at 9:38

anubhava · Accepted Answer · 2013-07-07T06:56:32+0000

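grep with PCRE support (the -P option) will do the job. The pattern is single-quoted so that \\ reaches grep intact and matches a literal backslash:

 grep -Po '(?<=\\word\{)[^}]*(?=\})' file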
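A note on quoting, with a sketch: inside double quotes the shell reduces \\ to a single backslash, so PCRE would see \w (any word character) instead of a literal backslash, and a string such as sword{x} could match as well. The variant below single-quotes the pattern and uses \K, a PCRE feature that discards everything matched so far, as an alternative to the lookbehind. words_pcre is a hypothetical name, and the sketch assumes a grep built with -P support.

```bash
# words_pcre is a hypothetical helper name; it needs a grep built with PCRE (-P).
# With no arguments it reads stdin, otherwise it reads the named files.
words_pcre() {
    # \\word\{  literal \word{
    # \K        forget what was matched so far, so -o prints only what follows
    # [^}]*     the text up to the closing brace
    # (?=\})    require a closing brace without including it in the output
    grep -oP '\\word\{\K[^}]*(?=\})' "$@"
}

printf '%s\n' 'a sword{no} and \word{yes} and \word{more text}' | words_pcre
```

This prints yes and more text; sword{no} is skipped because it is not preceded by a backslash.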

Live Demo: http://ideone.com/uzEzBF
