Sed - include newline in pattern

Question

I'm still a noob at shell scripting, but I'm trying. Below is a partially working shell script that is supposed to remove all JS from *.htm documents by matching tags and deleting their enclosed text, e.g. <script src="">, <script></script> and <script type="text/javascript">.

 find $1 -name "*.htm" > ./patterns
 for p in $(cat ./patterns)
 do
     sed -e "s/<script.*[.>]//g" $p #> tmp.htm ; mv tmp.htm $p
 done

The problem with this script is that, since sed reads its input one line at a time, it does not work as expected across newlines. For example:

 <script>
 //Foo
 </script>

The script removes the first script tag but leaves "//Foo" and the closing tag behind, which is not what I want.

Is there a way to match newlines in my regex? Or, if sed doesn't work, is there anything else I can use?

+4

regex shell sed cygwin

Goofyball Jul 16 '13 at 8:16

3 answers

This awk script looks for an opening <script> tag and sets a flag (named skip here, because in is a reserved word in awk and cannot be used as a variable name). While the flag is set, lines are not printed. When the closing </script> tag is seen, the flag is cleared after that line has been dropped, so printing resumes with the next line.

 awk '/<script/ { skip = 1 } !skip { print } /<\/script>/ { skip = 0 }' "$1"
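To tie this back to the loop in the question, here is a minimal sketch that applies a flag-based awk filter (with the flag named `skip`) to every *.htm file under a directory. The function name `strip_scripts_awk` and the `.tmp` suffix are assumptions made for illustration, and file names containing newlines are not handled.

```shell
#!/bin/sh
# strip_scripts_awk DIR
# Hypothetical helper (the name is not from the answer above).
# Rewrites every *.htm file under DIR, dropping all lines from one that
# contains "<script" through the next line that contains "</script>".
strip_scripts_awk() {
    find "$1" -type f -name '*.htm' | while IFS= read -r f; do
        # Write to a temporary file first, replace the original only on success.
        awk '/<script/ { skip = 1 } !skip { print } /<\/script>/ { skip = 0 }' \
            "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```

Because whole lines are dropped, any ordinary markup that shares a line with a script tag is lost as well.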

+1

suspectus Jul 16 '13 at 8:29

As you mentioned, the problem is that sed handles line-by-line input.

So the simplest workaround is to turn the input into a single line, for example by replacing the newline characters with a character that you are sure does not occur in your input.

One might be tempted to use tr:

 … |tr '\n' '_'|sed 's~<script>.*</script>~~g'|tr '_' '\n'

However, tr currently fully supports only single-byte characters, and to be safe you probably want to use some improbable character such as ˇ, which tr cannot handle.

Fortunately, the same thing can be achieved with sed alone, using branching.

Going back to the <script>…</script> example, the following works and should (according to the linked answer) be cross-platform:

 … |sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ˇ/g' -e 's~<script>.*</script>~~g' -e 's/ˇ/\n/g'

Or in a more concise form, if you use GNU sed and do not need cross-platform compatibility:

 … |sed ':a;N;$!ba;s/\n/ˇ/g;s~<script>.*</script>~~g;s/ˇ/\n/g'

For details on the branching part ( :a;N;$!ba; ), see the "Using branching" section of the linked answer. The rest is simple:

s/\n/ˇ/g replaces all newline characters with ˇ;
s~<script>.*</script>~~g deletes what needs to be removed. Beware that it needs some fixing before real use, because .* is greedy and deletes everything between the first <script> and the last </script>. Also note that I used ~ instead of / as the delimiter to avoid escaping the slash in </script>; almost any single-byte character would do, except a few reserved ones such as \ ;
s/ˇ/\n/g restores the newlines.
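One possible way to do the "fixing" mentioned above is sketched below, assuming GNU sed. Once the newlines have been swapped for ˇ, the newline character is free to serve as a temporary stand-in for </script>, so that [^\n]* stops at the nearest closing tag instead of the last one. The function name `strip_scripts_nongreedy` is an assumption for illustration, and the pattern is loosened to <script so that tags with attributes match too.

```shell
#!/bin/sh
# Hypothetical helper (the name is not from the answer); assumes GNU sed.
# Steps, in order:
#   1. slurp the whole input into the pattern space (:a;N;$!ba)
#   2. replace every newline with the placeholder character
#   3. turn each closing tag into a single newline character
#   4. delete from each opening tag up to the nearest newline marker
#   5. turn leftover markers (closing tags without an opener) back into tags
#   6. restore the real newlines
strip_scripts_nongreedy() {
    sed -e ':a' -e 'N' -e '$!ba' \
        -e 's/\n/ˇ/g' \
        -e 's~</script>~\n~g' \
        -e 's~<script[^\n]*\n~~g' \
        -e 's~\n~</script>~g' \
        -e 's/ˇ/\n/g' "$@"
}
```

For example, `strip_scripts_nongreedy page.htm` (with `page.htm` standing for any input file) prints the document with each script block removed separately; an empty line stays behind wherever a block occupied whole lines.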

0

Skippy le grand gourou Mar 24 '17 at 11:37

devnull · Accepted Answer · 2013-07-16T08:33:31+0000

Assuming the opening and closing <script> tags are on different lines, e.g. something like:

 foo
 bar
 <script type="text/javascript">
 some JS
 </script>
 foo

the following should work:

 sed '/<script/,/<\/script>/d' inputfile
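One caveat with an address range: sed starts looking for the end pattern only on the line after the one that opened the range, so a tag pair that sits on a single line (such as <script src=""></script>) would open a range that runs on to some later </script> or to the end of the file. A minimal sketch that removes same-line pairs first and then applies the range delete is shown here; the function name `strip_scripts_range` is an assumption for illustration.

```shell
#!/bin/sh
# Hypothetical helper (the name is not from the answer above).
# 1st expression: remove <script ...>...</script> pairs that open and close
#                 on the same line (greedy: several pairs on one line are
#                 removed together with any text between them).
# 2nd expression: delete every line from one still containing "<script"
#                 through the next line containing "</script>".
strip_scripts_range() {
    sed -e 's/<script[^>]*>.*<\/script>//g' \
        -e '/<script/,/<\/script>/d' "$@"
}
```

It reads the named files or standard input, e.g. `strip_scripts_range inputfile`. A line that held only a same-line pair is left behind as an empty line.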
