Sed - match a newline in the pattern

I'm still a noob at shell scripting, but I'm trying. Below is a partially working shell script that is supposed to remove all JS from *.htm documents by matching the tags and deleting their contents, e.g. <script src="">, <script></script> and <script type="text/javascript">.

 find $1 -name "*.htm" > ./patterns
 for p in $(cat ./patterns)
 do
     sed -e "s/<script.*[.>]//g" $p #> tmp.htm ; mv tmp.htm $p
 done

The problem with this script is that, since sed reads its input one line at a time, it does not work as expected across newlines. For example:

 <script>
 //Foo
 </script>

will have the opening script tag removed, but the script skips "Foo" and the closing tag, which I don't want.

Is there a way to match newlines in my regex? Or, if sed doesn't work, is there anything else I can use?

3 answers

Assuming the <script> tags sit on lines of their own, e.g. something like:

 foo
 bar
 <script type="text/javascript">
 some JS
 </script>
 foo

the following should work:

 sed '/<script/,/<\/script>/d' inputfile 
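To tie this back to the script in the question, here is a minimal sketch that applies the same range delete to every *.htm file under a directory. The helper name strip_scripts_in_tree, the temporary directory and the file name sample.htm are all made up for illustration, and the loop assumes file names contain no newlines.

```shell
# Hypothetical helper: run the range delete on every *.htm file under "$1",
# rewriting each file through a temporary copy.
strip_scripts_in_tree() {
    find "$1" -type f -name '*.htm' | while IFS= read -r file; do
        tmp=$(mktemp) || break
        # Delete each line range from "<script" through "</script>" inclusive.
        sed '/<script/,/<\/script>/d' "$file" > "$tmp" && cat "$tmp" > "$file"
        rm -f "$tmp"
    done
}

# Throwaway sample tree (names are illustrative only).
workdir=$(mktemp -d)
printf '%s\n' 'foo' 'bar' '<script type="text/javascript">' 'some JS' '</script>' 'foo' \
    > "$workdir/sample.htm"

strip_scripts_in_tree "$workdir"
result=$(cat "$workdir/sample.htm")
rm -rf "$workdir"

printf '%s\n' "$result"
```

Keep in mind that the range delete removes whole lines, so any other text sharing a line with an opening or closing tag is removed as well.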

This awk script looks for an opening <script...> tag, sets the inscript flag, and skips to the next line. When a closing </script...> tag is detected, the flag is reset to zero. The final rule prints every line for which the flag is zero. (The flag cannot simply be named in, because in is a reserved word in awk.)

 awk '/<script.*>/ { inscript=1; next } /<\/script.*>/ { if (inscript) inscript=0; next } { if (!inscript) print }' "$1"

As you mentioned, the problem is that sed handles line-by-line input.

So the simplest workaround is to turn the input into a single line, e.g. by replacing the newline characters with a character that you are sure does not occur in your input.

One might be tempted to use tr:

 … |tr '\n' '_'|sed 's~<script>.*</script>~~g'|tr '_' '\n' 

However, tr currently fully supports only single-byte characters, and to be safe you probably want to use some improbable character like ˇ, for which tr is of no help.
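A small sketch of why the placeholder character matters: if it already occurs in the input, the round trip silently corrupts the text. The sample input below is invented for illustration.

```shell
# Sample input: note the underscore in "foo_bar".
input=$(printf '%s\n' 'foo_bar' '<script>' 'x' '</script>' 'baz')

# Same pipeline as above, with "_" standing in for newline.
result=$(printf '%s\n' "$input" |
    tr '\n' '_' |
    sed 's~<script>.*</script>~~g' |
    tr '_' '\n')

printf '%s\n' "$result"
```

The script block is gone, but "foo_bar" comes back split across two lines, because the final tr cannot tell an original underscore from a stand-in for a newline.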

Fortunately, the same thing can be achieved with sed alone, using branching.

Going back to the <script>…</script> example, this works and should (according to the previous link) be cross-platform:

 … |sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ˇ/g' -e 's~<script>.*</script>~~g' -e 's/ˇ/\n/g'

Or in a more concise form, if you use GNU sed and do not need cross-platform compatibility:

 … |sed ':a;N;$!ba;s/\n/ˇ/g;s~<script>.*</script>~~g;s/ˇ/\n/g'

For more information on the branching part ( :a;N;$!ba; ), see the related answer in the "Using Branching" section. The rest is simple:

  • s/\n/Λ‡/g replaces all newline characters with Λ‡
  • s~<script>.*</script>~~g deletes what needs to be removed (beware that some fixing is required for its actual use: since it will delete everything between the first <script> and the last </script> , also note that I used ~ instead of / to avoid escaping the slash in </script> : I could use almost any single-byte character, except for a few reserved ones, such as \ );
  • s/Λ‡/\n/g readlines newlines.
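The greediness caveat can be seen in a short sketch, along with one possible way around it. This assumes GNU sed (for \n in the replacement text). The two control bytes used as placeholders are an arbitrary choice for this sketch and are assumed never to occur in the input; the sample input is invented.

```shell
# Sample input: two script blocks with a line to keep between them.
input=$(printf '%s\n' 'a' '<script>' 'one' '</script>' 'keep' '<script>' 'two' '</script>' 'b')

# Placeholders for this sketch only (assumed absent from the input).
nl_mark=$(printf '\002')    # stands in for newline
end_mark=$(printf '\003')   # stands in for </script>

# Greedy version, same shape as the command above: everything from the
# first <script> to the last </script> is deleted, including "keep".
greedy=$(printf '%s\n' "$input" |
    sed -e ':a' -e 'N' -e '$!ba' \
        -e "s/\n/$nl_mark/g" \
        -e 's~<script>.*</script>~~g' \
        -e "s/$nl_mark/\n/g")

# Per-block version: collapse each closing tag to a single byte, then use
# a negated bracket expression so the match stops at the first closing tag.
per_block=$(printf '%s\n' "$input" |
    sed -e ':a' -e 'N' -e '$!ba' \
        -e "s/\n/$nl_mark/g" \
        -e "s~</script>~$end_mark~g" \
        -e "s~<script>[^$end_mark]*$end_mark~~g" \
        -e "s/$nl_mark/\n/g")

printf 'greedy:\n%s\n\nper block:\n%s\n' "$greedy" "$per_block"
```

The greedy run loses the "keep" line, while the per-block run removes only the two script blocks. The per-block variant only handles the bare <script> opening tag, like the command it is based on.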

Source: https://habr.com/ru/post/1491591/