As you mentioned, the problem is that sed processes its input line by line.
So the simplest workaround is to turn the input into a single line, for example by replacing the newline characters with a character that you are sure does not occur in your input.
One might be tempted to use tr:
… | tr '\n' '_' | sed 's~<script>.*</script>~~g' | tr '_' '\n'
However, tr currently fully supports only single-byte characters, and to be safe you probably want to use some improbable character like Λ, for which tr is of no help.
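To illustrate why the placeholder must not occur in the input, here is a small sketch of the tr pipeline run on text that already contains an underscore. The function name `strip_with_tr` and the sample input are made up for this illustration.

```shell
# Hypothetical helper (name chosen for this example only) wrapping the
# tr-based pipeline shown above.
strip_with_tr() {
  tr '\n' '_' | sed 's~<script>.*</script>~~g' | tr '_' '\n'
}

# 'my_var' already contains the placeholder '_', so the final tr call
# turns that underscore into a newline and splits the word in two.
printf '%s\n' 'my_var' '<script>' 'x' '</script>' 'end' | strip_with_tr
```

The script block is removed as intended, but `my_var` comes back as two separate lines, `my` and `var`, which is why an improbable placeholder is preferable.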
Fortunately, the same thing can be achieved with sed alone, using branching.
Going back to the <script>…</script> example, the following works and should (according to the previous link) be cross-platform:
… | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/Λ/g' -e 's~<script>.*</script>~~g' -e 's/Λ/\n/g'
Or in a more concise form, if you use GNU sed and do not need cross-platform compatibility:
… | sed ':a;N;$!ba;s/\n/Λ/g;s~<script>.*</script>~~g;s/Λ/\n/g'
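As a quick check, the concise form can be tried on a few lines of sample input. This is only a sketch: it assumes GNU sed, and the function name `strip_script_blocks` and the sample text are invented for the example.

```shell
# Hypothetical helper (name chosen for this example only).
# Assumes GNU sed: ';' directly after a label and '\n' in the
# replacement text are not guaranteed to work in other implementations.
strip_script_blocks() {
  sed ':a;N;$!ba;s/\n/Λ/g;s~<script>.*</script>~~g;s/Λ/\n/g'
}

# The three lines of the script block are removed; the newlines that
# surrounded the block are kept, so an empty line remains in its place.
printf '%s\n' 'before' '<script>' 'alert(1);' '</script>' 'after' | strip_script_blocks
```

With this input the result is `before`, an empty line, and `after`.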
For more information on the branching part (:a;N;$!ba;), see the "Using Branching" section of the related answer. The rest is simple:
s/\n/Λ/g replaces every newline character with Λ;
s~<script>.*</script>~~g deletes what needs to be removed. Beware that it needs some fixing before real use: since .* is greedy, it will delete everything between the first <script> and the last </script>. Also note that I used ~ instead of / as the delimiter to avoid escaping the slash in </script>; I could use almost any single-byte character, except for a few reserved ones such as \;
s/Λ/\n/g restores the newlines.
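One possible way to address the greedy-match caveat is sketched below. It is not part of the original commands. Once the newlines have been folded into Λ, the newline character itself is free to serve as a single-character marker for each </script>, so a negated bracket expression can stop at the nearest closing tag. The sketch assumes GNU sed (for `\n` inside a bracket expression and in the replacement); the function name `strip_each_script_block` is made up.

```shell
# Hypothetical helper (name chosen for this example only); assumes GNU sed.
# Steps after the ':a;N;$!ba' slurp loop:
#   1. fold every newline into the placeholder Λ
#   2. mark each closing tag with a newline (now unused in the buffer)
#   3. delete from each <script> up to the nearest marker
#   4. turn leftover markers (closing tags without an opener) back into </script>
#   5. restore the original newlines
strip_each_script_block() {
  sed -e ':a' -e 'N' -e '$!ba' \
      -e 's/\n/Λ/g' \
      -e 's~</script>~\n~g' \
      -e 's~<script>[^\n]*\n~~g' \
      -e 's~\n~</script>~g' \
      -e 's/Λ/\n/g'
}

# Two separate script blocks with a line 'b' between them.
printf '%s\n' 'a' '<script>x</script>' 'b' '<script>' 'y' '</script>' 'c' |
  strip_each_script_block
```

Here the line `b` between the two blocks survives, whereas the greedy `.*` version would have removed it together with both blocks.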