Removing all HTML tags from a web page

I am running several Bash shell scripts that use curl. If my curl command returns any text, I know that I have an error. The text returned by curl is usually HTML. I figured that if I strip all the HTML tags, I can display the remaining text as an error message.
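To illustrate, the check looks roughly like this (a sketch, not my real script; the URL and the error handling are placeholders):

    # Any text that curl prints here means the request failed.
    output_text=$(curl --silent --show-error "https://example.com/api" 2>&1)
    if [ -n "$output_text" ]; then
        # The captured text is usually an HTML error page; I want to
        # strip the tags and display only the remaining text.
        echo "Error: $output_text" >&2
    fi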

I was thinking of something like this:

    sed -E 's/<.*?>//g' <<<$output_text

But I get:

    sed: 1: "s/<.*?>//": RE error: repetition-operator operand invalid

If I replace *? with *, I don't get an error (but I don't get any text either). If I remove the global flag (g), I get the same error.

This is on Mac OS X.

3 answers

sed does not support non-greedy quantifiers; *? is a Perl/PCRE extension, which is why you get the repetition-operator error.

Try a negated character class instead, which stops at the first > on its own:

    's/<[^>]*>//g'
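Applied to the command from the question (assuming the same $output_text variable), it would look like:

    # [^>]* cannot cross a ">", so each match ends at the first ">",
    # giving the same effect as a non-greedy .*? for this job
    sed 's/<[^>]*>//g' <<< "$output_text"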

How about a Perl solution based on a real parser?

    perl -0777 -MHTML::Strip -nlE 'say HTML::Strip->new->parse($_)' file.html

You must first install the HTML::Strip module, with the cpan HTML::Strip command.
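Since the question strips a shell variable rather than a file, the same one-liner can read from a here-string instead (a sketch reusing the $output_text variable from the question):

    # -0777 slurps all input as one record, so multi-line tags are handled too
    perl -0777 -MHTML::Strip -nlE 'say HTML::Strip->new->parse($_)' <<< "$output_text"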

As an alternative, you can use the standard OS X utility textutil (see its man page):

    textutil -convert txt file.html

will create file.txt with the HTML tags stripped, or

    textutil -convert txt -stdin -stdout < file.html | some_command
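Closer to the question's setup, you can also pipe curl output straight through it (a sketch; the URL is a placeholder, and -format html tells textutil to treat stdin as HTML, since there is no file extension to guess from):

    curl --silent "https://example.com/" | textutil -convert txt -format html -stdin -stdout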

Another alternative

Some systems have the lynx text browser installed. You can use:

    lynx -dump file.html
    # or
    lynx -stdin -dump < file.html

But perhaps in your case only pure sed or awk solutions are usable, IMHO.

Still, if you have perl (just without the HTML::Strip module), the following is better than sed:

    perl -0777 -pe 's/<.*?>//sg'

because it will also remove a tag that spans multiple lines, which is very common:

    <a href="#"
       class="some"
    >link text</a>
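A quick way to see the difference (a throwaway test, not part of the question's script):

    # perl sees the whole input at once, so the tag split across lines is removed
    printf '<a href="#"\n class="some"\n>link text</a>\n' | perl -0777 -pe 's/<.*?>//sg'
    # prints: link text

    # sed works line by line, so pieces of the split tag survive
    printf '<a href="#"\n class="some"\n>link text</a>\n' | sed 's/<[^>]*>//g'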

For GNU sed:

    sed '/</{:k; s/<[^>]*>//g; /</{N; bk}}' file
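Written out with comments, the same logic is (behavior unchanged; file.html is a placeholder name):

    sed '
      /</ {
        # label k: top of the tag-stripping loop
        :k
        # remove every complete <...> tag from the pattern space
        s/<[^>]*>//g
        # if an unclosed "<" remains, the tag continues on the next line
        /</ {
          # append the next line and jump back to k
          N
          bk
        }
      }
    ' file.html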

This can still fail on tricky input (for example, a > inside a quoted attribute value), so real HTML parsing is the better option.

