Removing all HTML tags from a web page

I am running several Bash shell scripts that use curl. If my curl command returns any text, I know that I have an error. The text returned by curl is usually HTML. I figured that if I strip all the HTML tags, I can display the remaining text as an error message.
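To illustrate, the check looks roughly like this (a sketch, not my real script; the URL and the error handling are placeholders):

    # Any text that curl prints here means the request failed.
    output_text=$(curl --silent --show-error "https://example.com/api" 2>&1)
    if [ -n "$output_text" ]; then
        # The captured text is usually an HTML error page; I want to
        # strip the tags and display only the remaining text.
        echo "Error: $output_text" >&2
    fi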

I was thinking of something like this:

    sed -E 's/<.*?>//g' <<<$output_text

But I get:

    sed: 1: "s/<.*?>//": RE error: repetition-operator operand invalid

If I replace *? with *, I don't get an error (but I don't get any text either). If I remove the global flag (g), I get the same error.

This is on Mac OS X.

3 answers

sed does not support non-greedy quantifiers; *? is a Perl/PCRE extension, which is why you get the repetition-operator error.

Try a negated character class instead, which stops at the first > on its own:

    's/<[^>]*>//g'
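Applied to the command from the question (assuming the same $output_text variable), it would look like:

    # [^>]* cannot cross a ">", so each match ends at the first ">",
    # giving the same effect as a non-greedy .*? for this job
    sed 's/<[^>]*>//g' <<< "$output_text"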

How about a Perl solution based on a real parser?

    perl -0777 -MHTML::Strip -nlE 'say HTML::Strip->new->parse($_)' file.html

You must first install the HTML::Strip module, with the cpan HTML::Strip command.
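Since the question strips a shell variable rather than a file, the same one-liner can read from a here-string instead (a sketch reusing the $output_text variable from the question):

    # -0777 slurps all input as one record, so multi-line tags are handled too
    perl -0777 -MHTML::Strip -nlE 'say HTML::Strip->new->parse($_)' <<< "$output_text"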

As an alternative, you can use the standard OS X utility textutil (see its man page):

    textutil -convert txt file.html

will create file.txt with the HTML tags stripped, or

    textutil -convert txt -stdin -stdout < file.html | some_command
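Closer to the question's setup, you can also pipe curl output straight through it (a sketch; the URL is a placeholder, and -format html tells textutil to treat stdin as HTML, since there is no file extension to guess from):

    curl --silent "https://example.com/" | textutil -convert txt -format html -stdin -stdout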

Another alternative

Some systems have the lynx text browser installed. You can use:

    lynx -dump file.html
    # or
    lynx -stdin -dump < file.html

But perhaps in your case only pure sed or awk solutions are usable, IMHO.

Still, if you have perl (just without the HTML::Strip module), the following is better than sed:

    perl -0777 -pe 's/<.*?>//sg'

because it will also remove a tag that spans multiple lines, which is very common:

    <a href="#"
       class="some"
    >link text</a>
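A quick way to see the difference (a throwaway test, not part of the question's script):

    # perl sees the whole input at once, so the tag split across lines is removed
    printf '<a href="#"\n class="some"\n>link text</a>\n' | perl -0777 -pe 's/<.*?>//sg'
    # prints: link text

    # sed works line by line, so pieces of the split tag survive
    printf '<a href="#"\n class="some"\n>link text</a>\n' | sed 's/<[^>]*>//g'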

For GNU sed:

    sed '/</{:k; s/<[^>]*>//g; /</{N; bk}}' file
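Written out with comments, the same logic is (behavior unchanged; file.html is a placeholder name):

    sed '
      /</ {
        # label k: top of the tag-stripping loop
        :k
        # remove every complete <...> tag from the pattern space
        s/<[^>]*>//g
        # if an unclosed "<" remains, the tag continues on the next line
        /</ {
          # append the next line and jump back to k
          N
          bk
        }
      }
    ' file.html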

This can still fail on tricky input (for example, a > inside a quoted attribute value), so real HTML parsing is the better option.

