What is the best way to remove HTML from a string?

Question

I recently started using the following RegEx in the REReplace() function to strip HTML tags from a string in ColdFusion. Please note: I am not using this as protection against XSS or SQL injection; it is only meant to remove existing, safe HTML from a string before displaying it in an HTML title attribute.

REReplaceNoCase(str,"<[^>]*>","","ALL")

In a semi-related question, I asked how to modify my RegEx to also cover spaces and line breaks. I was told that RegEx is not the right tool for this, and this post was cited as the explanation:

I strongly suspect that the regular expressions you posted do not actually work correctly. I would advise you not to use regular expressions to parse HTML, since HTML is not a regular language. Use an HTML parser instead. (Mark Byers)

If so, what is a suitable tool for removing HTML from a string before rendering it? (Bear in mind that the HTML is already safe; it is sanitized before entering the database.)

I know about HTMLEditFormat() and HTMLCodeFormat(), but neither function does what I need: the former replaces special characters with their HTML-escaped equivalents, while the latter does the same but also wraps the string in a <pre> tag.

What I would like to do is strip HTML and line breaks from the string before displaying it in an HTML title attribute: <a title="My string without HTML goes here">...</a>

There are other times when the HTML is not wanted. For example, you might want to display an excerpt of a post without the HTML stored with it.

+3

coldfusion regex

Mohamad Dec 29 '10 at 0:19

3 answers

Use the Chilkat HTML parser. We used it in a training project to extract all the content and hyperlinks from HTML pages in order to build a basic search engine.

+1

A_var Dec 29 '10 at 4:16

If it is an HTML snippet that is going to be included in a title attribute, you can probably cover all the bases with regular expressions and enough testing.

However, as a general hint, if you have to process a larger fragment, I would go the XML/DOM route with Java, either by traversing the document with dom4j and grabbing the text or, more likely, by building the result with a StringBuilder from a SAX parser.

[EDIT] When I first answered, I was going to write that the HTML should be reasonably well formed, but I assumed that you have at least some control over the source. If you do not, I would look into JTidy and TagSoup. I have not tested them, but they are definitely the first things I would try for consuming real-world HTML with CF.
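
The event-driven idea described above (collect text as the parser reports it, then join the pieces) can be sketched as follows. This is only an illustration in Python, using the standard-library html.parser module as a stand-in for a SAX-style parser; it is not the dom4j, JTidy, or TagSoup API, and the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes as the parser reports them (SAX-style events)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves never reach here.
        self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML fragment with whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered by the parser
    # split() + join() turns runs of spaces and line breaks into single spaces.
    return " ".join("".join(extractor.parts).split())


# Hypothetical sample input
print(html_to_text('<p>My <b>string</b>\nwithout HTML &amp; tags</p>'))
# My string without HTML & tags
```

Note that this sketch keeps the contents of script and style elements as text, and the result would still need attribute escaping before being written into a title attribute.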

+1

Pif Jan 2 '10 at 22:56

Charles · Accepted Answer · 2010-12-29T01:46:17+0000

I disagree with the reasoning you quote. While HTML as a whole should not be parsed with regexes, matching individual tags is a job they are well suited to.

But you need to be more careful than just <[^>]*> , as that would turn

 <span title=">">...</span>

into the malformed

 ">...

So you need something like <([^"'>]|"[^"]*"|'[^']*')*> . You can strip line breaks with a plain character replacement instead of a regular expression, but if you prefer a regular expression, something like \n will do (you could even combine it with the pattern above using alternation, but that is likely to be even less efficient).
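
A minimal sketch of the difference between the two patterns, written in Python purely to exercise the regular expressions (the same patterns should carry over to REReplace, but that is untested here). It uses the quote-aware pattern with both quote characters excluded from the first character class, which the single-quote branch needs in order to be reached. The helper names and the sample string are hypothetical.

```python
import re

# Naive pattern from the question: a tag ends at the first ">".
NAIVE_TAG = re.compile(r"<[^>]*>")

# Quote-aware pattern: quoted attribute values may contain ">".
QUOTED_TAG = re.compile(r"""<(?:[^"'>]|"[^"]*"|'[^']*')*>""")


def strip_tags(text, pattern=QUOTED_TAG):
    """Remove everything the given tag pattern matches."""
    return pattern.sub("", text)


def strip_line_breaks(text):
    """Plain character replacement, no regex: drop CR, turn LF into a space."""
    return text.replace("\r", "").replace("\n", " ")


# Hypothetical input: an attribute value that contains ">"
sample = '<span title=">">Hello</span>\r\nworld'

print(repr(strip_tags(sample, NAIVE_TAG)))           # '">Hello\r\nworld'
print(repr(strip_line_breaks(strip_tags(sample))))   # 'Hello world'
```

Neither pattern handles comments, CDATA sections, or script contents, so this only fits the stated case of already sanitized markup.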
