Remove some tags from an HTML document using C#

Question

I have an HTML document and I want to remove all divs of a specific class (along with all their content). What is the easiest way to do this?

Thank you for your help.

UPDATED:

I tried the Html Agility Pack, as you advised, but I could not get it to work. I have the following code:

    static void Main()
    {
        HtmlDocument document = new HtmlDocument();
        document.Load(FileName);
        HtmlNode node = document.DocumentNode;
        HandleNode(node);
    }

    private static void HandleNode(HtmlNode node)
    {
        while (node != null)
        {
            if (node.Name == "div")
            {
                var attribute = node.Attributes.Where(x => x.Name == "class" && x.Value == "NavContent");
                if (attribute.Any())
                    node.Remove();
            }
            foreach (var childNode in node.ChildNodes)
            {
                HandleNode(childNode);
            }
        }
    }

But this does not do what I want. The recursion never ends, and the node name is always a comment.

Here is the HTML document I am trying to parse: http://en.wiktionary.org/wiki/work

Is there a good example of how to work with the Html Agility Pack? What is wrong with this piece of code?

+4

html c#

Stuffhappens Mar 15 '10 at 12:17

4 answers

You are looking for the HTML Agility Pack.

+2

SLaks Mar 15 '10 at 12:23

To solve your problem, you can use LINQ:

    foreach (var node in doc.DocumentNode
                 .Descendants("div")
                 .Where(d => d.GetAttributeValue("class", "").IndexOf("NavContent") >= 0)
                 .ToArray())
        node.Remove();

+2

SLaks Mar 15 '10 at 13:44

I usually solve this kind of problem with file I/O and regular expressions (which is generally not recommended for handling XML/HTML documents, as commenters have pointed out).

However, if you want to do it right, I'm sure there is a DOM library for C# out there.

This one seems to support XPath queries, which is pretty handy.
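To illustrate the XPath route, here is a minimal sketch. It assumes the library in question is the Html Agility Pack (the HtmlAgilityPack NuGet package) and reuses the class name NavContent from the question; the sample markup and the class name XPathExample are made up for the example.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is referenced

static class XPathExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical sample markup; the real page would be loaded with doc.Load(path).
        doc.LoadHtml("<div class='NavContent'>remove me</div><p>keep me</p>");

        // Match divs whose class attribute contains NavContent as a whole token.
        var matches = doc.DocumentNode.SelectNodes(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' NavContent ')]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (matches != null)
        {
            // Copy to a list first so removal cannot disturb the enumeration.
            foreach (var div in matches.ToList())
                div.Remove();
        }

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

The concat/normalize-space test is one common way to match a single class among several; a plain @class='NavContent' comparison would only match when it is the sole class.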

0

Vinzz Mar 15 '10 at 12:20

Henk Holterman · Accepted Answer · 2010-03-15T12:22:39+0000

Depending on how complex your HTML is, you will probably need the Html Agility Pack library.

Update:

HandleNode() contains a while (node != null) loop but never reassigns node, so the loop can never exit. I would change this to if (...) for starters.
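A rough sketch of what that change could look like, assuming the Html Agility Pack types used in the question. Two further adjustments here are my own and not part of the original code: the method returns right after removing a node, and it walks a copy of ChildNodes so that removing a child does not modify the collection being enumerated. The class name NavContentRemover is made up for the example.

```csharp
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is referenced

static class NavContentRemover
{
    public static void HandleNode(HtmlNode node)
    {
        // "if" instead of "while": each call handles one node, then recurses.
        if (node == null)
            return;

        if (node.Name == "div" &&
            node.GetAttributeValue("class", "") == "NavContent")
        {
            // Removing the div also drops its whole subtree, so stop here.
            node.Remove();
            return;
        }

        // Iterate over a snapshot, because Remove() changes node.ChildNodes.
        foreach (var child in node.ChildNodes.ToList())
        {
            HandleNode(child);
        }
    }
}
```

It would be called the same way as in the question, with HandleNode(document.DocumentNode) after document.Load(FileName).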
