Remove some tags from an HTML document using C#

I have an HTML document and I want to remove all divs with a specific class (along with all their content). What is the easiest way to do this?

Thank you for your help.

UPDATED:

I tried the Html Agility Pack, as you advised, but I could not get it to work. I have the following code:

    static void Main()
    {
        HtmlDocument document = new HtmlDocument();
        document.Load(FileName);
        HtmlNode node = document.DocumentNode;
        HandleNode(node);
    }

    private static void HandleNode(HtmlNode node)
    {
        while (node != null)
        {
            if (node.Name == "div")
            {
                var attribute = node.Attributes.Where(x => x.Name == "class" && x.Value == "NavContent");
                if (attribute.Any())
                    node.Remove();
            }

            foreach (var childNode in node.ChildNodes)
            {
                HandleNode(childNode);
            }
        }
    }

But it does not do what I want. The recursion never ends, and the node name is always a comment. Here's the HTML document I'm trying to parse: http://en.wiktionary.org/wiki/work

Is there a good example of how to work with the Html Agility Pack? What is wrong with this piece of code?
+4
4 answers

Depending on how complex your HTML is, you will probably need the HTML Agility Pack library.

Update:

HandleNode() contains a while (node != null) loop but never reassigns node, so the loop can never terminate. I would change it to an if (...) for starters.
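
As a rough sketch of that suggestion, the recursion might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the class name DivRemover is made up for illustration. Besides replacing the while with an if-style guard, it iterates over a copy of ChildNodes, since removing a node changes that collection while it is being enumerated.

```csharp
using System.Linq;
using HtmlAgilityPack;

static class DivRemover // hypothetical helper class, not part of the question
{
    // Removes every <div class="NavContent"> at or below the given node.
    public static void HandleNode(HtmlNode node)
    {
        if (node == null)
            return;

        // Same exact-match test on the class attribute as in the question.
        if (node.Name == "div" && node.GetAttributeValue("class", "") == "NavContent")
        {
            node.Remove(); // drops the div together with all of its content
            return;        // no need to visit the children of a removed node
        }

        // Walk a snapshot of the children: Remove() modifies ChildNodes.
        foreach (var child in node.ChildNodes.ToList())
            HandleNode(child);
    }
}
```

It would be called as DivRemover.HandleNode(document.DocumentNode). Note that the Main in the question never writes the result back; something like document.Save(FileName) afterwards is probably needed if the file itself should change.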

+9

You are looking for the HTML Agility Pack.

+2

To solve your problem, you can use LINQ:

    foreach (var node in doc.DocumentNode
                 .Descendants("div")
                 .Where(d => d.GetAttributeValue("class", "").IndexOf("NavContent") >= 0)
                 .ToArray())
    {
        node.Remove();
    }

+2

I usually solve this kind of problem with file I/O and regular expressions (which is generally not recommended for handling XML/HTML documents, as the commenters pointed out).

However, if you want to do it right, I'm sure there is a DOM library for C# out there.

This one seems to support XPath queries, which is pretty handy.
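
The link behind "This one" did not survive, so the following is only a sketch under the assumption that it pointed to the HTML Agility Pack, which the other answers recommend and which exposes XPath through SelectNodes. The class and method names (XPathExample, RemoveNavContent) are made up for illustration.

```csharp
using System.Linq;
using HtmlAgilityPack;

static class XPathExample // hypothetical name, for illustration only
{
    // Returns the given HTML with every <div class="NavContent"> removed.
    public static string RemoveNavContent(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var divs = doc.DocumentNode.SelectNodes("//div[@class='NavContent']");
        if (divs != null)
        {
            // Iterate over a copy so removals cannot disturb the enumeration.
            foreach (var div in divs.ToList())
                div.Remove();
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

The XPath predicate @class='NavContent' only matches when the attribute is exactly that value; for elements that carry several classes, something like contains(@class, 'NavContent') may be a better fit, at the cost of also matching longer class names that contain the same text.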

0

Source: https://habr.com/ru/post/1304137/