Removing a <div> from a text file?
I've made a small program in C# .NET that really doesn't accomplish much: it tells you the probability of your DOOM based on today's news, LOL. It downloads an RSS feed from the BBC website and then searches it for keywords that either increase or decrease the percentage probability of DOOM.
A crazy little project, but maybe one day its classes will come in handy again for something more important.
I get the RSS in XML format, but it contains a lot of div tags and formatting characters that I really don't want in the keywords database.
What is the best way to remove these unwanted characters and divs?
Thanks,
Ash
If you also want to remove DIV tags together with their content:
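string start = "<div>";
string end = "</div>";
// Lazy match up to the first closing tag. A negated character class such as [^</div>]
// would exclude the individual characters <, /, d, i, v and >, not the tag as a whole.
// Singleline lets '.' match newlines, so the content may span several lines.
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);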
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>
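Real feed markup often carries attributes on the div (for example a class), which a literal "<div>" start string will not match. Here is a minimal sketch of a more tolerant variant, assuming the divs are not nested; the names DivStripper and RemoveDivs are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivStripper
{
    // Matches an opening <div ...> tag (with or without attributes), its content,
    // and the nearest closing </div>, ignoring case.
    // Assumption: divs are not nested; nested divs need a real HTML/XML parser.
    private static readonly Regex DivBlock = new Regex(
        @"<div\b[^>]*>.*?</div\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string RemoveDivs(string input)
    {
        return DivBlock.Replace(input, string.Empty);
    }
}
```

Calling DivStripper.RemoveDivs("<xml><div class=\"a\">junk</div>XXX<div>junk2</div></xml>") should give <xml>XXX</xml>.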
IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove, you will of course need to modify the regular expression. You will find a lot of material on this and other methods if you do a web search for "strip html C#".
The SO question "Rendering or converting Html to rich text (.NET)" may also help.
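Since the question also mentions formatting characters, here is a rough sketch of one way to combine tag stripping with entity decoding and whitespace cleanup before the text goes to the keyword search. TextCleaner and StripHtml are hypothetical names, and the tag pattern assumes no attribute value contains a ">" character:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    public static string StripHtml(string input)
    {
        // Replace anything that looks like a tag with a space so words don't run together.
        string noTags = Regex.Replace(input, @"<[^>]+>", " ");
        // Turn entities such as &amp; back into plain characters.
        string decoded = WebUtility.HtmlDecode(noTags);
        // Collapse runs of whitespace (tabs, newlines, doubled spaces) into single spaces.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For example, TextCleaner.StripHtml("<p>Markets &amp; <b>doom</b>\n\tloom</p>") should return "Markets & doom loom".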