I need to parse sections from an HTML string. For instance:
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>[section=quote]</p>
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
<p>[/section]</p>
Parsing the quote section should return:
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
I am currently using a regex to capture the content inside [section=quote] ... [/section], but because the sections are entered through a WYSIWYG editor, the section tags themselves end up wrapped in paragraph tags, so the parse result is:
</p>
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
<p>
The regular expression that I am currently using is:
\[section=(.+?)\](.+?)\[/section\]
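To make the problem concrete, here is a minimal sketch that runs the pattern above next to a variant that treats the wrapping <p> tags as part of the markers. The names `SectionRegexSketch`, `Current`, `Wrapped` and `Extract` are made up for illustration. I am assuming `RegexOptions.Singleline` and an editor that always emits bare <p> tags with no attributes around the markers; I am not sure how well the variant holds up outside those assumptions (nested sections, for example, are not handled).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SectionRegexSketch
{
    // The pattern from the question; Singleline lets '.' span line breaks.
    public static readonly Regex Current = new Regex(
        @"\[section=(.+?)\](.+?)\[/section\]",
        RegexOptions.Singleline);

    // Hypothetical variant: the <p>...</p> around each marker is consumed by
    // the pattern, so it no longer leaks into the captured content.
    // Assumes the markers sit alone inside attribute-free <p> tags.
    public static readonly Regex Wrapped = new Regex(
        @"<p>\s*\[section=(.+?)\]\s*</p>(.+?)<p>\s*\[/section\]\s*</p>",
        RegexOptions.Singleline);

    // Returns section name (lower-cased) -> captured content.
    // A later section with the same name overwrites an earlier one.
    public static Dictionary<string, string> Extract(Regex pattern, string html)
    {
        var sections = new Dictionary<string, string>();
        foreach (Match m in pattern.Matches(html))
            sections[m.Groups[1].Value.ToLowerInvariant()] = m.Groups[2].Value;
        return sections;
    }
}
```

On the cleaned sample (no whitespace between tags), `Current` captures the paragraph together with the stray closing and opening <p> tags, while `Wrapped` captures only the paragraph itself.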
And I do an additional cleanup before parsing the sections:
protected string CleanHtml(string input) {
    input = Regex.Replace(input, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);
    input = Regex.Replace(input, @"<p\s*/>|<p>\s*</p>", string.Empty);
    return input;
}
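For reference, this is roughly what the cleanup does to a small input. The sketch repeats the same two replacements in a static method so it can run on its own; the class name `CleanHtmlSketch` and method name `Clean` are only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class CleanHtmlSketch
{
    // Same two replacements as CleanHtml, made static so it runs standalone.
    public static string Clean(string input)
    {
        // Strip whitespace on both sides of every tag.
        input = Regex.Replace(input, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);
        // Drop self-closed and empty paragraphs.
        input = Regex.Replace(input, @"<p\s*/>|<p>\s*</p>", string.Empty);
        return input;
    }
}
```

Given "<p>Lorem</p>", a whitespace-only paragraph and "<p>[section=quote]</p>" on separate lines, this returns "<p>Lorem</p><p>[section=quote]</p>". One side effect worth knowing about: the first replacement also removes spaces around inline tags, so "<p>a <b>b</b> c</p>" becomes "<p>a<b>b</b>c</p>".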
Can someone provide a regex that will achieve what I'm looking for, or am I wasting my time trying to do this with Regex? I have seen references to the Html Agility Pack. Would it be a better fit for something like this?
[Update]
Following Oscar's suggestion, I combined the Html Agility Pack with Regex. Here is what I have so far:
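public void ParseSections(string content)
{
    this.SourceContent = content;
    this.NonSectionedContent = content;
    content = CleanHtml(content);

    // Nothing to do if the content contains no section markers at all.
    if (!sectionRegex.IsMatch(content))
        return;

    var doc = new HtmlDocument();
    doc.LoadHtml(content);

    bool inSection = false;
    string sectionName = string.Empty;
    var sectionContent = new StringBuilder();
    var unsectioned = new StringBuilder();

    // SelectNodes returns null, not an empty collection, when nothing matches.
    var paragraphs = doc.DocumentNode.SelectNodes("//p");
    if (paragraphs == null)
        return;

    foreach (var n in paragraphs) {
        // An opening marker paragraph: remember the name and skip the marker itself.
        var start = startSectionRegex.Match(n.InnerText);
        if (start.Success) {
            inSection = true;
            sectionName = start.Groups[1].Value.ToLowerInvariant();
            continue;
        }
        // A closing marker paragraph: store what was collected and reset the buffer.
        if (endSectionRegex.IsMatch(n.InnerText)) {
            inSection = false;
            this.Sections.Add(sectionName, sectionContent.ToString());
            sectionContent.Clear();
            continue;
        }
        // Any other paragraph goes to the current section or to the unsectioned output.
        if (inSection)
            sectionContent.Append(n.OuterHtml);
        else
            unsectioned.Append(n.OuterHtml);
    }
    this.NonSectionedContent = unsectioned.ToString();
}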
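The three regex fields used in ParseSections are not shown above. As a rough sketch, they could look something like the following. The exact patterns and options are assumptions for illustration, not the real declarations: the class name `SectionPatterns` is hypothetical, and the PascalCase names stand in for the camelCase fields in the method.

```csharp
using System.Text.RegularExpressions;

public static class SectionPatterns
{
    // Assumed: detects at least one complete [section=...] ... [/section] pair.
    public static readonly Regex SectionRegex = new Regex(
        @"\[section=(.+?)\](.+?)\[/section\]",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Assumed: a paragraph whose whole text is an opening marker; group 1 is the name.
    public static readonly Regex StartSectionRegex = new Regex(
        @"^\s*\[section=(.+?)\]\s*$",
        RegexOptions.IgnoreCase);

    // Assumed: a paragraph whose whole text is the closing marker.
    public static readonly Regex EndSectionRegex = new Regex(
        @"^\s*\[/section\]\s*$",
        RegexOptions.IgnoreCase);
}
```

Anchoring the start and end patterns to the whole paragraph text means a paragraph that merely mentions a marker in the middle of a sentence is treated as ordinary content.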