Parsing HTML sections in C#

Question

I need to parse sections from an HTML string. For instance:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>[section=quote]</p>
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
<p>[/section]</p>

Parsing the quote section should return:

<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>

I am currently using a regex to capture the content between [section=quote] and [/section]. However, because the sections are entered through a WYSIWYG editor, the section tags themselves get wrapped in paragraph tags, so the parsed result is:

</p>
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
<p>

The regular expression that I am currently using is:

\[section=(.+?)\](.+?)\[/section\]

And I do an additional cleanup before parsing the sections:

protected string CleanHtml(string input) {
    // remove whitespace
    input = Regex.Replace(input, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);
    // remove empty p elements
    input = Regex.Replace(input, @"<p\s*/>|<p>\s*</p>", string.Empty);
    return input;
}

Can someone provide a regex that achieves what I'm looking for, or am I wasting my time trying to do this with regex? I have seen references to the Html Agility Pack; would that be a better fit for something like this?

[Update]

Following Oscar's answer, I ended up combining the Html Agility Pack with Regex to parse the sections:

public void ParseSections(string content)
{
    this.SourceContent = content;
    this.NonSectionedContent = content;

    content = CleanHtml(content);

    if (!sectionRegex.IsMatch(content))
        return;

    var doc = new HtmlDocument();
    doc.LoadHtml(content);

    bool flag = false;
    string sectionName = string.Empty;
    var sectionContent = new StringBuilder();
    var unsectioned = new StringBuilder();

    foreach (var n in doc.DocumentNode.SelectNodes("//p")) {               
        if (startSectionRegex.IsMatch(n.InnerText)) { 
            flag = true;
            sectionName = startSectionRegex.Match(n.InnerText).Groups[1].Value.ToLowerInvariant();
            continue;
        }
        if (endSectionRegex.IsMatch(n.InnerText)) {
            flag = false;
            this.Sections.Add(sectionName, sectionContent.ToString());
            sectionContent.Clear();
            continue;
        }

        if (flag)
            sectionContent.Append(n.OuterHtml);
        else
            unsectioned.Append(n.OuterHtml);
    }

    this.NonSectionedContent = unsectioned.ToString();
}

c# regex html-parsing html-agility-pack

Ben Foster 08 . '11 10:14

2 Answers

Why not replace <p>[section=quote]</p> with [section=quote], and <p>[/section]</p> with [/section], before parsing? Your existing regex should then work as is.
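
A minimal sketch of this idea, assuming the markers are always wrapped in a plain <p> with no attributes. The class and method names (SectionMarkerDemo, UnwrapMarkers, ExtractSection) are hypothetical and only for illustration. The extraction pattern is the one from the question, narrowed to a single section name and run with RegexOptions.Singleline so that '.' can cross the line breaks between paragraphs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SectionMarkerDemo
{
    // Hypothetical helper: removes the <p>...</p> wrapper around
    // [section=...] and [/section] markers, leaving other paragraphs alone.
    public static string UnwrapMarkers(string html)
    {
        return Regex.Replace(html, @"<p>\s*(\[/?section[^\]]*\])\s*</p>", "$1");
    }

    // Hypothetical helper: returns the content of the named section,
    // or null when no such section is present.
    public static string ExtractSection(string html, string name)
    {
        string unwrapped = UnwrapMarkers(html);
        string pattern = @"\[section=" + Regex.Escape(name) + @"\](.+?)\[/section\]";
        Match m = Regex.Match(unwrapped, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    public static void Main()
    {
        string html =
            "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>\n" +
            "<p>[section=quote]</p>\n" +
            "<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>\n" +
            "<p>[/section]</p>";

        Console.WriteLine(ExtractSection(html, "quote"));
    }
}
```

This still relies on regex over HTML, so it is only as robust as the editor's output is regular; markers wrapped in a <p> that carries attributes, or in another element, would slip through unchanged.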

Tomba 08 . '11 11:06

Oscar Mederos · Accepted Answer · 2011-02-17T05:10:31+0000

The following works, using the HtmlAgilityPack library:

using System;
using System.Text;
using HtmlAgilityPack;

...

HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\file.html");


bool flag = false;
var sb = new StringBuilder();
foreach (var n in doc.DocumentNode.SelectNodes("//p"))
{
    switch (n.InnerText)
    {
        case "[section=quote]":
            flag = true;
            continue;
        case "[/section]":
            flag = false;
            break;
    }
    if (flag)
    {
        sb.AppendLine(n.OuterHtml);
    }
}

Console.Write(sb);
Console.ReadLine();

If you only want "Mauris at turpis nec dolor bibendum sollicitudin ac quis neque." to be printed, without the surrounding <p>...</p>, replace n.OuterHtml with n.InnerHtml.

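Of course, you should check whether doc.DocumentNode.SelectNodes("//p") returns null.
If you want to load the HTML from an online source instead of a file, you can do: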

var htmlWeb = new HtmlWeb();  
var doc = htmlWeb.Load("http://..../page.html");
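
The null check mentioned above could look roughly like this. It is a sketch rather than a drop-in replacement: the class and method names (QuoteSectionDemo, ExtractQuote) are hypothetical, the HTML is loaded from a string with LoadHtml instead of from a file, and the marker text is trimmed before comparison on the assumption that the editor may leave whitespace around it. It requires the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Text;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class QuoteSectionDemo
{
    // Hypothetical helper: returns the outer HTML of every <p> found
    // between the [section=quote] and [/section] marker paragraphs.
    public static string ExtractQuote(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
            return string.Empty;

        bool inside = false;
        var sb = new StringBuilder();
        foreach (HtmlNode n in paragraphs)
        {
            string text = n.InnerText.Trim();
            if (text == "[section=quote]") { inside = true; continue; }
            if (text == "[/section]") { inside = false; continue; }
            if (inside)
                sb.Append(n.OuterHtml);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ExtractQuote(
            "<p>Lorem</p><p>[section=quote]</p><p>Mauris</p><p>[/section]</p>"));
        // No <p> elements at all: the null guard yields an empty string.
        Console.WriteLine(ExtractQuote("<div>no paragraphs here</div>").Length);
    }
}
```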

Edit:

If [section=quote] and [/section] could be inside another tag (not <p>), you can replace doc.DocumentNode.SelectNodes("//p") with doc.DocumentNode.SelectNodes("//*").
