How to parse this piece of HTML?

Question


Good morning! I am using C# (.NET Framework 3.5 SP1) and want to parse the following HTML fragment using regex:

<h1>My caption</h1>
<p>Here will be some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

I need the following output:

group 1: content of the h1
group 2: the text following the h1
group 3-n: content of each subcaption + its text

What I have at the moment:

<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>

This gives me only every odd subcaption plus its content (e.g. 1, 3, ...), because the trailing <hr/> is consumed by the match. For parsing the h1 caption I have another pattern (<h1.*?>(.*?)</h1>) that gives me only the title but not the content; I'm fine with this at the moment.

Does anyone have a hint or solution for me, or an alternative approach (e.g. parsing the HTML with a reader and assigning the parts that way)?

HTMLAgilityPack, . <h1> -tag.
... myproblem . : - <p> <div> <ul>... atm ...? ?


html c# html-agility-pack

Andreas Niedermair, Jan 19 '10 at 6:49


HTML-


YOU, Jan 19 '10 at 6:51

Mark Byers · Answer 1 · 2010-01-19T06:51:44+0000

Don't use regex to parse HTML. Use the HTML Agility Pack instead.
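
As a rough illustration of this approach applied to the fragment from the question, here is a minimal sketch. It assumes the HtmlAgilityPack assembly is referenced and that the captions and their content are top-level siblings, as in the question; the class name FragmentParser and the method Split are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack; // third-party library, assumed to be referenced

public static class FragmentParser
{
    // Hypothetical helper: returns (caption, content) pairs in document order,
    // the h1 section first, then one entry per h2 section.
    public static List<KeyValuePair<string, string>> Split(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sections = new List<KeyValuePair<string, string>>();
        string caption = null;
        var body = new StringBuilder();

        // Walk the top-level nodes (elements and text) in document order.
        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            if (node.Name == "h1" || node.Name == "h2")
            {
                // A new caption closes the previous section, if there was one.
                if (caption != null)
                    sections.Add(new KeyValuePair<string, string>(caption, body.ToString().Trim()));
                caption = node.InnerText.Trim();
                body.Length = 0; // reset the buffer (works on .NET 3.5)
            }
            else if (node.Name != "hr" && caption != null)
            {
                // Keep everything else (p, div, ul, text, ...) as markup.
                body.Append(node.OuterHtml);
            }
        }

        // Close the last open section.
        if (caption != null)
            sections.Add(new KeyValuePair<string, string>(caption, body.ToString().Trim()));

        return sections;
    }
}
```

Called on the fragment from the question, this should return four pairs: the h1 caption with the markup that follows it, then one pair per h2. Because the separator <hr> elements are simply skipped rather than matched, no section is lost the way it is with the regex, and the content may be any mix of <p>, <div>, <ul> and so on.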

lexmooze · Answer 2 · 2011-12-19T13:29:59+0000

The options:

REGEX

HtmlAgilityPack

SGMLReader
http://developer.mindtouch.com/SgmlReader

Majestic-12
http://www.majestic12.co.uk/projects/html_parser.php

SgmlReader example (VB.NET):

' Requires Imports System.Xml.Linq and a reference to the SgmlReader assembly.
' vSource is assumed to hold the HTML source as a String.
Dim sgmlReader As New Sgml.SgmlReader()
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)

' Load the cleaned-up HTML into a LINQ to XML document
Dim htmldoc As XDocument = XDocument.Load(sgmlReader)
Dim XNS As XNamespace

' Reading the default namespace sometimes fails, so fall back to XHTML
Try
    XNS = htmldoc.Root.GetDefaultNamespace()
Catch
    XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim() = "" Then
    XNS = "http://www.w3.org/1999/xhtml"
End If

' Query the document with LINQ, e.g. concatenate the contents of all script elements
Dim Scripts As String = ""
For Each link In htmldoc.Descendants(XNS + "script")
    Scripts &= link.Value
Next

Majestic-12 , "". dll.

Jim brown · Answer 3 · 2012-01-19T16:58:15+0000

As already mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found an HtmlAgilityPack extension called Fizzler: http://code.google.com/p/fizzler/ Using this, you can find all <p> tags with:

var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();

Or find a specific div like <div id="myDiv"></div>:

var myDiv = doc.DocumentNode.QuerySelectorAll("#myDiv");

It couldn't be easier!
