How to parse this piece of HTML?

Good morning! I am using C# (.NET Framework 3.5 SP1) and want to parse the following HTML fragment using regex:

<h1>My caption</h1>
<p>Here will be some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

I need the following output:

  • group 1: h1 content
  • group 2: the text that follows the h1
  • group 3-n: each subtitle (h2) together with its text

What I have at the moment:

<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>

This gives me only every odd subtitle and its content (1, 3, ...), because the closing <hr.*?/> consumes the <hr/> that starts the next section. For the h1 caption I have another pattern (<h1.*?>(.*?)</h1>) that gives me only the title but not the content; I'm fine with that for now.

Does anyone have a hint or solution for me, or an alternative approach (e.g. parsing the HTML with a reader and assigning the parts that way)?
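The skipped sections described above come from the trailing <hr.*?/> being consumed by each match. A minimal sketch of one way around that, assuming the fragment always has the <hr/> + <h2> layout shown in the question, is to replace the trailing <hr> with a lookahead, which checks for the next <hr> without consuming it. The class name SectionRegex is hypothetical, and this only illustrates the regex behaviour; the answers below recommend a real parser instead.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SectionRegex
{
    // One match per <hr/> + <h2> section. The lookahead (?=<hr|\z) requires
    // the next <hr (or the end of input) to follow, but does not consume it,
    // so the next match can start at that <hr>.
    private static readonly Regex Section = new Regex(
        @"<hr[^>]*/>\s*<h2[^>]*>(.*?)</h2>(.*?)(?=<hr|\z)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns (caption, body) pairs in document order.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Section.Matches(html))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value.Trim(),   // h2 caption
                m.Groups[2].Value.Trim())); // markup up to the next <hr>
        }
        return result;
    }
}
```

RegexOptions.Singleline lets `.` match newlines, which takes over the role of the `[\W\S]*?` group in the original pattern.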


HTMLAgilityPack, . <h1> -tag.
... myproblem . : - <p> <div> <ul>... atm ...? ?


Don't use regex to parse HTML. Use the HTML Agility Pack.
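A rough sketch of what the grouping from the question could look like with the HTML Agility Pack, assuming the package is referenced and that the headings are top-level siblings as in the posted fragment. The Section and SectionParser names are hypothetical, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class Section
{
    public string Caption;
    public string Text = "";
}

public static class SectionParser
{
    // Walks the top-level nodes in order: every <h1>/<h2> starts a new
    // section, and every other element (<p>, <div>, <ul>, ...) is appended
    // to the section that is currently open. <hr> separators are ignored.
    public static List<Section> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sections = new List<Section>();
        Section current = null;
        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            // Skip whitespace text nodes and comments between the elements.
            if (node.NodeType != HtmlNodeType.Element) continue;

            if (node.Name == "h1" || node.Name == "h2")
            {
                current = new Section { Caption = node.InnerText.Trim() };
                sections.Add(current);
            }
            else if (node.Name != "hr" && current != null)
            {
                current.Text += node.InnerText.Trim() + "\n";
            }
        }
        return sections;
    }
}
```

With the fragment from the question, the first entry would hold the h1 caption and its text, and each following entry one h2 caption with its text. Swap InnerText for OuterHtml if the markup itself is needed.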


Options:

  • REGEX
  • HtmlAgilityPack
  • SGMLReader: http://developer.mindtouch.com/SgmlReader
  • Majestic-12: http://www.majestic12.co.uk/projects/html_parser.php

SGMLReader example (VB.NET):

' Requires Imports System.Xml.Linq and a reference to the SgmlReader assembly.
' vSource is the string holding the HTML source to parse.
Dim sgmlReader As New Sgml.SgmlReader()
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)
Dim htmldoc As XDocument = XDocument.Load(sgmlReader)

' Reading the default namespace can fail in some cases,
' so fall back to the XHTML namespace.
Dim XNS As XNamespace
Try
    XNS = htmldoc.Root.GetDefaultNamespace()
Catch
    XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim() = "" Then
    XNS = "http://www.w3.org/1999/xhtml"
End If

' Use it with the LINQ commands, e.g. collect the text of all <script> elements.
Dim Scripts As String = ""
For Each link In htmldoc.Descendants(XNS + "script")
    Scripts &= link.Value
Next

Majestic-12 , "". dll.


As already mentioned, use the HtmlAgilityPack. However, if you like jQuery / CSS selectors, I just found an HtmlAgilityPack add-on called Fizzler: http://code.google.com/p/fizzler/ Using this, you can find all the <p> tags using:

var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();

Or find a specific div like <div id="myDiv"></div>:

var myDiv = doc.DocumentNode.QuerySelectorAll("#myDiv");

It couldn't be easier!
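For completeness, a small self-contained sketch of the selector calls above, assuming both the HtmlAgilityPack and Fizzler packages are referenced. The QuerySelectorAll extension method only becomes visible once the Fizzler.Systems.HtmlAgilityPack namespace is imported. CaptionFinder and SubCaptions are hypothetical names used for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // brings QuerySelector / QuerySelectorAll into scope

public static class CaptionFinder
{
    // Returns the trimmed text of every <h2> in the document, in order.
    public static List<string> SubCaptions(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .QuerySelectorAll("h2")
                  .Select(h2 => h2.InnerText.Trim())
                  .ToList();
    }
}
```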


Source: https://habr.com/ru/post/1729034/

