Good morning! I am using C# (.NET Framework 3.5 SP1) and want to parse the following HTML fragment using a regex:
<h1>My caption</h1>
<p>Here will be some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
I need the following output:
- group 1: h1 content
- group 2: the text that follows the h1
- group 3-n: each subtitle (h2) plus its text
What I have at the moment:
<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>
This gives me only every odd subtitle and its content (1, 3, ...), because the trailing <hr/> is consumed by each match, so the next match cannot start on it. For parsing the h1 caption I have another pattern (<h1.*?>(.*?)</h1>) that gives me only the title, but not the content; I'm fine with this for now.
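As an illustration of the problem described above, here is a minimal sketch of one possible way around it: the trailing <hr/> is replaced by a lookahead, which checks for the next <hr (or the end of the input) without consuming it. The class and method names (SectionParser, ParseSections) are hypothetical and exist only for this example, and the pattern assumes that the markup always looks like the fragment shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SectionParser
{
    // Hypothetical helper: returns one (caption, body) pair per <h2> section.
    public static List<KeyValuePair<string, string>> ParseSections(string html)
    {
        // (?=<hr|\z) is a lookahead: it requires the next <hr (or the end of
        // the input) to follow, but does not consume it, so the next match
        // can still start at that <hr>.
        Regex pattern = new Regex(
            @"<hr[^>]*/>\s*<h2[^>]*>(.*?)</h2>(.*?)(?=<hr|\z)",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        var sections = new List<KeyValuePair<string, string>>();
        foreach (Match m in pattern.Matches(html))
        {
            // Group 1 is the subtitle, group 2 is everything up to the next <hr>.
            sections.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value.Trim(),
                m.Groups[2].Value.Trim()));
        }
        return sections;
    }
}
```

Calling SectionParser.ParseSections on the fragment above should yield one entry per <h2> block instead of every second one. This is still a regex over HTML, so it is only as robust as the assumption that the structure does not change.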
Does anyone have a hint or a solution for me, or an alternative approach (e.g. parsing the HTML with a reader and assigning the parts that way)?
HTMLAgilityPack, . <h1> -tag.
... myproblem . : - <p> <div> <ul>...
atm ...?
?