I am using .NET 3.5 (C#) and the HTML Agility Pack to write some web scrapers. Some of the fields I need to extract are structured as paragraphs in which the components are separated by line break tags. I would like to be able to select the individual components between the breaks. Each component may consist of several nodes (i.e., it may be more than a single text string). Example:
<h3>Section title</h3>
<p>
<b>Component A</b><br />
Component B <i>includes</i> <strong>multiple elements</strong><br />
Component C
</p>
I would like to select:
<b>Component A</b>
Then:
Component B <i>includes</i> <strong>multiple elements</strong>
And then:
Component C
There may also be more (<br />-separated) components.
I can easily get the first component with:
p/br[1]/preceding-sibling::node()
I can also easily get the last component with:
p/br[2]/following-sibling::node()
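What I cannot work out is the XPath for selecting the set of nodes between two given nodes (i.e., between node X and node Y).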
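Each component is the set of child nodes that are not br elements and that share the same number of preceding br siblings, so you can loop over that count: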
int i = 0;
do
{
    yield return para.SelectNodes(String.Format(
        "node()[not(self::br) and count(preceding-sibling::br) = {0}]", i));
    ++i;
} while (para.SelectSingleNode(String.Format("br[{0}]", i)) != null);
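For each i, the XPath selects the child nodes that are not br elements and that have exactly i preceding br siblings. The loop continues for as long as an i-th br exists, so the component after the last break is included as well.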
Complete test program (with credit to AakashM):
using System;
using System.Collections.Generic;
using System.Xml;

namespace TestXPath
{
    class Program
    {
        static void Main(string[] args)
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(@"
                <x>
                  <h3>Section title</h3>
                  <p>
                    <b>Component A</b><br />
                    Component B <i>includes</i> multiple <strong>elements</strong><br />
                    Component C
                  </p>
                </x>
                ");

            // Print each br-separated component, with a blank line between them.
            foreach (var nodes in SplitOnLineBreak(doc.SelectSingleNode("x/p")))
            {
                Dump(nodes);
                Console.WriteLine();
            }
            Console.ReadLine();
        }

        // Yields one node list per component: the non-br children of para
        // that have exactly i preceding br siblings, for i = 0, 1, 2, ...
        private static IEnumerable<XmlNodeList> SplitOnLineBreak(XmlNode para)
        {
            int i = 0;
            do
            {
                yield return para.SelectNodes(String.Format(
                    "node()[not(self::br) and count(preceding-sibling::br) = {0}]", i));
                ++i;
                // Continue while an i-th br exists (XPath positions are 1-based).
            } while (para.SelectSingleNode(String.Format("br[{0}]", i)) != null);
        }

        // Writes the outer XML of each node on its own line.
        private static void Dump(XmlNodeList nodes)
        {
            foreach (XmlNode node in nodes)
            {
                Console.WriteLine(string.Format("-->{0}<---",
                    node.OuterXml));
            }
        }
    }
}
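If you do not need XPath for this step, a single pass over the child nodes gives the same grouping without re-evaluating a query per component. The following is only a sketch under some assumptions: BreakSplitter and its list-based SplitOnLineBreak are hypothetical names that do not appear in the code above, it uses plain XmlDocument rather than the HTML Agility Pack, and it matches the element name "br" exactly (real-world HTML may need a case-insensitive comparison).

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class BreakSplitter
{
    // Hypothetical helper: groups the children of para into runs
    // separated by <br> elements, in one pass and without XPath.
    public static List<List<XmlNode>> SplitOnLineBreak(XmlNode para)
    {
        var groups = new List<List<XmlNode>>();
        var current = new List<XmlNode>();
        foreach (XmlNode child in para.ChildNodes)
        {
            if (child.NodeType == XmlNodeType.Element && child.Name == "br")
            {
                // A break closes the current component and starts a new one.
                groups.Add(current);
                current = new List<XmlNode>();
            }
            else
            {
                current.Add(child);
            }
        }
        // The nodes after the last break (or all nodes, if there is no break).
        groups.Add(current);
        return groups;
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<p><b>Component A</b><br />Component B <i>includes</i> " +
                    "multiple <strong>elements</strong><br />Component C</p>");

        foreach (List<XmlNode> group in SplitOnLineBreak(doc.DocumentElement))
        {
            foreach (XmlNode node in group)
            {
                Console.Write(node.OuterXml);
            }
            Console.WriteLine();
        }
    }
}
```

Like the XPath loop, this returns a single group when the paragraph contains no br at all, and an empty group wherever two breaks are adjacent.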