XPath to find all following siblings up until the next sibling of a particular type

Question

Given this XML / HTML:

    <dl>
      <dt>Label1</dt><dd>Value1</dd>
      <dt>Label2</dt><dd>Value2</dd>
      <dt>Label3</dt><dd>Value3a</dd><dd>Value3b</dd>
      <dt>Label4</dt><dd>Value4</dd>
    </dl>

I want to find all <dt> elements and then, for each one, find the following <dd> elements up until the next <dt>.

Using Ruby Nokogiri I can do it like this:

    dl.xpath('dt').each do |dt|
      ct  = dt.xpath('count(following-sibling::dt)')
      dds = dt.xpath("following-sibling::dd[count(following-sibling::dt)=#{ct}]")
      puts "#{dt.text}: #{dds.map(&:text).join(', ')}"
    end
    #=> Label1: Value1
    #=> Label2: Value2
    #=> Label3: Value3a, Value3b
    #=> Label4: Value4

However, as you can see, I create a variable in Ruby and then build the XPath expression from it. How can I write a single XPath expression that does the equivalent?

I guessed:

 following-sibling::dd[count(following-sibling::dt)=count(self/following-sibling::dt)]

but apparently I don't understand what self means.

This question is similar to "XPath : select all following siblings until another sibling", except that there is no unique identifier for the "stop" node.

This question is almost the same as "xpath to find all following sibling adjacent nodes up til another type", except that I am asking for an XPath-only solution.

ruby xml xpath nokogiri

Phrogz Dec 13 '11 at 16:08

2 answers

This is an interesting question. Most of the issues have already been mentioned in @lwburk's answer and in its comments. To lay out for the casual reader the complexity hidden in this question, my answer is probably more elaborate and more verbose than the OP needs.

XPath 1.0 Features Related to This Problem

In XPath, each step, and each node in the set of selected nodes, is evaluated independently. This means that

a subexpression has no general way to access data that was computed in a previous subexpression, or to share data computed in this subexpression with other subexpressions
a node has no general way to refer to a node that was used as the context node in a previous subexpression
a node has no general way to refer to the other nodes that are currently selected
if every selected node must be compared with the same particular node, then that node must be uniquely identifiable in a way that is common to all the selected nodes

(Well, actually, I'm not 100% sure that this list is absolutely correct in every case. If someone knows the quirks of XPath better, please comment on or correct this answer by editing it.)

Despite the lack of general solutions, some of these limitations can be overcome if the structure of the document is known well enough, and/or if the previously used axis can be "reverted" with another axis that serves as a backlink, i.e. one that matches only the nodes that were used as the context node in the previous expression. A common example is using the parent axis after first using the child axis (the opposite case, stepping first from a child to its parent, is not uniquely reversible without additional information). In such cases the information from the previous steps is, more precisely, recreated at a later step rather than accessed as previously known information.

Unfortunately, in this case, I could not come up with any other solution for referencing previously known nodes, except for using XPath variables (which must be defined in advance).

XPath specifies the syntax for referencing a variable, but does not specify the syntax for defining variables; the way that variables are defined depends on the environment in which XPath is used. In fact, since the recommendation says that "the variable bindings used to evaluate a subexpression are always the same as those used to evaluate a containing expression," you can also argue that XPath explicitly prohibits the definition of variables inside an XPath expression.

Reformulated problem

In your question, the problem is, given a <dt>, to identify either the following <dd> elements or the originally given node after the context node has changed. Identifying the originally given <dt> is critical because, for each node in the node-set being filtered, the predicate expression is evaluated with that node as the context node; therefore you cannot refer to the original <dt> inside a predicate unless there is a way to identify it after the context has changed. The same applies to the <dd> elements that are following siblings of the given <dt>.

If you use variables, one could debate whether there is a significant difference between 1) using XPath variable syntax together with a Nokogiri-specific way of declaring that variable, and 2) using Nokogiri's extended XPath syntax, which allows you to use Ruby variables in an XPath expression. In both cases the variable is defined in an environment-specific way, and the meaning of the XPath is clear only if the variable definition is also available. A similar situation arises in XSLT, where in some cases you can choose between 1) defining a variable with <xsl:variable> before using your XPath expression and 2) using current() inside the XPath expression, which is an XSLT extension.

Solution using node-set variables and the Kayessian method

You can select all <dd> elements following the current <dt> element with following-sibling::dd (set A). You can also select all <dd> elements following the next <dt> element with following-sibling::dt[1]/following-sibling::dd (set B). The set difference A \ B (the elements that are in set A but not in set B) then leaves exactly the <dd> elements you wanted. If the variable $setA contains the nodes of A and the variable $setB contains the nodes of B, the set difference can be obtained with (a modification of) the Kayessian technique:

 dds = $setA[count(.|$setB) != count($setB)]
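
The A \ B idea can be tried out with a minimal sketch like the one below. It rests on two assumptions that are not part of the answer: it uses REXML (which ships with Ruby) instead of Nokogiri so that it runs without extra gems, and it takes the difference in Ruby by node identity rather than through the Kayessian XPath expression, so it shows which nodes that expression is meant to select rather than evaluating the expression itself. The helper name dds_until_next_dt is hypothetical.

```ruby
require 'rexml/document'

SAMPLE_XML = <<~DOC
  <dl>
    <dt>Label1</dt><dd>Value1</dd>
    <dt>Label2</dt><dd>Value2</dd>
    <dt>Label3</dt><dd>Value3a</dd><dd>Value3b</dd>
    <dt>Label4</dt><dd>Value4</dd>
  </dl>
DOC

dl = REXML::Document.new(SAMPLE_XML).root

# Hypothetical helper: returns the <dd> siblings between dt and the next <dt>.
def dds_until_next_dt(dt)
  # Set A: every <dd> that follows this <dt>.
  set_a = REXML::XPath.match(dt, 'following-sibling::dd')
  # Set B: every <dd> that follows the next <dt> (empty for the last <dt>).
  set_b = REXML::XPath.match(dt, 'following-sibling::dt[1]/following-sibling::dd')
  # A \ B, compared by node identity so duplicate texts do not matter.
  set_a.reject { |node| set_b.any? { |other| other.equal?(node) } }
end

pairs = REXML::XPath.match(dl, 'dt').map do |dt|
  [dt.text, dds_until_next_dt(dt).map(&:text)]
end

pairs.each { |label, values| puts "#{label}: #{values.join(', ')}" }
```

With Nokogiri the same two XPath strings can be passed to dt.xpath(...); only the surrounding library calls differ.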

The simplest workaround without any variables

Currently, your approach is to select all the <dt> elements and then try to associate each of them with the values of its corresponding <dd> elements in a single operation. Could this mapping logic be turned around? You would first select all the <dd> elements and then, for each <dd>, find the corresponding <dt>. This means that you visit the same <dt> element several times, and each operation adds only one new <dd> value. This may affect performance, and the Ruby code may become more complex.

The good side is the simplicity of the required XPath. Given a <dd> element, finding the appropriate <dt> is surprisingly simple: preceding-sibling::dt[1]

Adapting your current Ruby code:

    dl.xpath('dd').each do |dd|
      dt = dd.xpath("preceding-sibling::dt[1]")
      ## Insert new Ruby magic here ##
    end
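
One possible way to fill in the loop body is sketched below. It is only an illustration under assumptions of mine: REXML (bundled with Ruby) stands in for Nokogiri so the example runs without extra gems, and the values are collected in a hash keyed by the <dt> node itself (compared by identity), so two <dt> elements with the same text would still be kept apart. The variable name grouped is hypothetical.

```ruby
require 'rexml/document'

SAMPLE_XML = <<~DOC
  <dl>
    <dt>Label1</dt><dd>Value1</dd>
    <dt>Label2</dt><dd>Value2</dd>
    <dt>Label3</dt><dd>Value3a</dd><dd>Value3b</dd>
    <dt>Label4</dt><dd>Value4</dd>
  </dl>
DOC

dl = REXML::Document.new(SAMPLE_XML).root

# Keys are <dt> nodes compared by identity; values collect the <dd> texts.
grouped = Hash.new { |hash, key| hash[key] = [] }.compare_by_identity

REXML::XPath.match(dl, 'dd').each do |dd|
  # The nearest preceding <dt> is the label this <dd> belongs to.
  dt = REXML::XPath.first(dd, 'preceding-sibling::dt[1]')
  grouped[dt] << dd.text
end

# Ruby hashes keep insertion order, so labels appear in document order.
grouped.each { |dt, values| puts "#{dt.text}: #{values.join(', ')}" }
```

With Nokogiri, dd.at_xpath("preceding-sibling::dt[1]") plays the role of REXML::XPath.first here.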

jasso Dec 14 '11 at 17:06

Wayne burkett · Accepted Answer · 2011-12-13T22:15:50+0000

One possible solution:

    dl.xpath('dt').each_with_index do |dt, i|
      dds = dt.xpath("following-sibling::dd[not(../dt[#{i + 2}]) or " +
                     "following-sibling::dt[1]=../dt[#{i + 2}]]")
      puts "#{dt.text}: #{dds.map(&:text).join(', ')}"
    end

This relies on comparing the values of the dt elements, so it fails when there are duplicates. The following (much more complex) expression does not depend on unique dt values; here $n presumably stands for the same index as #{i + 2} above:

 following-sibling::dd[not(../dt[$n]) or (following-sibling::dt[1] and count(following-sibling::dt[1]|../dt[$n])=1)]

Note: Your use of self fails because you are not using it properly as an axis (self::). In addition, self always contains only the context node, so it would refer to each dd examined by the expression, not to the original dt.
