Finding Sequential Siblings with XPath

Here's an easy one for an XPath expert! :)

Document structure:

    <tokens>
      <token>
        <word>Newt</word><entityType>PROPER_NOUN</entityType>
      </token>
      <token>
        <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
      </token>
      <token>
        <word>admires</word><entityType>VERB</entityType>
      </token>
      <token>
        <word>Garry</word><entityType>PROPER_NOUN</entityType>
      </token>
      <token>
        <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
      </token>
    </tokens>

Ignoring the semantic improbability of the document, I want to pull out [["Newt", "Gingrich"], ["Garry", "Trudeau"]], that is: whenever two tokens in a row both have the entityType PROPER_NOUN, I want to extract the words from those two tokens.

I got to:

 "//token[entityType='PROPER_NOUN']/following-sibling::token[1][entityType='PROPER_NOUN']" 

... which finds the second of two consecutive PROPER_NOUN tokens, but I'm not sure how to make it emit the first token along with it.

Some notes:

  • I don't mind post-processing the NodeSets in a higher-level language (e.g. Ruby / Nokogiri) if that simplifies the problem.
  • If there are three or more consecutive PROPER_NOUN tokens (call them A, B, C), ideally I would like to emit [A, B], [B, C].

Update

Here is my solution using higher-level Ruby functions. But I'm tired of all those XPath bullies kicking sand in my face, and I'd like to know how REAL XPath coders do it!

    def extract(doc)
      names = []
      sentences = doc.xpath("//tokens")
      sentences.each do |sentence|
        tokens = sentence.xpath("token")
        prev = nil
        tokens.each do |token|
          name = token.xpath("word").text if token.xpath("entityType").text == "PROPER_NOUN"
          names << [prev, name] if (name && prev)
          prev = name
        end
      end
      names
    end
4 answers

I would do it in two steps. The first step is to select a set of nodes:

 //token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']] 

This gives you all the tokens that start a two-word pair. Then, to get the actual pair, iterate over the node list and extract ./word and following-sibling::token[1]/word from each node.

Using XmlStarlet (http://xmlstar.sourceforge.net/, a great tool for quickly processing XML files), the command line

 xml sel -t -m "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]" -v word -o "," -v "following-sibling::token[1]/word" -n /tmp/tok.xml 

gives

    Newt,Gingrich
    Garry,Trudeau

XmlStarlet can also compile this command line into XSLT; the relevant part is:

    <xsl:for-each select="//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]">
      <xsl:value-of select="word"/>
      <xsl:value-of select="','"/>
      <xsl:value-of select="following-sibling::token[1]/word"/>
      <xsl:value-of select="'&#10;'"/>
    </xsl:for-each>

Using Nokogiri, it might look something like this:

    require 'nokogiri'

    # parse the document
    doc = Nokogiri::XML(the_document_string)

    # select all tokens that start a two-word pair
    pair_starts = doc.xpath '//token[entityType = "PROPER_NOUN" and following-sibling::token[1][entityType = "PROPER_NOUN"]]'

    # extract each word and the following one
    result = pair_starts.each_with_object([]) do |node, array|
      array << [node.at_xpath('word').text, node.at_xpath('following-sibling::token[1]/word').text]
    end

This XPath 1.0 expression:

    /*/token[entityType='PROPER_NOUN' and following-sibling::token[1]/entityType = 'PROPER_NOUN']/word

selects the words of all proper nouns that are "first in a pair".

This XPath expression:

    /*/token[entityType='PROPER_NOUN' and preceding-sibling::token[1]/entityType = 'PROPER_NOUN']/word

selects the words of all proper nouns that are "second in a pair".

You then need to build the actual pairs by taking the k-th node from each of the two resulting node-sets.
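The pairing step can be sketched in Ruby with Array#zip. This is only a minimal illustration: the two arrays are hard-coded stand-ins for the text of the two node-sets (values taken from the sample document), so it runs without an XML library.

```ruby
# Words returned by the two XPath expressions, hard-coded here as an
# assumption so the sketch needs no XML parser.
firsts  = ["Newt", "Garry"]        # "first in a pair" words
seconds = ["Gingrich", "Trudeau"]  # "second in a pair" words

# Array#zip combines the k-th element of one array with the k-th of the other.
# This lines up correctly because both node-sets come back in document order
# and every "first" token has exactly one "second" token.
pairs = firsts.zip(seconds)

p pairs
```

With Nokogiri, the two arrays could presumably be obtained with something like doc.xpath(expression).map(&:text) before zipping.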

XSLT-based verification:

    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output omit-xml-declaration="yes" indent="yes"/>

      <xsl:template match="/">
        <xsl:copy-of select=
         "/*/token[entityType='PROPER_NOUN' and following-sibling::token[1]/entityType = 'PROPER_NOUN']/word"/>
    ==============
        <xsl:copy-of select=
         "/*/token[entityType='PROPER_NOUN' and preceding-sibling::token[1]/entityType = 'PROPER_NOUN']/word"/>
      </xsl:template>
    </xsl:stylesheet>

This transformation simply evaluates the two XPath expressions and outputs the results of the two evaluations (with a separator to show where the first result ends and the second begins).

When applied to the provided XML document:

    <tokens>
      <token>
        <word>Newt</word><entityType>PROPER_NOUN</entityType>
      </token>
      <token>
        <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
      </token>
      <token>
        <word>admires</word><entityType>VERB</entityType>
      </token>
      <token>
        <word>Garry</word><entityType>PROPER_NOUN</entityType>
      </token>
      <token>
        <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
      </token>
    </tokens>

the result is:

    <word>Newt</word>
    <word>Garry</word>
    ==============
    <word>Gingrich</word>
    <word>Trudeau</word>

and zipping the two results (which you would do in your favorite PL) gives:

 ["Newt", "Gingrich"] 

and

 ["Garry", "Trudeau"] 

When the same transformation is applied to this XML document (note that it now contains one triple):

    <tokens>
      <token>
        <word>Newt</word><entityType>PROPER_NOUN</entityType>
      </token>
      <token>
        <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
      </token>
      <token>
        <word>Rep</word><entityType>PROPER_NOUN</entityType>
      </token>
      <token>
        <word>admires</word><entityType>VERB</entityType>
      </token>
      <token>
        <word>Garry</word><entityType>PROPER_NOUN</entityType>
      </token>
      <token>
        <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
      </token>
    </tokens>

the result is now:

    <word>Newt</word>
    <word>Gingrich</word>
    <word>Garry</word>
    ==============
    <word>Gingrich</word>
    <word>Rep</word>
    <word>Trudeau</word>

and zipping the two results gives the correct, desired final result:

    ["Newt", "Gingrich"], ["Gingrich", "Rep"]

and

 ["Garry", "Trudeau"] 

Note:

The desired result can be obtained using a single XPath 2.0 expression. Let me know if you are interested in XPath 2.0.


XPath returns a node or a set of nodes, but it cannot return groups. So you need to find the start of each group and then fetch the rest from there.

    first = "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]/word"
    # `next` is a reserved word in Ruby, so the relative path is named `following`
    following = "../following-sibling::token[1]/word"
    doc.xpath(first).map { |word| [word.text, word.xpath(following).text] }

Output:

 [["Newt", "Gingrich"], ["Garry", "Trudeau"]] 

XPath alone is not powerful enough for this task. But it is very easy in XSLT 2.0:

    <xsl:for-each-group select="token" group-adjacent="entityType">
      <xsl:if test="current-grouping-key() = 'PROPER_NOUN'">
        <xsl:copy-of select="current-group()"/>
        <xsl:text>====</xsl:text>
      </xsl:if>
    </xsl:for-each-group>
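For comparison, the same adjacent-grouping idea can be sketched in plain Ruby with Enumerable#chunk_while. This is a rough analogue rather than a translation of the stylesheet: the tokens array is a hypothetical stand-in for the parsed document (it uses the three-name variant of the sample), and the final each_cons(2) step is an addition that turns each group into the overlapping pairs the question asks for.

```ruby
# Hypothetical stand-in for the parsed document: [word, entityType] pairs.
tokens = [
  ["Newt", "PROPER_NOUN"], ["Gingrich", "PROPER_NOUN"], ["Rep", "PROPER_NOUN"],
  ["admires", "VERB"],
  ["Garry", "PROPER_NOUN"], ["Trudeau", "PROPER_NOUN"]
]

# Group adjacent tokens that share an entityType (similar to group-adjacent).
groups = tokens.chunk_while { |a, b| a[1] == b[1] }

# Keep only the PROPER_NOUN groups and reduce each token to its word.
noun_groups = groups
  .select { |group| group.first[1] == "PROPER_NOUN" }
  .map { |group| group.map(&:first) }

# A sliding window of size 2 inside each group yields the overlapping pairs;
# a group with a single word contributes no pair.
pairs = noun_groups.flat_map { |words| words.each_cons(2).to_a }

p noun_groups
p pairs
```

Here noun_groups corresponds to what the stylesheet copies out group by group, while pairs gives [A, B], [B, C] for a run of three names.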

Source: https://habr.com/ru/post/1434307/

