This XPath 1.0 expression:
/*/token [entityType='PROPER_NOUN' and following-sibling::token[1]/entityType = 'PROPER_NOUN' ] /word
selects the words of all proper nouns that come first in a pair.
This XPath expression:
/*/token [entityType='PROPER_NOUN' and preceding-sibling::token[1]/entityType = 'PROPER_NOUN' ] /word
selects the words of all proper nouns that come second in a pair.
You then need to build the actual pairs yourself, by taking the k-th node from each of the two resulting node-sets.
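The pairing step can be sketched in a few lines of Python. This is only an illustration: the names `firsts` and `seconds` are made up here, and the values are the words from the sample document used in this answer. The two results always have the same length, because every token that is first in a pair has exactly one next token that is second in a pair, so the k-th items correspond.

```python
# Words selected by the first and by the second XPath expression,
# in document order (values from the sample document in this answer).
firsts = ["Newt", "Garry"]
seconds = ["Gingrich", "Trudeau"]

# Pair the k-th item of one result with the k-th item of the other.
pairs = [list(pair) for pair in zip(firsts, seconds)]

print(pairs)  # [['Newt', 'Gingrich'], ['Garry', 'Trudeau']]
```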
XSLT-based verification:
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/token
       [entityType='PROPER_NOUN'
      and
        following-sibling::token[1]/entityType = 'PROPER_NOUN'
       ]
        /word
   "/>
==============
  <xsl:copy-of select=
   "/*/token
       [entityType='PROPER_NOUN'
      and
        preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
       ]
        /word
   "/>
 </xsl:template>
</xsl:stylesheet>
This transformation simply evaluates the two XPath expressions and outputs the results of the two evaluations (with a separator between them, to show where the first result ends and the second begins).
When applied to the provided XML document:
<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>
the result is:
<word>Newt</word>
<word>Garry</word>
==============
<word>Gingrich</word>
<word>Trudeau</word>
and zipping the two results (which you would do in your favorite PL) gives:
["Newt", "Gingrich"]
and
["Garry", "Trudeau"]
When the same transformation is applied to this XML document (note that it now contains one triple):
<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Rep</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>
the result is now:
<word>Newt</word>
<word>Gingrich</word>
<word>Garry</word>
==============
<word>Gingrich</word>
<word>Rep</word>
<word>Trudeau</word>
and zipping the two results produces the correct, wanted final result:
["Newt", "Gingrich"], ["Gingrich", "Rep"],
and
["Garry", "Trudeau"]
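For completeness, here is a rough end-to-end sketch in Python that reproduces the result above for the document with the triple. It rests on one assumption worth stating loudly: the standard-library `xml.etree.ElementTree` module does not support the `following-sibling` and `preceding-sibling` axes, so the two XPath expressions are emulated with plain list indexing instead of being evaluated as written. The function name `proper_noun_pairs` is made up for this example.

```python
import xml.etree.ElementTree as ET

XML = """
<tokens>
  <token><word>Newt</word><entityType>PROPER_NOUN</entityType></token>
  <token><word>Gingrich</word><entityType>PROPER_NOUN</entityType></token>
  <token><word>Rep</word><entityType>PROPER_NOUN</entityType></token>
  <token><word>admires</word><entityType>VERB</entityType></token>
  <token><word>Garry</word><entityType>PROPER_NOUN</entityType></token>
  <token><word>Trudeau</word><entityType>PROPER_NOUN</entityType></token>
</tokens>
"""


def proper_noun_pairs(xml_text):
    """Return [first, second] word pairs of adjacent PROPER_NOUN tokens."""
    # Equivalent of /*/token: the token children of the top element.
    tokens = ET.fromstring(xml_text).findall("token")
    words = [t.findtext("word") for t in tokens]
    is_pn = [t.findtext("entityType") == "PROPER_NOUN" for t in tokens]
    n = len(tokens)

    # Emulates the first expression: the next sibling token is a proper noun.
    firsts = [words[i] for i in range(n - 1) if is_pn[i] and is_pn[i + 1]]
    # Emulates the second expression: the previous sibling token is a proper noun.
    seconds = [words[i] for i in range(1, n) if is_pn[i] and is_pn[i - 1]]

    # Zip: pair the k-th item of each result.
    return [list(pair) for pair in zip(firsts, seconds)]


result = proper_noun_pairs(XML)
print(result)
# [['Newt', 'Gingrich'], ['Gingrich', 'Rep'], ['Garry', 'Trudeau']]
```

With a library that implements full XPath 1.0, the two expressions could be evaluated directly and only the final zip would remain.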
Do note:
The wanted result can be produced with a single XPath 2.0 expression. Let me know if you are interested in an XPath 2.0 solution.