The content inside the <ph> tags has been encoded to preserve the reserved < and > characters.
A clean way to handle this is to let Nokogiri re-parse those pieces as XML:
    require 'nokogiri'

    doc = Nokogiri::XML('<seg>Trennmesser <ph>&lt;I.FIGREF ITEM="3" FORMAT="PARENTHESIS"&gt;</ph><bpt i="1">&lt;I.FIGTARGET TARGET="CIADDAJA"&gt;</bpt><ept i="1">&lt;/I.FIGREF&gt;</ept></seg>')

    # The content of <ph> is the decoded markup; re-parse it as an XML fragment.
    ph = Nokogiri::XML::DocumentFragment.parse(doc.at('seg ph').content)
    puts ph.to_xml
Which displays the following node, showing that Nokogiri correctly recreated this fragment:
    <I.FIGREF ITEM="3" FORMAT="PARENTHESIS"/>
To extract the text inside the <seg>:

    doc.at('//seg/text()').text # => "Trennmesser "
When working with HTML or XML, never assume a regex is the best way to extract something. Both HTML and XML are too irregular and "flexible" (where flexible means the markup is often annoyingly malformed, or written in unique and unexpected ways).
To get the full content inside the <seg> in the second question:
    require 'nokogiri'

    doc = Nokogiri::XML('<seg>Hilfsmittel <ph>&lt; F34@Z7 @Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen Beschleunigerwalze <ph>&lt; F34@Z7 @Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>')

    # Re-parse the decoded content of <seg> as a fragment, then read its text.
    seg = Nokogiri::XML::DocumentFragment.parse(doc.at('seg').content)
    puts seg.content
Which outputs:
Hilfsmittel @ Z7@Lge >X = 0,5mm zwischen Beschleunigerwalze @ Z7@Lge >D und Trennmesser schieben.