Parsing coded tags in a Ruby XML document using Nokogiri and regex

Question

I am trying to parse XML that has tags embedded inside other tags, using Nokogiri and Ruby. For example:

<seg>Trennmesser <ph>&lt;I.FIGREF ITEM=&quot;3&quot; FORMAT=&quot;PARENTHESIS&quot;&gt;</ph><bpt i="1">&lt;I.FIGTARGET TARGET=&quot;CIADDAJA&quot;&gt;</bpt><ept i="1">&lt;/I.FIGREF&gt;</ept></seg>

In this case, I only need the word "Trennmesser", which is outside the embedded tags.

In this second example:

 <seg>Hilfsmittel <ph>&lt; F34@Z7 @Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen Beschleunigerwalze <ph>&lt; F34@Z7 @Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>

The words between a closing </ph> tag and the next opening <ph> tag are also of interest, so the regular expression needs to extract the line "Hilfsmittel 0,5mm zwischen Beschleunigerwalze und Trennmesser schieben." and discard everything else.

I also added part of the document here:
http://pastebin.com/Q8CdnASz

ruby xml ruby-on-rails parsing nokogiri

Vince Dec 24 '11 at 9:40

2 answers

Cyberfox · Answer 1 · 2011-12-24T10:07:19+0000

Try this in irb:

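 require 'nokogiri'

 doc = Nokogiri::XML.parse('<seg>Hilfsmittel <ph>&lt; F34@Z7 @Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen Beschleunigerwalze <ph>&lt; F34@Z7 @Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>')
 # Drop the element children of <seg> (the <ph> tags), then join the content of what is left.
 # Array#join ignores a block, so the content is collected with map first.
 doc.xpath('//seg').children.reject { |node| node.element? }.map { |node| node.content }.join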

For me it outputs:

 => "Hilfsmittel X = 0,5mm zwischen Beschleunigerwalze D und Trennmesser schieben."

The idea is to iterate over the children of <seg>, rejecting those that are elements (<ph>), which leaves only the text nodes. The content of the remaining nodes is then concatenated into one string.

Note that the result differs slightly from the one you described, because the additional X and D sit between the <ph> tags as plain text.

the tin man · Answer 2 · 2012-09-21T16:53:10+0000

The content inside the <ph> tags has been encoded to preserve the reserved < and > characters.

A clean way to handle this is to let Nokogiri re-parse these pieces as XML:

 require 'nokogiri'

 doc = Nokogiri::XML('<seg>Trennmesser <ph>&lt;I.FIGREF ITEM=&quot;3&quot; FORMAT=&quot;PARENTHESIS&quot;&gt;</ph><bpt i="1">&lt;I.FIGTARGET TARGET=&quot;CIADDAJA&quot;&gt;</bpt><ept i="1">&lt;/I.FIGREF&gt;</ept></seg>')
 ph = Nokogiri::XML::DocumentFragment.parse(doc.at('seg ph').content)
 puts ph.to_xml

This displays the following node, showing that Nokogiri correctly recreated the fragment:

 <I.FIGREF ITEM="3" FORMAT="PARENTHESIS"/>

To extract the text inside the <seg>:

 doc.at('//seg/text()').text
 # => "Trennmesser "

When working with HTML or XML, you should never assume that regex is the best way to extract anything. Both HTML and XML are too irregular and “flexible” (meaning that documents are often malformed, or structured in unique and unexpected ways).

To get the full content inside the <seg> in the second example:

 require 'nokogiri'

 doc = Nokogiri::XML('<seg>Hilfsmittel <ph>&lt; F34@Z7 @Lge&gt;</ph>X <ph>&lt;F0&gt;</ph>= 0,5mm zwischen Beschleunigerwalze <ph>&lt; F34@Z7 @Lge&gt;</ph>D<ph>&lt;F0&gt;</ph> und Trennmesser schieben.</seg>')
 seg = Nokogiri::XML::DocumentFragment.parse(doc.at('seg').content)
 puts seg.content

Which outputs:

 Hilfsmittel @ Z7@Lge >X = 0,5mm zwischen Beschleunigerwalze @ Z7@Lge >D und Trennmesser schieben.
