Extract HTML parts using groovy

Question

I need to extract part of an HTML page. So far, I'm using XmlSlurper with TagSoup to parse the page, and then trying to get the part I need using StreamingMarkupBuilder:

    import groovy.xml.StreamingMarkupBuilder

    def html = "<html><body>a <b>test</b></body></html>"
    def dom = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(html)
    println new StreamingMarkupBuilder().bindNode(dom.body)

However, I get this result:

 <html:body xmlns:html='http://www.w3.org/1999/xhtml'>a <html:b>test</html:b></html:body>

which looks great, but I would like to get it without the html namespace.

How can I avoid the namespace?

html groovy xmlslurper

rdmueller Apr 25 '11 at 15:55

1 answer

ataylor · Accepted Answer · 2011-04-25T17:39:54+0000

Disable the namespaces feature in the TagSoup parser before passing it to XmlSlurper. Example:

    import groovy.xml.StreamingMarkupBuilder

    def html = "<html><body>a <b>test</b></body></html>"
    def parser = new org.ccil.cowan.tagsoup.Parser()
    parser.setFeature(parser.namespacesFeature, false)
    def dom = new XmlSlurper(parser).parseText(html)
    println new StreamingMarkupBuilder().bindNode(dom.body)
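
For anyone who wants to run this as a standalone script, here is a minimal sketch of the same approach wrapped in a helper. Several things here are assumptions rather than part of the answer: the helper name extractBody is hypothetical, the @Grab coordinates assume TagSoup 1.2.1 from Maven Central, and the explicit groovy.xml.XmlSlurper import assumes Groovy 3 or later (on older versions, drop that import and rely on the default groovy.util.XmlSlurper). The literal feature string is the standard SAX namespaces feature URI, which should be equivalent to parser.namespacesFeature.

```groovy
// Assumption: TagSoup 1.2.1 is resolvable from Maven Central via Grape.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.StreamingMarkupBuilder
import groovy.xml.XmlSlurper   // Groovy 3+; remove this import on Groovy 2.x
import org.ccil.cowan.tagsoup.Parser

// Hypothetical helper: parse lenient HTML and return the <body> markup as a String.
String extractBody(String html) {
    def parser = new Parser()
    // Standard SAX feature URI; turning it off stops TagSoup from reporting
    // elements in the XHTML namespace, so no html: prefix is emitted.
    parser.setFeature('http://xml.org/sax/features/namespaces', false)
    def dom = new XmlSlurper(parser).parseText(html)
    // bindNode returns a Writable; toString() renders it to markup.
    return new StreamingMarkupBuilder().bindNode(dom.body).toString()
}

println extractBody('<html><body>a <b>test</b></body></html>')
```

Returning a String instead of printing directly makes it easier to check the result or post-process it, for example to confirm that no xmlns declaration is left in the output.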
