Sanitize gem (and Loofah) delete text before a colon inside tags

Question


I ran into some odd behavior with Loofah and Sanitize: while cleaning up some HTML fragments, I noticed that times like “6:30 pm” turn into “30 pm”.

After some investigation, I found the following:

    Loofah.scrub_fragment("<span>asdfasdf 6:30 pm</span>", :strip).to_html #=> "<span>asdfasdf 30 pm</span>"
    Loofah.scrub_fragment("6:30 pm", :strip).to_html #=> "6:30 pm"
    Loofah.scrub_fragment("<foo>asdfasdf 6&#58;30 pm</foo>", :strip).to_html #=> "asdfasdf 6:30 pm"
    Loofah.scrub_fragment("bar:30 pm", :strip).to_html #=> "bar:30 pm"
    Loofah.scrub_fragment("<span>bar:30 pm</span>", :strip).to_html #=> "<span>30 pm</span>"
    Loofah.scrub_fragment("<span>bar: asdfasdfadsf pm</span>", :strip).to_html #=> "<span>bar: asdfasdfadsf pm</span>"

This happens with every Loofah scrubber (:prune, etc.) and with Sanitize, so I assume the problem lies in code common to both. Is there anything special I need to do to escape colons in the markup before sanitizing?

Edit 1: I realize I forgot to mention that I am using JRuby (jruby 1.7.0 (1.9.3p203)). I'm trying to work out whether the problem might be in Nokogiri (which underlies both of these gems?).

Edit 2: After further digging, it seems this MAY be a problem in Nokogiri on JRuby (I'm on Nokogiri version 1.5.5, for what it's worth). I checked the Nokogiri fragment parser on JRuby and on Ruby 1.9.3:

JRuby 1.7.0: Unexpected Results

    doc = Nokogiri::HTML.fragment("<span>3:30pm</span>")
    => #(DocumentFragment:0x5fbc { name = "#document-fragment", children = [ #(Element:0x5fc0 { name = "span", children = [ #(Text "30pm")] })] })

Ruby 1.9.3: Expected Results

    doc = Nokogiri::HTML.fragment("<span>3:30pm</span>")
    => #(DocumentFragment:0x3fc4b102055c { name = "#document-fragment", children = [ #(Element:0x3fc4b101fff8 { name = "span", children = [ #(Text "3:30pm")] })] })

I'll keep digging, but any suggestions are welcome.


security ruby jruby sanitize nokogiri

nilsjesper Nov 16 '12 at 3:09


1 answer

Mark thomas · Accepted Answer · 2012-11-17T22:34:32+0000

I believe this is a regression in Nokogiri. I was able to reproduce your problem and tried it with several versions of Nokogiri.

It works correctly in 1.5.0:

    jruby-1.6.7.2 :002 > gem 'nokogiri', '=1.5.0'
     => true
    jruby-1.6.7.2 :003 > require 'nokogiri'
     => true
    jruby-1.6.7.2 :004 > doc = Nokogiri::HTML.fragment("<span>3:30pm</span>")
     => #<Nokogiri::HTML::DocumentFragment:0x7d4 name="#document-fragment" children=[#<Nokogiri::XML::Element:0x7d2 name="span" children=[#<Nokogiri::XML::Text:0x7d0 "3:30pm">]>]>

It fails in 1.5.1:

    jruby-1.6.7.2 :002 > gem 'nokogiri', '=1.5.1'
     => true
    jruby-1.6.7.2 :003 > require 'nokogiri'
     => true
    jruby-1.6.7.2 :004 > doc = Nokogiri::HTML.fragment("<span>3:30pm</span>")
     => #<Nokogiri::HTML::DocumentFragment:0x7d4 name="#document-fragment" children=[#<Nokogiri::XML::Element:0x7d2 name="span" children=[#<Nokogiri::XML::Text:0x7d0 "30pm">]>]>
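The version comparison above could be turned into a small repeatable check. This is only a sketch: `colon_text_preserved?` is a hypothetical helper name, and the parser is passed in as a block so the same check can be pointed at whichever Nokogiri version is installed. The two lambdas are stand-ins that imitate the healthy and the affected output; they are not Nokogiri itself.

```ruby
# Hypothetical helper (not part of Nokogiri, Loofah, or Sanitize).
# Returns true when the supplied parser keeps the text around a colon intact.
# The block receives the HTML string and must return the extracted text.
def colon_text_preserved?(html = "<span>3:30pm</span>", expected = "3:30pm")
  yield(html) == expected
end

# Intended use with Nokogiri (assumes the gem is installed):
#   require 'nokogiri'
#   colon_text_preserved? { |html| Nokogiri::HTML.fragment(html).text }

# Stand-in parsers that mimic the two outputs shown in the IRB sessions:
healthy  = ->(html) { html.gsub(/<[^>]*>/, "") }              # "3:30pm"
affected = ->(html) { healthy.call(html).sub(/\A[^:]*:/, "") } # "30pm"

puts colon_text_preserved?(&healthy)   # true
puts colon_text_preserved?(&affected)  # false
```

Running the commented Nokogiri call under each pinned gem version should give the same true/false split as the two sessions, if the regression is indeed in the fragment parser.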

Edit: It is worth noting that Nokogiri was built around libxml2, which is amazing and really has no equal in features, speed, and ability to handle malformed markup. The JRuby implementation is an attempt to match it using Xerces and NekoHTML. I think they have done a great job of bringing the JRuby implementation to near-complete feature parity (if not speed parity) with its MRI counterpart, considering how different the underlying implementations are. However, edge cases still crop up from time to time.

I went ahead and filed a bug report with Nokogiri.
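Until a fix lands, one possible stopgap follows from the question's own observation that `6&#58;30` came through intact: escape colons in text content before handing the markup to Loofah or Sanitize. The sketch below is an assumption-laden illustration, not a tested fix for the JRuby bug. `escape_text_colons` is a hypothetical helper, it assumes attribute values never contain `>`, and it does not treat `<script>` or `<style>` bodies specially.

```ruby
# Hypothetical helper (not part of Loofah, Sanitize, or Nokogiri).
# Replaces ":" in text content with the character reference "&#58;" and
# leaves anything that looks like a tag (<...>) untouched, so colons in
# attributes such as href="http://..." are not rewritten.
# Assumption: no attribute value contains a ">" character.
def escape_text_colons(html)
  html.gsub(/<[^>]*>|:/) { |match| match == ":" ? "&#58;" : match }
end

escaped = escape_text_colons("<span>asdfasdf 6:30 pm</span>")
puts escaped
# The escaped string would then be passed on, for example:
#   Loofah.scrub_fragment(escaped, :strip).to_html
```

Whether this avoids the truncation on every affected Nokogiri version is unverified; it only mirrors the `&#58;` case from the question.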
