Avoiding "Invalid Byte Sequence" When Finding a Text Link Using Nokogiri

Question

I am using Rails 5 with Ruby 2.4 and scanning a document that I parsed with Nokogiri, looking case-insensitively for a link with specific text:

 a_elt = doc ? doc.xpath('//a').detect { |node| /link[[:space:]]+text/i === node.text } : nil

After fetching the page's HTML into content, I parse it into a Nokogiri document using:

 doc = Nokogiri::HTML(content)

The problem is that I get

 ArgumentError invalid byte sequence in UTF-8

on certain web pages when I apply the regular expression above.

 2.4.0 :002 > doc.encoding
  => "UTF-8"
 2.4.0 :003 > doc.xpath('//a').detect { |node| /individual[[:space:]]+results/i === node.text }
 ArgumentError: invalid byte sequence in UTF-8
     from (irb):3:in `==='
     from (irb):3:in `block in irb_binding'
     from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:187:in `block in each'
     from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `upto'
     from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `each'
     from (irb):3:in `detect'
     from (irb):3
     from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
     from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
     from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
     from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
     from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
     from bin/rails:4:in `require'
     from bin/rails:4:in `<main>'

Is there a way to rewrite the above so that it automatically accounts for encoding problems or strange characters, rather than raising an error?

ruby ruby-on-rails encoding nokogiri

Dave Feb 16 '17 at 20:56

1 answer

ErvalhouS · Accepted Answer · 2017-02-23T09:53:11+0000

Your question may already have been answered. Have you tried the approaches from "Is there a way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?"?

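In particular, before parsing (and so before the detect block), try removing invalid bytes and control characters, other than newlines and carriage returns, from the HTML string. scrub! and gsub! are String methods, so call them on content rather than on the Nokogiri document:

 content.scrub!("")
 content.gsub!(/[[:cntrl:]&&[^\n\r]]/, "")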

Remember that scrub! is available only in Ruby 2.1 and later.
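
If you would rather not modify the input in place, one possible variant is to clean each link's text just before matching it. The following is only a sketch under stated assumptions: clean_text is a hypothetical helper (it is not part of Nokogiri or of the linked answer), and the texts array merely stands in for the node.text values a real document would return. Note that [[:cntrl:]] also covers tab characters, so those are removed as well.

```ruby
# Hypothetical helper (an assumption for illustration, not a Nokogiri API):
# returns a copy of str with invalid bytes and control characters removed,
# keeping only \n and \r among the control characters.
def clean_text(str)
  str.scrub("").gsub(/[[:cntrl:]&&[^\n\r]]/, "")
end

# Stand-ins for node.text values; the second one contains a byte (\xFF)
# that is never valid in UTF-8.
texts = ["Home", "link \xFF text", "About"]

# Match against the cleaned copy so the regexp never sees invalid bytes.
match = texts.detect { |t| /link[[:space:]]+text/i === clean_text(t) }

puts clean_text(match).inspect
```

With a real document, the same idea would read doc.xpath('//a').detect { |node| /link[[:space:]]+text/i === clean_text(node.text) }, which leaves both content and doc untouched.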
