How can I get Mechanize to automatically convert a response body to UTF-8?

I found several solutions that use post_connect_hooks and pre_connect_hooks, but they do not seem to work. I am using the latest version of Mechanize (2.1). The new version no longer has the [:response] fields, and I do not know where to find their equivalent.

Is it possible to have Mechanize return the body already encoded as UTF-8, instead of converting it manually with iconv?

4 answers

Since Mechanize 2.0, the arguments passed to the pre_connect_hooks and post_connect_hooks callbacks have changed.

See the Mechanize documentation:

pre_connect_hooks()

A list of hooks to call before retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.

post_connect_hooks()

A list of hooks to call after retrieving a response. Hooks are called with the agent, the URI, the response, and the response body.

You can no longer change the internal response body from a hook, because the hook argument is not an array. The next best approach is to replace the internal parser with your own:

    require 'mechanize'

    class MyParser
      def self.parse(thing, url = nil, encoding = nil,
                     options = Nokogiri::XML::ParseOptions::DEFAULT_HTML, &block)
        # Insert your conversion code here. For example:
        #   thing = NKF.nkf("-wm0X", thing).sub(/Shift_JIS/, "utf-8")
        # You need to rewrite the content charset if it exists.
        Nokogiri::HTML::Document.parse(thing, url, encoding, options, &block)
      end
    end

    agent = Mechanize.new
    agent.html_parser = MyParser
    page = agent.get('http://somewhere.com/')
    ...

I found a solution that works very well:

    class HtmlParser
      def self.parse(body, url, encoding)
        body.encode!('UTF-8', encoding, invalid: :replace, undef: :replace, replace: '')
        Nokogiri::HTML::Document.parse(body, url, 'UTF-8')
      end
    end

    Mechanize.new.tap do |web|
      web.html_parser = HtmlParser
    end

I have not run into any problems with it.
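To see what the `encode!` call in this parser does, here is a minimal standalone sketch that uses only Ruby's core `String#encode`, without Mechanize or Nokogiri. The helper name `to_utf8` and the sample byte strings are assumptions made up for illustration.

```ruby
# Hypothetical helper (not part of Mechanize): transcode raw bytes that are
# assumed to be in source_encoding into UTF-8, dropping anything that cannot
# be converted instead of raising an error.
def to_utf8(raw, source_encoding)
  raw.encode('UTF-8', source_encoding, invalid: :replace, undef: :replace, replace: '')
end

# "café" as Windows-1252 bytes: the é is the single byte 0xE9.
converted = to_utf8("caf\xE9".b, 'Windows-1252')

# 0x81 has no mapping in Windows-1252, so undef: :replace removes it.
cleaned = to_utf8("a\x81b".b, 'Windows-1252')

puts converted.encoding         # UTF-8
puts converted.valid_encoding?  # true
puts cleaned                    # ab
```

Note that `replace: ''` silently discards unconvertible bytes, so a wrong source encoding produces missing characters rather than an exception.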


How about something like this:

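    class Mechanize
      alias_method :original_get, :get

      def get(*args)
        doc = original_get(*args)
        # Non-HTML responses (plain files, for example) have no encoding setter.
        doc.encoding = 'utf-8' if doc.respond_to?(:encoding=)
        doc
      end
    end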

In your script, just set: page.encoding = 'utf-8'

However, depending on your scenario, you may need to do the reverse and set the encoding of the website that Mechanize is working with instead. To find it, open Firefox, load the website you want Mechanize to work with, select "Tools" in the menu bar, and open "Page Info". It shows which encoding the page uses.

With that information, set the page's actual encoding instead (for example, page.encoding = 'windows-1252').
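Why the declared encoding matters can be sketched without Mechanize, using only core `String` methods. The helper `normalize_to_utf8` and the sample bytes below are hypothetical, and the "valid UTF-8 first, then fall back" rule is an assumption for illustration, not how Mechanize itself decides.

```ruby
# Hypothetical helper: keep the body if its bytes are already valid UTF-8,
# otherwise treat them as fallback_encoding and transcode to UTF-8.
def normalize_to_utf8(raw, fallback_encoding)
  as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return as_utf8 if as_utf8.valid_encoding?

  raw.dup.force_encoding(fallback_encoding).encode(Encoding::UTF_8)
end

utf8_bytes   = "caf\xC3\xA9".b  # "café" encoded as UTF-8
cp1252_bytes = "caf\xE9".b      # "café" encoded as Windows-1252

from_utf8   = normalize_to_utf8(utf8_bytes, 'Windows-1252')
from_cp1252 = normalize_to_utf8(cp1252_bytes, 'Windows-1252')

puts from_utf8 == from_cp1252  # true
```

The same bytes decode to different text depending on the encoding they are labeled with, which is why a page served as windows-1252 needs that label rather than 'utf-8'.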


Source: https://habr.com/ru/post/1391016/
