How to delete an invalid byte sequence?

I am scraping HTML and running into "invalid byte sequence" errors. Following the advice in another post, I added these two lines of code:

 doc_scores.encode!('UTF-16', :undef => :replace, :invalid => :replace, :replace => "")
 doc_scores.encode!('UTF-8')

This significantly reduced the number of errors. However, I still get the following exception in about 10-20% of cases (in other words, up to about 1 in every 5 HTML scrapes):

 Input is not proper UTF-8, indicate encoding ! Bytes: 0xEA 0x20 0x20 0x22 

It is always the same sequence of bytes. Any ideas on how I should fix this?
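
The sequence can be reproduced outside the parser. The sketch below uses only core Ruby and assumes Ruby 2.1 or later for `String#scrub`; `invalid_utf8_bytes` is a hypothetical helper name, not part of Nokogiri. It suggests that 0xEA is the only offending byte, since 0x20 and 0x22 are a plain space and a double quote:

```ruby
# Hypothetical helper: returns the bytes in +bytes+ that are not valid UTF-8.
def invalid_utf8_bytes(bytes)
  # pack builds a binary string; force_encoding only relabels it as UTF-8
  text = bytes.pack('C*').force_encoding('UTF-8')
  rejected = []
  # With a block, scrub yields each invalid byte run; record it and drop it
  text.scrub do |bad|
    rejected.concat(bad.bytes)
    ''
  end
  rejected
end

# Bytes copied from the error message
p invalid_utf8_bytes([0xEA, 0x20, 0x20, 0x22])
```

In UTF-8, 0xEA announces a three-byte character and must be followed by two continuation bytes, which a space is not. That pattern usually points to text that was written in a single-byte encoding.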

1 answer

I figured out the solution to my problem. It turned out that the encoding of the XML document I was scraping was the cause. To fix it, I now pass the encoding explicitly:

 doc = Nokogiri::XML::Reader(open(url), nil, 'ISO-8859-1')

Previously I just had:

 doc = Nokogiri::XML::Reader(open(url)) 
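
Why naming the encoding helps can be checked without Nokogiri: in ISO-8859-1 the byte 0xEA is the letter "ê", so the bytes from the error message are ordinary text once the right source encoding is given. A minimal sketch, assuming the document really is ISO-8859-1; `latin1_to_utf8` is a hypothetical helper, not a Nokogiri API:

```ruby
# Hypothetical helper: reinterpret raw bytes as ISO-8859-1 and transcode to UTF-8.
def latin1_to_utf8(raw)
  # dup leaves the caller's string untouched; force_encoding only relabels
  # the bytes, and encode performs the actual conversion
  raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

raw = [0xEA, 0x20, 0x20, 0x22].pack('C*') # bytes from the error message
text = latin1_to_utf8(raw)

puts text.valid_encoding?
puts text.inspect
```

Unlike replacing invalid bytes with an empty string, this keeps the original character instead of silently dropping it.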

Hope this helps someone.


Source: https://habr.com/ru/post/1440864/

