How to delete an invalid byte sequence?

Question

I am cleaning up HTML and running into "invalid byte sequence" errors. I followed the advice in another post and inserted the following two lines of code:

 doc_scores.encode!('UTF-16', :undef => :replace, :invalid => :replace, :replace => "")
 doc_scores.encode!('UTF-8')

This significantly reduced the number of errors, but I still get the following exception in about 10-20% of cases (in other words, about 1 out of every 5 HTML scans):

 Input is not proper UTF-8, indicate encoding ! Bytes: 0xEA 0x20 0x20 0x22

It is always the same sequence of bytes. Any ideas on how I should fix this?

ruby html-parsing utf-8

Evan zamir Oct 19 '12 at 19:06

1 answer

Evan zamir · Accepted Answer · 2012-10-19T22:41:21+0000

I figured out the solution to my problem. It turns out the problem was the encoding of the XML document I was scraping. To fix this, I now pass the encoding explicitly:

 doc = Nokogiri::XML::Reader(open(url),nil,'ISO-8859-1')

Before, I just had:

 doc = Nokogiri::XML::Reader(open(url))
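
For anyone wondering why the explicit encoding matters, here is a minimal sketch using only core Ruby strings (no Nokogiri). It assumes the four bytes from the error message really are ISO-8859-1 text, as in my case; the variable names are just for illustration. Stripping the invalid byte makes the string valid UTF-8 but loses a character, while decoding with the right encoding keeps it:

```ruby
# The four bytes from the error message, tagged as UTF-8 the way a parser
# would treat them when no encoding is declared.
raw = "\xEA\x20\x20\x22".dup.force_encoding(Encoding::UTF_8)

# 0xEA opens a three-byte UTF-8 sequence, but 0x20 is not a continuation
# byte, so the string is not valid UTF-8.
utf8_ok = raw.valid_encoding?

# The UTF-16 round trip from the question drops the offending byte.
dropped = raw.encode('UTF-16', invalid: :replace, undef: :replace, replace: '')
             .encode('UTF-8')

# Reinterpreting the same bytes as ISO-8859-1 (assumed source encoding)
# keeps the character: 0xEA is "ê" in that encoding.
decoded = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts utf8_ok   # false
p dropped
p decoded      # "ê  \""
```

So the round trip only hides the symptom. If the source encoding is known, telling the parser about it is the less lossy option.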

Hope this helps someone.
