I am cleaning up HTML and running into "invalid byte sequence" errors. I followed the advice in another post and inserted the following two lines of code:
doc_scores.encode!('UTF-16', :undef => :replace, :invalid => :replace, :replace => "")
doc_scores.encode!('UTF-8')
This significantly reduced the number of errors. However, I still get the following exception in about 10-20% of cases (in other words, up to about 1 out of every 5 HTML scans):
Input is not proper UTF-8, indicate encoding ! Bytes: 0xEA 0x20 0x20 0x22
It is always the same sequence of bytes. Any ideas on how I should fix this?
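For reference, here is a minimal sketch of what I understand those two lines to do, run against a made-up string that contains the bytes from the error message. The helper name `round_trip_clean` and the sample string are hypothetical, and the sketch assumes the string is tagged as UTF-8, which may not match what my real `doc_scores` looks like.

```ruby
# Hypothetical helper: non-mutating version of the two encode! calls above.
# Transcoding to UTF-16 drops bytes that are invalid or undefined in the
# source encoding; transcoding back yields a UTF-8 string.
def round_trip_clean(text)
  text.encode('UTF-16', invalid: :replace, undef: :replace, replace: '')
      .encode('UTF-8')
end

# Made-up sample: "s" followed by the reported bytes 0xEA 0x20 0x20 0x22,
# tagged as UTF-8 (an assumption about how the real data is tagged).
sample = [0x73, 0xEA, 0x20, 0x20, 0x22].pack('C*').force_encoding('UTF-8')

cleaned = round_trip_clean(sample)

puts sample.valid_encoding?   # is the raw string valid UTF-8?
puts cleaned.valid_encoding?  # is the cleaned string valid UTF-8?
p cleaned.bytes               # bytes that survive the round trip
```

Checking `valid_encoding?` on the exact string that gets handed to the parser might help narrow down whether the bad bytes are still in `doc_scores` or are coming from some other input.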