Is there a way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?

I tried everything I could find before posting this. I really hope someone can help; I'm pretty desperate.

So, I have a service that uploads data to our database from XML feeds provided by clients. These XML files often claim to be UTF-8 encoded but clearly contain quite a few invalid byte sequences. I can clean these files and import them into our database by running the following Linux command first:

tr -cd '^[:print:]' < original.xml > clean.xml 

Simply executing this single Linux command allows me to import all the data into my database using Nokogiri in Rails.

The problem is that we deploy to Heroku, where I can't preprocess the file with that Linux command. I've spent the last week searching the Internet for a Rails-based solution, but none of them work. Before I walk through all the suggestions I tried, here is my original code:

 require 'nokogiri'

 data = []
 data_source = ARGV[0]
 data_file = open data_source
 data_string = data_file.read
 doc = Nokogiri::XML.parse(data_string)
 doc.xpath(".//job").each do |node|
   # build one hash of element name => content per <job> node
   hash = node.element_children.each_with_object(Hash.new) do |e, h|
     h[e.name.gsub(/ /, "_").strip.downcase.to_sym] = e.content
   end
   data.push(hash)
 end

Running this on the source file raises the error: "invalid byte sequence in UTF-8".

Here are all the helpful tips I tried; every one of them failed.

  • Use the coder gem

    Coder.clean!(data_string, "UTF-8")

  • Force the encoding

    data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')

  • Convert to UTF-16 and back to UTF-8

    data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
    data_string.encode!('UTF-8', 'UTF-16')

  • Use valid_encoding?

    data_string.chars.select { |i| i.valid_encoding? }.join

    This doesn't remove the offending characters; it still produces "invalid byte sequence" errors.

  • Specify the encoding when opening the file

I actually wrote a function that tries every possible encoding until it can open the file without errors and convert it to UTF-8 (@file_encodings is an array of every possible file encoding):

 @file_encodings.each do |enc|
   print "#{enc}..."
   conv_str = "r:#{enc}:utf-8"
   begin
     data_file = File.open(fname, conv_str)
     data_string = data_file.read
   rescue
     data_file = nil
     data_string = ""
   end
   data_string = data_string.encode(enc, :invalid => :replace, :undef => :replace, :replace => "")
   unless data_string.blank?
     print "\n#{enc} detected!\n"
     return data_string
   end
 end
  • Use a regexp to remove non-printable characters:

    data_string.gsub!(/[^[:print:]]/, "")
    data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/, "")

(I also tried variations, including /[^a-zA-Z0-9~`!@#$%^&*()-_=+[{]}\|;:'",<.>\/?]/.)

For ALL of the above the results are the same: either an "invalid byte sequence" error occurs, or the file gets cut off partway through, after reading only 4400 lines.

So why does the Linux tr command work fine, while NONE of these suggestions can do the job in Rails?

What I ended up with is extremely inelegant, but it does the job. I inspected every line that tripped up Nokogiri (row.last) and looked for strange characters. Every one I found, I added to a character class and then gsub!ed out, like this (the control characters won't print here, but you get the idea):

 data_string.gsub!(/[Crazy Control Characters]/,"") 

But the purist in me insists that there should be a more elegant, general solution.


+1
4 answers

Ruby 2.1 has a new String#scrub method that does exactly what you need.

If the string is an invalid byte sequence, it replaces the invalid bytes with the given replacement character, else returns self. If a block is given, it replaces the invalid bytes with the block's return value.

Check the docs for more information.

http://ruby-doc.org/core-2.1.0/String.html#method-i-scrub
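
A minimal sketch of scrub in action (the sample string is my own; \xE9 is a stray Latin-1 byte that is invalid in UTF-8):

 bad = "caf\xE9 latte".force_encoding("UTF-8")  # one invalid byte in an otherwise fine string
 bad.valid_encoding?        #=> false

 bad.scrub                  #=> "caf\uFFFD latte"  (default replacement character)
 bad.scrub("")              #=> "caf latte"        (invalid bytes simply dropped)
 bad.scrub { |bytes| "?" }  #=> "caf? latte"       (block decides the replacement)

There is also scrub! if you want to modify the string in place.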

+6

I found this on Stack Overflow attached to a different question, and it worked fine for me too. Assuming data_string is your XML:

data_string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
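
A quick sketch of the effect, with one caveat worth knowing (the sample strings are my own):

 s = "caf\xE9 latte".force_encoding("UTF-8")
 s.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
 s  #=> "caf latte"

 # Caveat: every byte above 0x7F is undefined when transcoding from binary,
 # so valid multibyte characters get stripped along with the invalid ones:
 "café".encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
 #=> "caf"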

+1

Try using a combination of force_encoding("ISO-8859-1") and encode("utf-8"). It helped me once.

 data_string.force_encoding("ISO-8859-1").encode("utf-8", replace: nil) 
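
A minimal sketch of why this works: every byte from 0x00 to 0xFF is a defined character in ISO-8859-1, so the re-encode can never fail (though if the data was not really Latin-1 you get mojibake rather than an error):

 latin1 = "caf\xE9".force_encoding("ISO-8859-1")  # \xE9 is é in Latin-1
 latin1.encode("utf-8")                           #=> "café"
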
0

Thanks for the answers. I found something that works by trying out all sorts of combinations of the different tools. I hope this is helpful to other people who share the same frustration.

 data_string.encode!("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")
 data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/, "")

As you can see, this is a combination of the encode method and a regular expression that removes control characters (except for newlines).

My testing showed that the file I was importing had two problems: (1) invalid UTF-8 byte sequences; and (2) non-printable control characters that made Nokogiri stop parsing before the end of the file. I had to fix both problems, in that order; otherwise gsub! itself throws the "invalid byte sequence" error.

Note that the first line of the code above can be replaced by EITHER of the following, with the same successful result:

 Coder.clean!(data_string,'UTF-8') 

or

 data_string.scrub!("") 

This worked great for me.
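
For completeness, here is the whole pipeline wrapped into one helper. This is just a sketch, and clean_xml_string is an illustrative name of my own, not part of any library:

 require 'nokogiri'

 # Strip invalid UTF-8 byte sequences first, then remove control
 # characters (keeping \n and \r), then hand the result to Nokogiri.
 def clean_xml_string(raw)
   cleaned = raw.encode("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "")
   cleaned.gsub(/[[:cntrl:]&&[^\n\r]]/, "")
 end

 data_string = clean_xml_string(File.read(ARGV[0]))
 doc = Nokogiri::XML.parse(data_string)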

0

Source: https://habr.com/ru/post/1264319/