I tried everything I could think of before posting this to Stack Overflow. I'm pretty desperate, and I really hope someone can help.
So, I have a service that uploads data to our database from XML feeds provided by clients. These XML files often claim to be UTF-8 encoded but clearly contain quite a few invalid byte sequences. I can clean these files and import them into our database simply by running the following Linux command before importing:
    tr -cd '[:print:]' < original.xml > clean.xml
Running that single command is all it takes to let me import the data into my database using Nokogiri in Rails.
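For context, this is roughly the byte-level cleanup tr is doing, translated literally into Ruby as a sketch (an untested illustration; I have not verified it behaves identically to the shell command):

    # Read raw bytes, keep only printable ASCII (plus tab/newline/CR, which the
    # tr invocation above would actually strip as well), then write the result out.
    raw = File.open("original.xml", "rb") { |f| f.read }
    clean = raw.delete("^\t\n\r -~")   # String#delete uses tr-style sets; "^" negates
    File.open("clean.xml", "wb") { |f| f.write(clean) }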
The problem is that we are deploying to Heroku, so I cannot preprocess the file with that Linux command. I have spent the last week scouring the Internet for a Rails-based solution, but none of them work. Before I walk through everything I have tried, here is my original code:
    data_source = ARGV[0]
    data_file = open data_source
    data_string = data_file.read

    doc = Nokogiri::XML.parse(data_string)
    doc.xpath(".//job").each do |node|
      hash = node.element_children.each_with_object(Hash.new) do |e, h|
        h[e.name.gsub(/ /,"_").strip.downcase.to_sym] = e.content
        data.push(newrow)
      end
    end
Running this against the source file gives the error: "invalid byte sequence in UTF-8".
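(For reference, here is a minimal, made-up illustration of the failure mode as I understand it: any regexp operation over a string containing an invalid UTF-8 byte raises this error, so the gsub inside the loop blows up.)

    # Toy data, not my real feed: "\xC3" with no continuation byte is invalid UTF-8.
    s = "valid text \xC3 more text".force_encoding("UTF-8")
    s.valid_encoding?   # => false
    s.gsub(/ /, "_")    # raises ArgumentError: invalid byte sequence in UTF-8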
Here are all the suggestions I tried, all of which failed.
Use the Coder gem
    Coder.clean!(data_string, "UTF-8")
Force the encoding
    data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')
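(My understanding of what that call is supposed to do, shown on the same toy data as above; I cannot see why it does not clean the real feed.)

    s = "abc\xC3(".force_encoding("UTF-8")   # invalid UTF-8
    s.force_encoding("BINARY").encode("UTF-8", :undef => :replace, :replace => "")
    # I would expect "abc(" back, with the stray byte simply dropped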
Convert to UTF-16 and back to UTF-8
    data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
    data_string.encode!('UTF-8', 'UTF-16')
Use valid_encoding?
    data_string.chars.select { |i| i.valid_encoding? }.join
This does not remove the characters; it still raises "invalid byte sequence" errors.
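(Again, using the same toy string as above, this is what I would expect the filter to do, which is why the continued errors confuse me.)

    s = "abc\xC3(".force_encoding("UTF-8")
    s.chars.select { |c| c.valid_encoding? }.join
    # I would expect "abc(" here, with the invalid byte filtered out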
Specify the encoding when opening the file
I actually wrote a function that tries every possible encoding until it can open the file without errors and convert it to UTF-8 (@file_encodings is an array of all the candidate file encodings):
    @file_encodings.each do |enc|
      print "#{enc}..."
      conv_str = "r:#{enc}:utf-8"
      begin
        data_file = File.open(fname, conv_str)
        data_string = data_file.read
      rescue
        data_file = nil
        data_string = ""
      end
      data_string = data_string.encode(enc, :invalid => :replace, :undef => :replace, :replace => "")
      unless data_string.blank?
        print "\n#{enc} detected!\n"
        return data_string
      end
    end
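(@file_encodings is essentially every encoding name Ruby knows about; building it from Encoding.name_list, as in this sketch, is one way to get that list.)

    # Sketch: one way to build the candidate encoding list.
    @file_encodings = Encoding.name_list.sort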
Use a regexp to remove non-printable characters:
    data_string.gsub!(/[^[:print:]]/, "")
    data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/, "")
(I also tried variations, including /[^a-zA-Z0-9~`!@#$%^&*()\-_=+\[{\]}\|;:'",<.>\/?]/.)
For ALL of the above, the results are the same... either I get "invalid byte sequence" errors, or the file is cut off partway through after reading only 4400 lines.
So why does the Linux tr command work fine, while NONE of these suggestions can do the job in Rails?
What I ended up with is extremely inelegant, but it does the job. I inspected every row that tripped up Nokogiri (row.last) and looked for strange characters. Every one I found, I added to a character class and then gsub!ed out, like this (the control characters won't print here, but you get the idea):
    data_string.gsub!(/[Crazy Control Characters]/,"")
But the purist in me insists that there should be a more elegant, general solution.
(I indented all of my code by four spaces, but the editor doesn't seem to be picking it up.)