How to replace deprecated iconv with String#encode to clean up invalid UTF-8

I pull content from the Internet, and sometimes it is not 100% valid UTF-8: it contains invalid byte sequences. I use iconv to silently drop these sequences and get a cleaned string.

 @iconv = Iconv.new('UTF-8//IGNORE', 'UTF-8')
 valid_string = @iconv.iconv(untrusted_string)

However, now that iconv is deprecated, I see its deprecation warning a lot:

iconv will be deprecated in the future, use String#encode

I tried switching to String#encode with the :invalid and :replace options, but it doesn't seem to work (i.e. the invalid byte sequences are not removed). What is the correct way to use String#encode for this?

+4
2 answers

The question Martijn linked to has two good ways to do this, but Martijn made an understandable yet incorrect change when copying the second approach into his answer here. Calling .encode('UTF-8', <options>).encode('UTF-8') does not work. As the original answer to the other question points out, the key is to encode to a different encoding and then back to UTF-8. If your source string is already tagged as UTF-8 in Ruby's internals, Ruby (1.9.3 in the examples below) will ignore any call to encode it as UTF-8.

In the following examples, I'm going to use "a#{0xFF.chr}b".force_encoding('UTF-8') to create a string that Ruby believes is UTF-8 but that contains an invalid UTF-8 byte.

 1.9.3p194 :019 > "a#{0xFF.chr}b".force_encoding('UTF-8')
  => "a\xFFb"
 1.9.3p194 :020 > "#{0xFF.chr}".force_encoding('UTF-8').encoding
  => #<Encoding:UTF-8>

Note how encoding to UTF-8 does nothing:

 1.9.3p194 :016 > "a#{0xFF.chr}b".force_encoding('UTF-8').encode('UTF-8', :invalid => :replace, :replace => '').encode('UTF-8')
  => "a\xFFb"

But encoding to something else (UTF-16) and then back to UTF-8 cleans the string:

 1.9.3p194 :017 > "a#{0xFF.chr}b".force_encoding('UTF-8').encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
  => "ab"
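
For reuse, the round trip can be wrapped in a small helper. This is only a sketch: the method name clean_utf8 is hypothetical and does not come from either answer, and it assumes the incoming bytes are meant to be UTF-8.

```ruby
# Hypothetical helper (the name is an assumption, not part of the answer).
# Drops invalid UTF-8 bytes by transcoding to UTF-16 and back to UTF-8.
def clean_utf8(untrusted_string)
  untrusted_string
    .dup                      # leave the caller's string untouched
    .force_encoding('UTF-8')  # tag the raw bytes as UTF-8
    .encode('UTF-16', :invalid => :replace, :replace => '')
    .encode('UTF-8')
end

dirty = "a#{0xFF.chr}b"
puts clean_utf8(dirty).inspect  # => "ab"
```

The dup keeps force_encoding from changing the encoding tag on the string the caller passed in.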
+6

This was already answered in the following question:

Is there a way in Ruby 1.9 to remove invalid byte sequences from strings?

Use

 untrusted_string.chars.select{|i| i.valid_encoding?}.join 

or

 untrusted_string.encode('UTF-8', :invalid => :replace, :replace => '').encode('UTF-8') 
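
As a quick check of the first approach (the chars.select one), here is a small sketch run against a string with a single invalid byte. The test string and variable names are illustrative assumptions, not real input.

```ruby
# Illustrative input: "a", the invalid byte 0xFF, then "b", tagged as UTF-8.
untrusted_string = "a#{0xFF.chr}b".force_encoding('UTF-8')

# Keep only the characters that are valid in the string's encoding.
valid_string = untrusted_string.chars.select { |c| c.valid_encoding? }.join

puts untrusted_string.valid_encoding?  # false: the 0xFF byte is invalid
puts valid_string.inspect              # "ab"
```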
+7

Source: https://habr.com/ru/post/1394580/

