How to create a badly encoded string in Ruby?

I have a file somewhere in production that I do not have access to. When a Ruby script loads it, a regular expression match against the contents fails with ArgumentError => invalid byte sequence in UTF-8.

I believe I have a fix, based on the top-voted answer here: ruby 1.9: invalid byte sequence in UTF-8

    # Remove all invalid and undefined characters in the given string
    # (ruby 1.9.3)
    def safe_str str
      # edited based on matt comment (thanks matt)
      s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
      s.encode!('utf-8', 'utf-16')
    end

However, I now want to write an RSpec test to make sure the code works. I do not have access to the file that caused the problem, so I want to create a badly encoded string programmatically.

I have tried things like:

    bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
    bad_str.length.should > safe_str(bad_str).length

or,

    bad_str = (100..1000).to_a.pack('c*')
    bad_str.length.should > safe_str(bad_str).length

but the length is always the same. I also tried other ranges of characters, not just 100 to 1000.

Any suggestions on how to build a string with an invalid encoding in a Ruby 1.9.3 script?

3 answers

Your safe_str method (as it stood before your edit) will currently never change a string; it is a no-op. The docs for String#encode in Ruby 1.9.3 say:

Note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

This is true for the current version, 2.0.0 (patch level 247). However, a recent commit to Ruby trunk changes this, and also introduces a scrub method, which does pretty much what you want.
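For readers on a Ruby that already ships scrub (2.1 or later, as far as I know), a minimal sketch of what the call might look like follows. The helper name strip_invalid and the sample bytes are made up for illustration; they are not from the question.

```ruby
# Assumes Ruby 2.1 or later, where String#scrub exists.
# strip_invalid is a hypothetical helper name used only for this sketch.
def strip_invalid(str)
  # Passing '' as the replacement drops each invalid byte sequence.
  # With no argument, scrub would substitute U+FFFD in a UTF-8 string.
  str.scrub('')
end

# Illustrative input: a UTF-8 tagged string containing the stray byte 0xFF.
bad = "abc\xFFdef".dup.force_encoding('UTF-8')
bad.valid_encoding?                  # => false

clean = strip_invalid(bad)
clean                                # => "abcdef"
clean.valid_encoding?                # => true
```

Unlike the round trip through UTF-16, this works directly on the string's own encoding.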

Until a new version of Ruby is released, you will need to round-trip your string through a different encoding and back to clean it, as in the second example in this answer to the question you linked to. Something like:

    def safe_str str
      s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
      s.encode!('utf-8', 'utf-16')
    end

Also note that your first attempt at creating an invalid string does not work:

    bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
    bad_str.valid_encoding? # => true

From the docs for <<:

If the object is an Integer, it is considered a code point and converted to a character before concatenation.

This way you always get a valid string.
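A small sketch of that behaviour, assuming the receiver is a UTF-8 string (the helper name build_from_codepoints is hypothetical, and the explicit force_encoding is only there so the example does not depend on the source file's encoding):

```ruby
# build_from_codepoints is a hypothetical helper name used for illustration.
def build_from_codepoints(range)
  # Start from a mutable, explicitly UTF-8 tagged empty string.
  start = ''.dup.force_encoding('UTF-8')
  # String#<< with an Integer appends the character for that code point.
  range.inject(start) { |s, c| s << c }
end

str = build_from_codepoints(100..1000)
str.valid_encoding?          # => true
str.length                   # => 901, one character per code point
str.bytesize > str.length    # => true, code points above 127 need two bytes here
```

Each integer becomes a properly encoded character, so nothing for safe_str to remove ever gets into the string.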

The second method, using pack, will create a string with the ASCII-8BIT encoding. If you then change the encoding with force_encoding, you get a UTF-8 string with invalid encoding:

    bad_str = (100..1000).to_a.pack('c*').force_encoding('utf-8')
    bad_str.valid_encoding? # => false
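Putting the two pieces together, here is a rough sketch of the check the spec is after, written as plain Ruby with raise instead of RSpec matchers so it runs on its own. The helper name make_bad_str is hypothetical; safe_str is the round-trip version shown earlier in this answer.

```ruby
# Round-trip through UTF-16, dropping invalid and undefined bytes on the way.
def safe_str(str)
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end

# make_bad_str is a hypothetical helper name used for illustration.
# pack('c*') yields raw bytes (ASCII-8BIT); force_encoding only relabels them.
def make_bad_str
  (100..1000).to_a.pack('c*').force_encoding('utf-8')
end

bad_str = make_bad_str
cleaned = safe_str(bad_str)

raise 'expected an invalid string'   if bad_str.valid_encoding?
raise 'expected a valid result'      unless cleaned.valid_encoding?
raise 'expected bytes to be removed' unless cleaned.length < bad_str.length
```

In an RSpec example the last three lines would become expectations on valid_encoding? and length.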

Many single-byte strings are invalid as UTF-8: any lone byte from 0x80 upward. Therefore 128.chr should work.
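One caveat, sketched below: as far as I can tell, 128.chr comes back tagged as ASCII-8BIT (assuming Encoding.default_internal is not set), and any byte is valid in that encoding, so it still has to be relabelled with force_encoding before it counts as invalid UTF-8. The helper name lone_high_byte is hypothetical.

```ruby
# lone_high_byte is a hypothetical helper name used for illustration.
# Assumes Encoding.default_internal is unset, so 128.chr is a raw byte.
def lone_high_byte
  128.chr.dup.force_encoding('UTF-8')
end

raw = 128.chr
raw.valid_encoding?               # => true, the string is still binary

bad = lone_high_byte
bad.bytesize                      # => 1
bad.valid_encoding?               # => false

# Every lone byte from 0x80 to 0xFF is invalid when labelled as UTF-8.
all_invalid = (0x80..0xFF).all? do |b|
  ![b].pack('C').force_encoding('UTF-8').valid_encoding?
end
```

Appending such a byte to otherwise clean test data is probably the shortest way to get a fixture that trips the regular expression.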


In spec tests I have written, I have not found a way to fix this bad encoding:

    Period%Basics

The %B string consistently produces ArgumentError: invalid byte sequence in UTF-8.


Source: https://habr.com/ru/post/1497055/

