How to create a badly encoded string in ruby?

Question

I have a file somewhere in production that I do not have access to. When a ruby script loads it, a regular expression run against the contents fails with ArgumentError => invalid byte sequence in UTF-8.

I believe that I have a fix based on the answer with all the points here: ruby 1.9: invalid byte sequence in UTF-8

    # Remove all invalid and undefined characters in the given string
    # (ruby 1.9.3)
    def safe_str str
      # edited based on matt comment (thanks matt)
      s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
      s.encode!('utf-8', 'utf-16')
    end

However, now I want to build my rspec to make sure the code works. I do not have access to the file that caused the problem, so I want to create a badly encoded string programmatically.

I tried things like:

    bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
    bad_str.length.should > safe_str(bad_str).length

or,

    bad_str = (100..1000).to_a.pack('c*')
    bad_str.length.should > safe_str(bad_str).length

but the length is always the same. I also tried different ranges of characters; not always from 100 to 1000.

Any suggestions on how to build a string with invalid encoding in a ruby 1.9.3 script?

+4

ruby character-encoding

Gsp Aug 14 '13 at 15:02


3 answers

Many one-byte strings are invalid as UTF-8, starting with the byte 0x80. So 128.chr should work.
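As a rough sketch of that suggestion: on the Ruby versions I am aware of, 128.chr on its own is tagged as ASCII-8BIT, so it only becomes an invalid UTF-8 string once it is relabelled. The variable names here are just for illustration.

```ruby
# A single byte 0x80, as returned by Integer#chr.
byte = 128.chr
byte.encoding           # ASCII-8BIT (binary), so still "valid"
byte.valid_encoding?    # true

# Relabel the same byte as UTF-8: 0x80 is a continuation byte
# with no lead byte before it, so the string is now invalid.
as_utf8 = byte.dup.force_encoding('UTF-8')
as_utf8.valid_encoding? # false
```

force_encoding changes only the encoding label, not the bytes, which is why the same byte goes from valid to invalid.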

+2

Hew wolff Aug 14 '13 at 18:33


In spec tests I've written, I haven't found a way to fix this bad encoding:

    Period%Basics

The %B sequence consistently produces ArgumentError: invalid byte sequence in UTF-8.

0

parhamr Aug 14 '13 at 18:58


matt · Accepted Answer · 2013-08-14T20:02:37+0000

Your safe_str method will (currently) never actually do anything to the string; it is a no-op. The docs for String#encode in Ruby 1.9.3 say:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

This is true for the current release of 2.0.0 (patch level 247), however a recent commit to Ruby trunk changes this, and also introduces a scrub method, which pretty much does what you want.
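For readers on a Ruby that already ships String#scrub (2.1 or later, as far as I know), a minimal sketch of how it could be used looks like this. The sample string is made up for the example.

```ruby
# Assumes Ruby 2.1+, where String#scrub is available.
# Build "a", an invalid byte 0x80, then "b", labelled as UTF-8.
bad = [0x61, 0x80, 0x62].pack('C*').force_encoding('UTF-8')

stripped = bad.scrub('')   # drop the invalid bytes entirely
marked   = bad.scrub       # default: replace with U+FFFD for UTF-8
tagged   = bad.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts stripped              # ab
puts tagged                # a<80>b
```

Passing an empty string gives the same "remove invalid characters" behaviour that safe_str is aiming for.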

Until a new version of Ruby is released, you will need to round-trip your text string through another encoding and back to clean it, as in the second example in this answer to the question you linked to, something like:

    def safe_str str
      s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
      s.encode!('utf-8', 'utf-16')
    end
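A quick way to convince yourself that the round trip really removes something is to feed it a tiny string with one known bad byte. This is only a sketch; the three-byte input is an arbitrary example.

```ruby
# Round-trip through UTF-16 so that encode actually transcodes
# (and therefore applies the invalid/undef replacement options).
def safe_str(str)
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end

# "a", an invalid byte 0x80, then "b", labelled as UTF-8.
bad = [0x61, 0x80, 0x62].pack('C*').force_encoding('UTF-8')
cleaned = safe_str(bad)

puts bad.valid_encoding?      # false
puts cleaned.valid_encoding?  # true
puts cleaned.length           # 2
```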

Please note that your first example trying to create an invalid string does not work:

    bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
    bad_str.valid_encoding? # => true

From the << docs:

If the object is an Integer, it is considered a code point and converted to a character before concatenation.

This way you always get a valid string.
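To illustrate, here is a small sketch showing that appending an Integer produces a properly encoded character rather than a raw byte. The empty string is explicitly labelled UTF-8 so the example does not depend on the source file encoding.

```ruby
# Start from an empty, mutable UTF-8 string.
s = String.new.force_encoding('UTF-8')

# 200 is treated as the codepoint U+00C8, not as the byte 0xC8,
# so it is stored as the two-byte UTF-8 sequence C3 88.
s << 200
single_bytes = s.bytes

# Appending the whole range from the question behaves the same way.
all = String.new.force_encoding('UTF-8')
(100..1000).each { |c| all << c }

puts s.valid_encoding?    # true
puts all.valid_encoding?  # true
```

Because each integer becomes one whole character, the resulting string has nothing for safe_str to strip.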

The second method, using pack, will create a string with ASCII-8BIT encoding. If you then change this using force_encoding, you can create a UTF-8 string with an invalid encoding:

    bad_str = (100..1000).to_a.pack('c*').force_encoding('utf-8')
    bad_str.valid_encoding? # => false
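Putting this together, the following sketch walks through the two steps separately and then checks that the resulting string triggers the same kind of failure as in the question. The regular expression /foo/ is an arbitrary placeholder.

```ruby
# Step 1: pack keeps the low 8 bits of each integer as a raw byte.
raw = (100..1000).to_a.pack('c*')
raw.encoding            # ASCII-8BIT
raw.valid_encoding?     # true: any byte is fine in a binary string

# Step 2: relabel the same bytes as UTF-8 without converting them.
bad_str = raw.dup.force_encoding('UTF-8')
bad_str.valid_encoding? # false

# Matching a regexp against the invalid string should raise,
# which is the failure reported for the production file.
regexp_error = begin
  bad_str =~ /foo/
  nil
rescue ArgumentError => e
  e
end

puts regexp_error.class
```

A string built this way should be a reasonable stand-in for the production file in a spec.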
