This is fairly easy in Ruby 1.9.2: regular expressions there operate on characters rather than bytes, and 1.9.2 distinguishes bytes from characters throughout. Since you are in Rails, you should be getting everything in UTF-8. Conveniently, UTF-8 and ASCII coincide over the entire ASCII range, so once your text is UTF-8 encoded you can simply delete everything that is not between ' ' and '~':
>> "Wheré is µ~pancakes ho元use?".gsub(/[^ -~]/, '')
=> "Wher is ~pancakes house?"
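As a minimal sketch, the same substitution can be wrapped in a small method. The name `strip_non_ascii` is made up for illustration and is not part of Ruby or Rails; the magic comment on the first line is only needed on 1.9, where source files are not UTF-8 by default.

```ruby
# encoding: utf-8

# Hypothetical helper (name chosen for illustration): remove every character
# that falls outside the printable ASCII range ' '..'~'.
def strip_non_ascii(text)
  text.gsub(/[^ -~]/, '')
end

puts strip_non_ascii("Wheré is µ~pancakes ho元use?")
# => Wher is ~pancakes house?
```

Note that control characters such as "\n" and "\t" also lie outside ' '..'~', so this pattern removes them as well; widen the character class if you need to keep them.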
There is no real reason to go to all this trouble, though. Ruby 1.9 has good Unicode support, as do Rails and pretty much everything else. Working with non-ASCII text was a nightmare 15 years ago; now it is commonplace and fairly straightforward.
If you do end up with text that is not UTF-8, you have some options. If the encoding is ASCII-8BIT or BINARY, you can probably get away with s.force_encoding('utf-8'); this only relabels the bytes, so it works when they are already valid UTF-8. If you end up with something other than UTF-8 or ASCII-8BIT, you can use Iconv to re-encode it.
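A rough sketch of both cases follows. The helper names `relabel_as_utf8` and `to_utf8` are hypothetical, and the second one uses `String#encode` (available since 1.9) in place of Iconv, which was deprecated in 1.9.3 and dropped from the standard library in 2.0. It assumes you know the real source encoding; ISO-8859-1 is used here only as an example.

```ruby
# Hypothetical helper: relabel binary-tagged bytes as UTF-8 without
# changing them, and fail loudly if they are not actually valid UTF-8.
def relabel_as_utf8(bytes)
  text = bytes.dup.force_encoding('UTF-8')
  raise ArgumentError, 'bytes are not valid UTF-8' unless text.valid_encoding?
  text
end

# Hypothetical helper: transcode bytes from a known source encoding to UTF-8.
# Uses String#encode rather than Iconv (a substitution, not the original advice).
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

# "é" as UTF-8 bytes (C3 A9), but labelled ASCII-8BIT.
binary = "Wher\xC3\xA9".dup.force_encoding('ASCII-8BIT')
puts relabel_as_utf8(binary).encoding

# "é" as a single ISO-8859-1 byte (E9); this one needs real transcoding.
latin1 = "Wher\xE9".dup.force_encoding('ASCII-8BIT')
puts to_utf8(latin1, 'ISO-8859-1').encoding
```

The `valid_encoding?` check matters because `force_encoding` never inspects the bytes; relabelling ISO-8859-1 data as UTF-8 would silently produce a broken string.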