How to remove unprintable / invisible characters in Ruby?

Question

Sometimes I get evil non-printable characters in the middle of a line. These lines are user input, so I have to make my program handle them gracefully rather than try to fix the source of the problem.

For example, a line can contain a zero-width no-break space in the middle. When parsing a .po file, one problematic part was the line "he is a man of god" in the middle of the file. Although everything looks correct, checking it with irb shows:

  "he is a man of god".codepoints
  # => [104, 101, 32, 105, 115, 32, 97, 32, 65279, 109, 97, 110, 32, 111, 102, 32, 103, 111, 100]

I believe I know what a BOM is, and I even handle it properly. However, sometimes I get such characters in the middle of the file, so this is not a BOM.

My current approach is to remove all the characters that I have found to be evil, in a really smelly way:

 text = (text.codepoints - CODEPOINTS_BLACKLIST).pack("U*")

The closest I got was this post, which led me to the [:print:] character class for regular expressions. However, it did not work for me, because the character 65279 (U+FEFF) still passes through:

 "\uFEFFm".scan(/[[:print:]]/).join.codepoints
 # => [65279, 109]

So the question is: how do I remove all non-printable characters from a string in Ruby?

+7

ruby encoding non-printing-characters

fotanus May 13, '13 at 19:53

3 answers

Try the following:

 >> "aaa\f\d\x00abcd".gsub(/[^[:print:]]/, '.')
 => "aaa.d.abcd"
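As the question shows, [:print:] still matches U+FEFF, so the substitution above leaves that character in place. A minimal sketch of one way to cover it is to also remove the Unicode "format" category (\p{Cf}), which includes U+FEFF. The helper name strip_invisible is hypothetical and not from the original answer, and the sketch assumes single-line UTF-8 input, because tabs and newlines are not in [:print:] and get removed too.

```ruby
# Hypothetical helper: drops everything outside [:print:] plus all
# Unicode format characters (general category Cf, e.g. U+FEFF).
# Assumes single-line UTF-8 input: tabs and newlines are removed as well.
def strip_invisible(text)
  text.gsub(/[^[:print:]]|\p{Cf}/, '')
end

cleaned = strip_invisible("he is a \uFEFFman of god")
puts cleaned.codepoints.include?(0xFEFF) # false: the U+FEFF is gone
puts cleaned
```

Note that \p{Cf} also covers joiners such as U+200D, which some scripts and emoji sequences rely on, so a narrower character list may be the better choice for multilingual text.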

+17

snowytoxa Jul 17 '14 at 8:02

I also had the same problem in RoR version 3.9.3, and I was using Visual Studio 2010 as my editor. Notepad++ solved my problem.

If you are using Notepad++ and the problem is in a UTF-8 file:

Open the file.
From the Encoding menu, select "Encode in UTF-8 without BOM".

For more information, refer to this question.

0

Ravimallya Nov 28 '14 at 7:02

the tin man · Accepted Answer · 2013-05-13T19:59:20+0000

Ruby can help you convert one multibyte character set to another. Look through the search results, and also check out Ruby's String#encode method.

In addition, Ruby Iconv is your friend.

Finally, James Gray wrote a series of articles that elaborate on this.

Using these tools, you can either transcode a character to a visually similar one or ignore the offending characters completely.
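As a rough sketch of both options with String#encode: the :undef, :invalid, :replace and :fallback options are part of Ruby's String#encode, while the helper names and the fallback table below are hypothetical and only there for illustration. Be aware that transcoding to US-ASCII discards every non-ASCII character, not just the invisible ones, so this only fits input that is expected to be ASCII.

```ruby
# Hypothetical helper: ignore anything that has no US-ASCII equivalent
# by replacing it with an empty string.
def drop_non_ascii(text)
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

# Hypothetical helper: map characters without a US-ASCII equivalent to a
# visually similar replacement taken from a lookup table.
def transliterate(text, table)
  text.encode("US-ASCII", fallback: table)
end

puts drop_non_ascii("he is a \uFEFFman of god")
puts transliterate("caf\u00E9", { "\u00E9" => "e" })
```

With a Hash as the fallback, a character that is missing from the table raises Encoding::UndefinedConversionError, so the table has to cover every character you expect to see.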

Working with alternative character sets is one of the most ... annoying things I have had to do, because files can contain anything and still be marked as text. You may not expect it, and then your code dies or starts throwing errors, because people are so inventive at finding ways to insert alternative characters into content.
