How to remove non-printable / invisible characters in Ruby?

Sometimes I have evil non-printable characters in the middle of a string. These strings are user input, so I have to make my program tolerant of them rather than try to fix the source of the problem.

For example, a string can contain a zero width no-break space in the middle. When parsing a .po file, one problematic part was the string "he is a man of god" in the middle of the file. Although everything looks correct, inspecting it in irb shows:

  "he is a man of god".codepoints => [104, 101, 32, 105, 115, 32, 97, 32, 65279, 109, 97, 110, 32, 111, 102, 32, 103, 111, 100] 

I believe I know what a BOM is, and I even handle it properly. However, sometimes such characters appear in the middle of the file, so this is not a BOM.

My current approach is to remove every character I have found to be evil, in a really smelly way:

 text = (text.codepoints - CODEPOINTS_BLACKLIST).pack("U*") 

The closest I got was this post, which led me to the [[:print:]] bracket expression for regular expressions. However, it did not help in my case:

 "m".scan(/[[:print:]]/).join.codepoints => [65279, 109] 

So the question is: how do I remove all non-printable characters from a string in Ruby?

+7
3 answers

Ruby can help you convert from one multibyte character set to another. Check the search results, and also read up on Ruby's String#encode method.

In addition, Ruby Iconv is your friend.

Finally, James Gray wrote a series of articles that elaborate on this.

With these tools, you can either transcode a character to a visually similar one or drop it entirely.

Working with alternative character sets is one of the most ... annoying things I have had to do, because files can contain anything and still be marked as text. You may not expect this, and then your code will die or start throwing errors, because people are so inventive at finding ways to insert alternative characters into content.
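
As a rough sketch of the String#encode route, assuming it is acceptable to lose every non-ASCII character (often not the case for translated text such as .po files), the helper name to_ascii_only below is hypothetical and not part of the original answer:

```ruby
# Hypothetical helper, not from the original answer.
# Transcodes to US-ASCII and silently drops every character that has no
# ASCII representation, which includes U+FEFF (codepoint 65279).
def to_ascii_only(text)
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

dirty = "he is a \uFEFFman of god"
puts to_ascii_only(dirty).codepoints.include?(0xFEFF)  # prints false
```

Passing a different replace: string (for example "?") keeps a visible marker where a character was dropped, which may be easier to debug than silent removal.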

+2

Try the following:

 >> "aaa\f\d\x00abcd".gsub(/[^[:print:]]/, '.')
 => "aaa.d.abcd"
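
Note that, as the question shows, Ruby's [[:print:]] class treats U+FEFF (65279) as printable, so the substitution above leaves it in place. One possible alternative is to match the Unicode "format" general category with \p{Cf}. The helper name strip_format_chars is hypothetical, and this is only a sketch: \p{Cf} also matches characters such as the zero width joiner and non-joiner, which some scripts and emoji sequences legitimately need, so it may be too aggressive for some input.

```ruby
# Hypothetical helper, not from the original answers.
# Removes characters in Unicode general category Cf (Format),
# which includes U+FEFF, U+200B, U+200C and U+200D.
def strip_format_chars(text)
  text.gsub(/\p{Cf}/, "")
end

dirty = "he is a \uFEFFman of god"
clean = strip_format_chars(dirty)
puts clean.codepoints.include?(0xFEFF)  # prints false
```

To also drop control characters such as "\x00", the pattern could be widened to /[\p{Cf}\p{Cc}]/, at the cost of removing newlines and tabs as well.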
+17

I had the same problem in ROR version 3.9.3, with Visual Studio 2010 as my editor. Notepad++ solved my problem.

If you are using Notepad++ and the problem is in a UTF-8 file:

  • Open file
  • From the Encoding menu, select Encode in UTF-8 without BOM.

For more information, refer to this question.

0

Source: https://habr.com/ru/post/944920/
