Ruby 1.9 regex encoding

I parse this feed, http://www.sixapart.com/labs/update/developers/, with Nokogiri and then run some regexes over the contents of some tags. The content is mostly UTF-8, but sometimes it is corrupt. In my case I do not care: I just need to pass the valid parts of the content through, so I am happy to treat the data as binary/ASCII-8BIT. The problem is that no matter what I do, the regular expressions in my script are treated as UTF-8 or ASCII, regardless of what I set in the encoding comment or how I create the regular expression.

Is there a solution? Can I force a regex to be binary? Can I use gsub without a regex? (I am just replacing & with &amp;.)

2 answers

You need to force the encoding of the source string and use the Regexp::FIXEDENCODING option.

1.9.3-head :018 > r = Regexp.new("chars".force_encoding("binary"), Regexp::FIXEDENCODING)
=> /chars/
1.9.3-head :019 > r.encoding
=> #<Encoding:ASCII-8BIT>
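
As a rough sketch of how such a regexp might be applied, the snippet below matches it against input that has been relabelled as binary. The sample string and the names raw, bytes and BINARY_CHARS are assumptions made up for illustration; only Regexp.new with Regexp::FIXEDENCODING comes from the answer above.

```ruby
# Assumed sample input: mostly UTF-8 text with one stray invalid byte (\xFF).
raw = "caf\xC3\xA9 \xFF chars & more".dup

# Relabel the bytes as ASCII-8BIT (binary); no byte is changed.
bytes = raw.force_encoding(Encoding::BINARY)

# Binary regexp built the same way as in the answer.
BINARY_CHARS = Regexp.new("chars".dup.force_encoding("binary"), Regexp::FIXEDENCODING)

# In a binary string every byte counts as one character,
# so the match position is a byte offset.
position = bytes =~ BINARY_CHARS
p position  # => 8
```

Because both the string and the regexp are ASCII-8BIT, the invalid UTF-8 byte is just another byte as far as matching is concerned.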

Strings have an encoding property. Try calling String#force_encoding before applying the regex.

UPD: To make your regular expression ASCII, see the accepted answer to "Ruby 1.9: Regular expressions with unknown input encoding":

# Build a regexp from the pattern transcoded to the given encoding (ASCII by default).
def get_regex(pattern, encoding = 'ASCII', options = 0)
  Regexp.new(pattern.encode(encoding), options)
end
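
A possible usage sketch for this helper, repeated here so the snippet runs on its own. The sample strings and variable names are assumptions for illustration. It suggests that a regexp built from an ASCII-only pattern, without Regexp::FIXEDENCODING, can be matched against both a valid UTF-8 string and a string relabelled as binary.

```ruby
# Helper from the answer above, repeated so this sketch is self-contained.
def get_regex(pattern, encoding = 'ASCII', options = 0)
  Regexp.new(pattern.encode(encoding), options)
end

amp = get_regex("&")

# Assumed sample inputs: one valid UTF-8 string, one binary string with a stray byte.
utf8_text   = "caf\xC3\xA9 & tea".dup.force_encoding(Encoding::UTF_8)
binary_text = "\xFF & tea".dup.force_encoding(Encoding::BINARY)

utf8_pos   = utf8_text =~ amp    # character index in the UTF-8 string
binary_pos = binary_text =~ amp  # byte index in the binary string

p amp.encoding  # => #<Encoding:US-ASCII>
p utf8_pos      # => 5
p binary_pos    # => 2
```

Note that pattern.encode('ASCII') raises an error if the pattern itself contains non-ASCII characters, so this approach only suits ASCII-only patterns.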

Source: https://habr.com/ru/post/1772500/

