How do you use Unicode characters in a regular expression in Ruby?

I am trying to write a line of code that will take a line of Japanese text and delete a specific character set. However, I am having problems using Unicode characters inside a regex.

I am currently using text.gsub(/《.*?》/u, ''), but I am getting an error:

 'gsub': invalid byte sequence in Windows-31J (ArgumentError) 

Can someone tell me what I am doing wrong?

Sample text: その仕草《しぐさ》があまりに無造作《むぞうさ》だったので

Expected result: その仕草があまりに無造作だったので

thanks

Edit: the # encoding: utf-8 magic comment is present at the top of the script.

1 answer

Try the following:

 text.encode('utf-8', 'utf-8').gsub(/《.*?》/u, '') 
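A likely explanation, assuming the text was read on a system whose default external encoding is Windows-31J: the magic comment only sets the encoding of literals in the source file, so a string read at runtime can still be tagged Windows-31J even though its bytes are UTF-8. The sketch below reproduces that situation with a hard-coded string and relabels it with force_encoding, which is a swapped-in alternative to the encode call above, not the answerer's exact method. The variable names are illustrative only.

```ruby
# encoding: utf-8

# Assumption: the real text comes from a file or console; here it is
# hard-coded so the example is self-contained.
utf8_text = "その仕草《しぐさ》があまりに無造作《むぞうさ》だったので"

# Simulate the problem: UTF-8 bytes carrying a Windows-31J encoding tag.
mislabeled = utf8_text.dup.force_encoding("Windows-31J")

# Relabel the bytes as UTF-8 (no transcoding happens), then strip
# everything between 《 and 》, including the brackets themselves.
relabeled = mislabeled.dup.force_encoding("UTF-8")
result = relabeled.gsub(/《.*?》/, "")

puts result
```

If the text comes from a file, passing the encoding when opening it, for example File.read(path, encoding: "UTF-8"), should avoid the mislabeling in the first place. The /u flag forces the regex itself to UTF-8 and is redundant when the source file is already UTF-8.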

Source: https://habr.com/ru/post/1399702/

