Python regex with Unicode symbols (Japanese)

I want to delete part of the line below: the first run of Japanese characters, which was shown in bold in the original post. The line is stored in oldString.

[DMSM-8433] 加 護 亜 依 Kago Ai - 加 護 亜 依 vs FRIDAY

I'm using the following regex in Python:

    p = re.compile(ur"( [\W]+) (?=[A-Za-z ]+–)", re.UNICODE)
    newString = p.sub("", oldString)

When I print newString, nothing has been deleted.

2 answers

To solve the problem, you can use the following snippet:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import re

    str = u'[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY'
    regex = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[A-Za-z ]+–)'
    p = re.compile(regex, re.U)
    match = p.sub("", str)
    print match.encode("UTF-8")

See the IDEONE demo

Besides adding the # -*- coding: utf-8 -*- declaration, I used @nhahtdh's character class to match Japanese characters.

Note that match has to be encoded to UTF-8 explicitly before printing, since Python 2 needs to be "reminded" that we are working with Unicode throughout.
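For anyone on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where str is already Unicode, so the ur prefix, the re.U flag and the manual encode are not needed. The names JAPANESE, old_string, original and fixed are only illustrative. It also shows why the pattern from the question deletes nothing: under Unicode matching, kanji and kana count as word characters, so [\W]+ never matches them.

```python
import re

# Same explicit ranges as in the snippet above: CJK punctuation, Hiragana,
# Katakana, full-width/half-width forms, and CJK ideographs.
JAPANESE = "[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+"

old_string = "[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY"

# Pattern from the question: \W cannot match kanji, so nothing is removed.
original = re.compile(r"( [\W]+) (?=[A-Za-z ]+–)")

# Explicit ranges, followed by a space and a lookahead for "Latin text + en dash".
fixed = re.compile(JAPANESE + r" (?=[A-Za-z ]+–)")

unchanged = original.sub("", old_string)
new_string = fixed.sub("", old_string)

print(unchanged == old_string)  # True: the original pattern found no match
print(new_string)
```

Only the first Japanese run is removed, because the lookahead requires Latin letters and then an en dash right after it; the second run is followed by "vs." and is left alone.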


I think you should use a regex like this:

 ([\p{Hiragana}\p{Katakana}\p{Han}]+) 

See also this documentation.

EDIT: I also tested it here.
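One caveat, as far as I can tell: in Python 3 the standard-library re module rejects \p{...} properties, so the pattern above needs an engine that supports them, for example the third-party regex package (pip install regex). A rough sketch under that assumption; the helper names are made up for illustration, and the second function simply returns None when regex is not installed:

```python
import re

# Pattern from the answer above, using Unicode script properties.
PATTERN = r"([\p{Hiragana}\p{Katakana}\p{Han}]+)"


def compiles_with_stdlib_re(pattern):
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def find_japanese_runs(text):
    """Find runs of Japanese script characters using the third-party
    regex package; return None if that package is not available."""
    try:
        import regex  # third-party, not part of the standard library
    except ImportError:
        return None
    return regex.findall(PATTERN, text)


print(compiles_with_stdlib_re(PATTERN))  # False on Python 3: \p is a bad escape
print(find_japanese_runs("加護亜依 Kago Ai"))
```

If installing an extra package is not an option, the explicit \u ranges from the other answer work with plain re.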


Source: https://habr.com/ru/post/1232613/

