Regex python with unicode symbol (japanese)

Question

Regex python with unicode symbol (japanese)

I want to delete the part of the line (shown in bold) below, this is saved in the oldString line

[DMSM-8433] 加護亜依 Kago Ai - 加護亜依 vs FRIDAY

im using the following regex inside python

p=re.compile(ur"( [\W]+) (?=[A-Za-z ]+–)", re.UNICODE) newString=p.sub("", oldString)

when i print newString nothing was deleted

+5

python regex unicode

Paul thomas 30 sept '15 at 10:19

source share

2 answers

I think you should use a regex like this:

 ([\p{Hiragana}\p{Katakana}\p{Han}]+)

see also this documentation.

EDIT: I also tested it here .

0

teoreda 30 sept '15 at 10:30

source share

Wiktor stribiżew · Accepted Answer · 2015-09-30T14:18:24+0000

To solve the problem, you can use the following snippet:

 #!/usr/bin/python # -*- coding: utf-8 -*- import re str = u'[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY' regex = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[A-Za-z ]+–)' p = re.compile(regex, re.U) match = p.sub("", str) print match.encode("UTF-8")

See the IDEONE demo

Besides the declaration # -*- coding: utf-8 -*- , I added the @nhahtdh character class to detect Japanese characters .

Note that match needs to be encoded as a UTF-8 string “manually”, since Python 2 needs to be “reminded” that we work with Unicode all the time.

Regex python with unicode symbol (japanese)

More articles: