Syllabication of Devanagari

Question


I'm trying to split Devanagari words into syllables:

धर्मक्षेत्रे → धर् मक् षेत् रे
dharmakshetre → dhar mak shet re

wd.split('्')

I get the result as:

 ['धर', 'मक', 'षेत', 'रे']

which is partially correct.

I tried another word:

कुरुक्षेत्र → कु रुक् षेत् रे
kurukshetre → ku ruk she tre

 ['कुरुक', 'षेत', 'रे']

The result is obviously incorrect.

How can I extract the syllables efficiently?


python string python-3.x utf devanagari

Echchama nayak Oct 29 '17 at 13:21


2 answers

Check out the unicodedata module.

    >>> import unicodedata
    >>> word = 'कुरुक्षेत्र'

Names assigned to each character:

    >>> for ch in word:
    ...     print(unicodedata.name(ch))
    ...
    DEVANAGARI LETTER KA
    DEVANAGARI VOWEL SIGN U
    DEVANAGARI LETTER RA
    DEVANAGARI VOWEL SIGN U
    DEVANAGARI LETTER KA
    DEVANAGARI SIGN VIRAMA
    DEVANAGARI LETTER SSA
    DEVANAGARI VOWEL SIGN E
    DEVANAGARI LETTER TA
    DEVANAGARI SIGN VIRAMA
    DEVANAGARI LETTER RA

General category assigned to each character:

    >>> for ch in word:
    ...     print(unicodedata.category(ch))
    ...
    Lo
    Mn
    Lo
    Mn
    Lo
    Mn
    Lo
    Mn
    Lo
    Mn
    Lo

FileFormat.info has a list of Unicode character categories.

See whether this is what you want to achieve:

    import unicodedata

    def split_clusters(txt):
        """ Generate grapheme clusters for the Devanagari text."""
        stop = '्'
        cluster = u''
        end = None
        for char in txt:
            category = unicodedata.category(char)
            if (category == 'Lo' and end == stop) or category == 'Mn':
                cluster = cluster + char
            else:
                if cluster:
                    yield cluster
                cluster = char
            end = char
        if cluster:
            yield cluster

Testing the function:

    >>> list(split_clusters('धर्मक्षेत्रे'))
    ['ध', 'र्म', 'क्षे', 'त्रे']
    >>> list(split_clusters('कुरुक्षेत्र'))
    ['कु', 'रु', 'क्षे', 'त्र']
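One limitation worth noting: the function only attaches marks whose category is Mn, but several Devanagari vowel signs (for example ा, ि, ी, ो and ौ) are spacing marks with category Mc, so they would start a cluster of their own. Below is a minimal sketch of a variant that accepts any mark category. The name split_clusters_any_mark is made up for this illustration and is not part of the answer above.

```python
import unicodedata

VIRAMA = '\u094d'  # DEVANAGARI SIGN VIRAMA

def split_clusters_any_mark(txt):
    """Like split_clusters, but attach Mc (spacing) marks as well as Mn."""
    cluster = ''
    prev = None
    for char in txt:
        category = unicodedata.category(char)
        # A character joins the current cluster if it is any kind of mark,
        # or a letter that directly follows a virama (a conjunct).
        joins = category.startswith('M') or (category == 'Lo' and prev == VIRAMA)
        if joins:
            cluster += char
        else:
            if cluster:
                yield cluster
            cluster = char
        prev = char
    if cluster:
        yield cluster

print(list(split_clusters_any_mark('धर्मक्षेत्रे')))  # ['ध', 'र्म', 'क्षे', 'त्रे']
print(list(split_clusters_any_mark('काशी')))        # ['का', 'शी']
```

For the two words in the question the result is unchanged, because both contain only Mn marks.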


srig Nov 05 '17 at 15:56


Tarun lalwani · Accepted Answer · 2017-11-03T06:01:13+0000

If you look at your string character by character:

    >>> import re
    >>> data = "कुरुक्षेत्र"
    >>> re.findall(".", data)
    ['क', 'ु', 'र', 'ु', 'क', '्', 'ष', 'े', 'त', '्', 'र']

And your other string:

    >>> data = "धर्मक्षेत्रे"
    >>> re.findall(".", data)
    ['ध', 'र', '्', 'म', 'क', '्', 'ष', 'े', 'त', '्', 'र', 'े']

So you are presumably splitting on '्'. Let's call characters like this sign characters for now. ord(data[2]), the first sign character in this string, is 2381. Now, if you look at the code points around this value:

    >>> for i in range(2350, 2400):
    ...     print(i, chr(i))
    ...
    2350 म
    2351 य
    2352 र
    2353 ऱ
    2354 ल
    2355 ळ
    2356 ऴ
    2357 व
    2358 श
    2359 ष
    2360 स
    2361 ह
    2362 ऺ
    2363 ऻ
    2364 ़
    2365 ऽ
    2366 ा
    2367 ि
    2368 ी
    2369 ु
    2370 ू
    2371 ृ
    2372 ॄ
    2373 ॅ
    2374 ॆ
    2375 े
    2376 ै
    2377 ॉ
    2378 ॊ
    2379 ो
    2380 ौ
    2381 ्
    2382 ॎ
    2383 ॏ
    2384 ॐ
    2385 ॑
    2386 ॒
    2387 ॓
    2388 ॔
    2389 ॕ
    2390 ॖ
    2391 ॗ
    2392 क़
    2393 ख़
    2394 ग़
    2395 ज़
    2396 ड़
    2397 ढ़
    2398 फ़
    2399 य़

We are mainly interested in the code points from 2362 to 2391, so we build a string containing those characters:

    >>> split = ""
    >>> for i in range(2362, 2392):
    ...     split += chr(i)
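Note that this hard-coded range also takes in two characters that are letters rather than combining signs: ऽ (2365) and ॐ (2384). As an alternative sketch, not part of the original answer, the sign characters can be collected by Unicode general category instead. The name signs and the choice to scan the whole Devanagari block (U+0900 to U+097F) are assumptions made for this illustration.

```python
import unicodedata

# Every code point in the Devanagari block whose general category
# is a mark (Mn, Mc or Me). Unassigned code points report 'Cn' and
# are skipped automatically.
signs = ''.join(
    chr(cp)
    for cp in range(0x0900, 0x0980)
    if unicodedata.category(chr(cp)).startswith('M')
)

print('\u094d' in signs)  # virama (2381), a mark: True
print('\u093d' in signs)  # avagraha (2365), a letter: False
```

A string built this way can be placed inside a regex character class in the same way as split.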

Next, we match every character, optionally followed by one of these sign characters:

    >>> re.findall(".[" + split + "]?", "धर्मक्षेत्रे")
    ['ध', 'र्', 'म', 'क्', 'षे', 'त्', 'रे']
    >>> re.findall(".[" + split + "]?", "कुरुक्षेत्र")
    ['कु', 'रु', 'क्', 'षे', 'त्', 'र']

This should come close to what you are probably looking for. If you need more complex processing, you will have to go with the link @OphirYoktan posted.
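If the goal is the grouping shown in the question (धर् मक् षेत् रे), one possible next step is to glue every piece that ends in a virama onto the piece before it. This is only a sketch of that idea, not a full syllabification algorithm. The function name syllables is hypothetical, and the sign range is the same 2362 to 2391 range used above.

```python
import re

VIRAMA = '\u094d'  # DEVANAGARI SIGN VIRAMA (code point 2381)
SIGNS = ''.join(chr(i) for i in range(2362, 2392))
PIECE = re.compile('.[' + SIGNS + ']?')

def syllables(word):
    """Merge pieces ending in a virama into the preceding piece."""
    result = []
    for piece in PIECE.findall(word):
        if piece.endswith(VIRAMA) and result:
            # A consonant with a virama has no vowel of its own,
            # so it closes the previous syllable.
            result[-1] += piece
        else:
            result.append(piece)
    return result

print(syllables('धर्मक्षेत्रे'))  # ['धर्', 'मक्', 'षेत्', 'रे']
print(syllables('कुरुक्षेत्र'))   # ['कु', 'रुक्', 'षेत्', 'र']
```

Words that begin with a conjunct, or that rely on other syllable-boundary rules, would likely need additional handling.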
