Syllabication of Devanagari

I'm trying to split Devanagari words into syllables, for example:

    धर्मक्षेत्रे → धर् मक् षेत् रे
    dharmakshetre → dhar mak shet re

wd.split('्') 

I get the result as:

 ['धर', 'मक', 'षेत', 'रे'] 

which is only partially correct.

So I tried another word:

    कुरुक्षेत्रे → कु रुक् षेत् रे
    kurukshetre → ku ruk shet re

 ['कुरुक', 'षेत', 'रे'] 

The result is obviously incorrect.

How can I extract the syllables efficiently?

2 answers

If you look at your string character by character:

    >>> import re
    >>> data = "कुरुक्षेत्र"
    >>> re.findall(".", data)
    ['क', 'ु', 'र', 'ु', 'क', '्', 'ष', 'े', 'त', '्', 'र']

And your other string:

    >>> data = "धर्मक्षेत्रे"
    >>> re.findall(".", data)
    ['ध', 'र', '्', 'म', 'क', '्', 'ष', 'े', 'त', '्', 'र', 'े']

So you probably want to split around characters such as '्'. Let's call them marker characters for now. If you evaluate ord(data[2]), the first marker character in this string, you get 2381. Now, if you explore the code points around this value:

    >>> for i in range(2350, 2400):
    ...     print(i, chr(i))
    ...

(The output is one line per code point from 2350 to 2399, each followed by its character; for example, 2367 is ि and 2381 is ्.)

We are mainly interested in the values from 2362 to 2391, so we build a string containing those characters:

    >>> split = ""
    >>> for i in range(2362, 2392):
    ...     split += chr(i)

Next, we match every character together with an optional marker character that follows it:

    >>> re.findall(".[" + split + "]?", "धर्मक्षेत्रे")
    ['ध', 'र्', 'म', 'क्', 'षे', 'त्', 'रे']
    >>> re.findall(".[" + split + "]?", "कुरुक्षेत्र")
    ['कु', 'रु', 'क्', 'षे', 'त्', 'र']

This should come close to what you are probably looking for. If you need more complex processing, you will have to follow the link that @OphirYoktan posted.
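Neither list above is quite the split asked for in the question (धर् मक् षेत् रे). As a rough sketch, and assuming that a unit ending in the virama (्) belongs to the preceding syllable, the regex units can be merged in a second pass. The names VIRAMA, MARKS, UNIT and syllables are illustrative only; they are not part of any library.

```python
import re

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, code point 2381

# Same code point range as in the answer: 2362..2391 inclusive.
MARKS = "".join(chr(i) for i in range(2362, 2392))

# One character, optionally followed by one marker character.
UNIT = re.compile(".[" + MARKS + "]?")


def syllables(word):
    """Hypothetical helper: glue virama-terminated units onto the previous unit."""
    result = []
    for unit in UNIT.findall(word):
        if unit.endswith(VIRAMA) and result:
            # Assumption: consonant + virama closes the preceding syllable.
            result[-1] += unit
        else:
            result.append(unit)
    return result


print(syllables("धर्मक्षेत्रे"))
print(syllables("कुरुक्षेत्रे"))
```

This only covers words shaped like the two examples. A word that starts with a conjunct, or a unit that carries more than one marker character, is not handled and would need extra rules.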


Check out the unicodedata module.

    >>> import unicodedata
    >>> word = 'कुरुक्षेत्र'

Names assigned to each character:

    >>> for ch in word:
    ...     print(unicodedata.name(ch))
    ...
    DEVANAGARI LETTER KA
    DEVANAGARI VOWEL SIGN U
    DEVANAGARI LETTER RA
    DEVANAGARI VOWEL SIGN U
    DEVANAGARI LETTER KA
    DEVANAGARI SIGN VIRAMA
    DEVANAGARI LETTER SSA
    DEVANAGARI VOWEL SIGN E
    DEVANAGARI LETTER TA
    DEVANAGARI SIGN VIRAMA
    DEVANAGARI LETTER RA

General category assigned to each character:

    >>> for ch in word:
    ...     print(unicodedata.category(ch))
    ...
    Lo
    Mn
    Lo
    Mn
    Lo
    Mn
    Lo
    Mn
    Lo
    Mn
    Lo

FileFormat.info has a list of Unicode character categories.
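As a sketch of how these categories could be used, the combining signs of the Devanagari block can be collected by category instead of by a hand-picked code point range. This assumes the block U+0900..U+097F and treats both Mn (nonspacing mark) and Mc (spacing mark) as combining signs; the helper name devanagari_marks is made up for this example.

```python
import unicodedata


def devanagari_marks():
    """Hypothetical helper: combining signs in the Devanagari block (U+0900..U+097F)."""
    return {
        chr(cp)
        for cp in range(0x0900, 0x0980)
        # Mn = nonspacing mark (e.g. the virama), Mc = spacing mark (e.g. vowel sign I).
        if unicodedata.category(chr(cp)) in ("Mn", "Mc")
    }


marks = devanagari_marks()
print("\u094d" in marks)  # DEVANAGARI SIGN VIRAMA
print("\u093f" in marks)  # DEVANAGARI VOWEL SIGN I
print("\u0915" in marks)  # DEVANAGARI LETTER KA
```

Selecting by category leaves out letters such as the avagraha (U+093D) and OM (U+0950), which fall inside a plain numeric range like 2362 to 2391 but are not combining signs.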

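See whether this is what you want to achieve:

    import unicodedata

    def split_clusters(txt):
        """Generate grapheme clusters for the Devanagari text."""
        stop = '्'      # virama: joins the next consonant into the cluster
        cluster = ''
        end = None      # previous character seen

        for char in txt:
            category = unicodedata.category(char)
            # Extend the cluster with a letter that follows a virama, or with
            # any combining sign. 'Mc' is included because some vowel signs
            # (for example ा and ि) are spacing marks rather than 'Mn'.
            if (category == 'Lo' and end == stop) or category in ('Mn', 'Mc'):
                cluster = cluster + char
            else:
                # A new base character starts a new cluster.
                if cluster:
                    yield cluster
                cluster = char
            end = char

        if cluster:
            yield cluster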

Testing the function:

    >>> list(split_clusters('धर्मक्षेत्रे'))
    ['ध', 'र्म', 'क्षे', 'त्रे']
    >>> list(split_clusters('कुरुक्षेत्र'))
    ['कु', 'रु', 'क्षे', 'त्र']

Source: https://habr.com/ru/post/1272975/

