If you look at your string character by character with
>>> import re
>>> data = "कुरुक्षेत्र"
>>> re.findall(".", data)
['क', 'ु', 'र', 'ु', 'क', '्', 'ष', 'े', 'त', '्', 'र']
And for your other line:
>>> data = "धर्मक्षेत्रे"
>>> re.findall(".", data)
['ध', 'र', '्', 'म', 'क', '्', 'ष', 'े', 'त', '्', 'र', 'े']
So you probably want to keep each character together with the sign that follows it, such as '्'. Let's call these signs marker symbols for now. For the first marker in the string above, ord(data[2]) gives 2381. Now look at the code points around that value:
>>> for i in range(2350, 2400):
...     print(i, chr(i))
...
2350 म
2351 य
2352 र
2353 ऱ
2354 ल
2355 ळ
2356 ऴ
2357 व
2358 श
2359 ष
2360 स
2361 ह
2362 ऺ
2363 ऻ
2364 ़
2365 ऽ
2366 ा
2367 ि
2368 ी
2369 ु
2370 ू
2371 ृ
2372 ॄ
2373 ॅ
2374 ॆ
2375 े
2376 ै
2377 ॉ
2378 ॊ
2379 ो
2380 ौ
2381 ्
2382 ॎ
2383 ॏ
2384 ॐ
2385 ॑
2386 ॒
2387 ॓
2388 ॔
2389 ॕ
2390 ॖ
2391 ॗ
2392 क़
2393 ख़
2394 ग़
2395 ज़
2396 ड़
2397 ढ़
2398 फ़
2399 य़
We are mainly interested in the values from 2362 to 2391, most of which are signs that attach to the preceding character. So we build a string containing those characters:
>>> split = ""
>>> for i in range(2362, 2392):
...     split += chr(i)
...
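As a sanity check on that range, here is a minimal sketch that asks Python's `unicodedata` module which of those characters are actually classified as combining marks. The variable names `candidates`, `marks`, and `non_marks` are made up for illustration, and the exact categories depend on the Unicode version bundled with your Python.

```python
import unicodedata

# Same range as above: code points 2362 through 2391 inclusive.
candidates = [chr(i) for i in range(2362, 2392)]

# General categories starting with "M" (Mn, Mc, Me) are combining marks.
marks = [c for c in candidates if unicodedata.category(c).startswith("M")]
non_marks = [c for c in candidates if c not in marks]

# Show the characters in the range that are NOT combining marks.
for c in non_marks:
    print(ord(c), unicodedata.name(c, "<unnamed>"), unicodedata.category(c))
```

Anything listed by the loop is a standalone character that happens to fall inside the range, so you may want to exclude it from `split` if it matters for your data.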
Next, we match any character, optionally followed by one of those marker symbols:
>>> re.findall(".[" + split + "]?", "धर्मक्षेत्रे")
['ध', 'र्', 'म', 'क्', 'षे', 'त्', 'रे']
>>> re.findall(".[" + split + "]?", "कुरुक्षेत्र")
['कु', 'रु', 'क्', 'षे', 'त्', 'र']
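The same idea can be wrapped in a small helper. This is only a sketch: `split_units` and `UNIT` are hypothetical names, and the character class is written as the range U+093A to U+0957, which is 2362 to 2391 in decimal, instead of being built in a loop.

```python
import re

# One arbitrary character, optionally followed by a single character
# from U+093A..U+0957 (decimal 2362..2391), the range chosen above.
UNIT = re.compile(".[\u093a-\u0957]?")

def split_units(text):
    """Split text into a character plus at most one following marker."""
    return UNIT.findall(text)

print(split_units("कुरुक्षेत्र"))
print(split_units("धर्मक्षेत्रे"))
```

Note that `?` attaches at most one marker to each character. If your text can carry several markers on one character, replacing `?` with `*` would keep them all in the same unit.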
This should come close to what you are looking for. If you need more complex processing, you will have to follow the link that @OphirYoktan posted.