Find and replace characters in a file using Python

I am trying to do transliteration: I need to replace every source character in English, read from a file, with its equivalent in another language, taken from a Unicode dictionary that I define in my source code. I can already read the English file character by character. How do I look up each character's equivalent in that dictionary and make sure it is written to a new, transliterated output file? Thanks :)

+4
2 answers

The translate method of Unicode objects is the simplest and fastest way to perform the transliteration you need. (I assume you are using Unicode rather than plain byte strings, which would make it impossible to have characters such as 'पत्र'!)

All you have to do is lay out your transliteration dict precisely, as specified in the docs I pointed you to:

  • each key must be an integer, a Unicode code point; for example, 0x0904 is the code point for ऄ, AKA "DEVANAGARI LETTER SHORT A", so to transliterate it you must use the integer 0x0904 (equivalently, decimal 2308) as the key in the dict. (For a code point table for many South Asian scripts, see this pdf.)

  • the corresponding value can be a Unicode ordinal, a Unicode string (presumably what you will use for your transliteration task, for example u'a' if you want to transliterate the Devanagari letter short A into the English letter 'a'), or None (if during the "transliteration" you just want to remove instances of that Unicode character).

Characters that are not found as keys in the dict are passed intact from input to output.
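
As a minimal sketch of the dict layout described above, assuming Python 3 (where every str is already Unicode, so the u'' prefix is unnecessary): the name translit_table and the particular mappings are illustrative assumptions only, not a real transliteration scheme.

```python
# Illustrative table only: keys are integer code points, values are a
# replacement string, another code point, or None (delete the character).
translit_table = {
    0x0904: "a",             # DEVANAGARI LETTER SHORT A -> string "a"
    ord("\u092a"): "pa",     # ord() returns the integer code point of a character
    ord("\u0924"): ord("t"), # a value may itself be a code point
    0x094D: None,            # None removes the character from the output
}


def transliterate(text):
    """Return text with mapped characters replaced; others pass through intact."""
    return text.translate(translit_table)


result = transliterate("\u0904 \u092a\u0924 xyz")
print(result)
```

Characters such as the space and "xyz" have no key in the table, so they come out unchanged.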

Once your dict is laid out this way, output_text = input_text.translate(thedict) does all the transliteration for you - and pretty fast, too. You can apply this to blocks of Unicode text of any size that fits comfortably in memory - basically, doing one text file at a time will be just fine on most machines (for example, the wonderful - and huge - Mahabharata takes at most a few tens of megabytes in any of the freely downloadable forms - Sanskrit [[cross-linked with both Devanagari and roman-transliterated forms]], English translation - available from this site).
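
The whole-file approach might look like the sketch below. It assumes Python 3 and UTF-8 files; the function name transliterate_file, the file names, and the two-entry table are hypothetical and only there to make the example runnable.

```python
import os
import tempfile


def transliterate_file(src_path, dst_path, table, encoding="utf-8"):
    """Read the whole source file, translate it in one call, write the result."""
    with open(src_path, encoding=encoding) as src:
        text = src.read()  # the entire file is held in memory
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(text.translate(table))


# Illustrative table: English letters mapped to Devanagari code points.
table = {ord("a"): "\u0905", ord("k"): "\u0915"}

# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.txt")
    dst_path = os.path.join(tmp, "output.txt")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("ka-b")
    transliterate_file(src_path, dst_path, table)
    with open(dst_path, encoding="utf-8") as f:
        demo_output = f.read()
```

Passing the encoding explicitly avoids depending on the platform default, which matters once the output contains non-ASCII characters.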

+3

Note: updated after clarification from the asker. Please read the OP's comments attached to this answer.

Something like this:

 for syllable in input_text.split_into_syllables():  # placeholder: supply your own splitter
     output_file.write(d[syllable])

Here output_file is a file object open for writing, and d is a dictionary whose keys are your source characters and whose values are the output characters. You can also try reading the file line by line rather than reading it all at once.
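
A line-by-line variant could be sketched as follows, assuming Python 3 and a mapping keyed by single characters rather than syllables. The function name transliterate_stream and the sample mapping are hypothetical, and io.StringIO stands in for real file objects so the example runs on its own.

```python
import io


def transliterate_stream(src, dst, mapping):
    """Copy src to dst one line at a time, replacing each mapped character."""
    for line in src:
        # mapping.get(ch, ch) leaves unmapped characters unchanged
        # instead of raising KeyError as mapping[ch] would.
        dst.write("".join(mapping.get(ch, ch) for ch in line))


# Illustrative mapping: keys are source characters, values are output strings.
mapping = {"a": "\u0905", "k": "\u0915"}

src = io.StringIO("ka\nxyz\n")  # stands in for open("input.txt", encoding="utf-8")
dst = io.StringIO()             # stands in for open("output.txt", "w", encoding="utf-8")
transliterate_stream(src, dst, mapping)
result = dst.getvalue()
```

Only one line is held in memory at a time, which helps when the input is too large to read in one go.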

0

Source: https://habr.com/ru/post/1301190/

