Is it possible to replace characters in a string using a dictionary?

I would like to replace all accented characters with their unaccented equivalents:

conversion_dict = {"ä": "a", "ö": "o", "ü": "u","Ä": "A", "Ö": "O", "Ü": "U", "á": "a", "à": "a", "â": "a", "é": "e", "è": "e", "ê": "e", "ú": "u", "ù": "u", "û": "u", "ó": "o", "ò": "o", "ô": "o", "Á": "A", "À": "A", "Â": "A", "É": "E", "È": "E", "Ê": "E", "Ú": "U", "Ù": "U", "Û": "U", "Ó": "O", "Ò": "O", "Ô": "O","ß": "s"} 

Is there a way to do something like "paragraph of text".replace([conversion_dict]) ?

+6
5 answers

preferred method using a third-party module

Much better than the method below is to use the awesome unidecode module:

    >>> import unidecode
    >>> somestring = u"äüÊÂ"
    >>> unidecode.unidecode(somestring)
    'auEA'

built-in, slightly dangerous method

Inferring from your question that you want to normalize Unicode characters, there is actually a nice, built-in way to do this:

    >>> somestring = u"äüÊÂ"
    >>> somestring
    u'\xe4\xfc\xca\xc2'
    >>> import unicodedata
    >>> unicodedata.normalize('NFKD', somestring).encode('ascii', 'ignore')
    'auEA'

Check out the documentation for unicodedata.normalize.

Please note, however, that there may be some problems with this. See this post for a nice explanation and some workarounds.

See also latin-1-to-ascii for alternatives.
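For readers on Python 3, the built-in method above could be sketched roughly as follows. The function names strip_accents and to_ascii are made up for this illustration, and the first variant (dropping only combining marks) is a swapped-in alternative, not what the session above does.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop only the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def to_ascii(text):
    """Python 3 counterpart of the session above: non-ASCII leftovers are dropped."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("äüÊÂ"))    # auEA
print(to_ascii("äüÊÂ"))         # auEA
print(strip_accents("Straße"))  # Straße (ß has no decomposition, so it is kept)
print(to_ascii("Straße"))       # Strae  (ß is silently dropped)
```

The difference between the two variants only shows up for characters that have no ASCII base letter after decomposition.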

+8
    for k, v in conversion_dict.items():
        txt = txt.replace(k, v)

ETA: This is not "terrible" at all. Here is a timing for a toy case, where we run the replacement on a string of 100,000 characters, using a dictionary that contains 56 character mappings, none of which occur in the string:

    import timeit

    NUM_REPEATS = 100000
    conversion_dict = dict([(chr(i), "C") for i in xrange(100)])
    txt = "A" * 100000

    def replace(x):
        for k, v in conversion_dict.items():
            x = x.replace(k, v)

    t = timeit.Timer("replace(txt)", setup="from __main__ import replace, txt")
    print t.timeit(NUM_REPEATS) / NUM_REPEATS, "sec / call"

On my computer, I get a runtime of:

 0.0056938188076 sec / call 

So about half of one hundredth of a second for 100,000 characters. Now, some of the characters will actually be in the string, and this will slow it down, but in almost any reasonable situation the characters to be replaced will be much rarer than the other characters. However, jterrace's answer is perfect.
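A rough Python 3 version of the loop, wrapped in a function that returns its result, might look like this. The name replace_all and the shortened dictionary are assumptions for the sketch (the question's full conversion_dict would work the same way), and no timing figure is claimed for it.

```python
import timeit

# Abbreviated, hypothetical subset of the question's conversion_dict.
conversion_dict = {"ä": "a", "ö": "o", "ü": "u", "é": "e", "ß": "s"}

def replace_all(text, mapping):
    """Apply every replacement in turn; each str.replace scans the whole string."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

print(replace_all("Käse über Straße", conversion_dict))  # Kase uber Strase

# Time the loop on a long string that contains none of the mapped characters.
long_text = "A" * 100000
seconds = timeit.timeit(lambda: replace_all(long_text, conversion_dict), number=100)
print(seconds / 100, "sec / call")
```

The cost grows with the number of dictionary entries, because every entry triggers one full pass over the string.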

+5

This is a VFAQ. See this SO question or google "python asciify" or "python unaccent".

To create a decent dictionary for use with unicode.translate, you need an approach that automatically detects the simple cases and flags those where you need to add an entry by hand. A good approach is to loop through the BMP, looking at what unicodedata.name(unichr(the_ordinal), "") produces.

Auto discovery: re.match("LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ", name)

Otherwise, if you get a match with "LATIN (SMALL|CAPITAL) LETTER [A-Z].+", you will need to add the mapping manually.
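The scan described above could be sketched in Python 3 along these lines. The function name build_tables, its parameters, and the choice to start at U+0080 are assumptions made for the illustration; the two patterns are the ones given above.

```python
import re
import unicodedata

SIMPLE = re.compile(r"LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ")
ANY_LATIN = re.compile(r"LATIN (SMALL|CAPITAL) LETTER [A-Z].+")

def build_tables(start=0x80, stop=0x10000):
    """Scan the BMP and split Latin letters into auto-detected and manual cases."""
    auto = {}    # ordinal -> ASCII base letter taken from the character name
    manual = []  # (ordinal, name) pairs that need a hand-written entry
    for ordinal in range(start, stop):
        name = unicodedata.name(chr(ordinal), "")  # "" for unnamed code points
        match = SIMPLE.match(name)
        if match:
            case, letter = match.groups()
            auto[ordinal] = letter if case == "CAPITAL" else letter.lower()
        elif ANY_LATIN.match(name):
            manual.append((ordinal, name))
    return auto, manual

auto, manual = build_tables()
print(auto[ord("ä")], auto[ord("Ê")])                    # a E
print([name for o, name in manual if o == ord("ß")])     # ['LATIN SMALL LETTER SHARP S']
```

The manual list is only a to-do list: someone still has to decide what each of those letters should become.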

Important note: unicode.translate takes "a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None" ... so you can replace, for example, capital THORN with "Th".
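In Python 3 the same kind of mapping is passed to str.translate. A small sketch, with a hand-written table chosen only for illustration (the question maps ß to "s"; "ss" is used here to show a multi-character replacement):

```python
# Keys are ordinals; values may be an ordinal, a string of any length, or None.
table = {
    ord("ä"): "a",
    ord("é"): ord("e"),   # ordinal -> ordinal
    ord("Þ"): "Th",       # one character expands to two
    ord("ß"): "ss",
    ord("\u00ad"): None,  # soft hyphen is deleted
}

print("Þing, Käse, café, Straße".translate(table))  # Thing, Kase, cafe, Strasse
```

Unlike a chain of replace calls, translate makes a single pass over the string, and characters missing from the table are left unchanged.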

This is why using unicodedata.normalize is not a good idea:

Characters whose normalized form does not start with an ASCII character are deleted. This includes not only punctuation and symbols (which you may not care about), but also letters that are NOT "accented", for example ß.

    >>> from unicodedata import name, normalize
    >>> for i in range(0xA0, 0x100):
    ...     c = unichr(i)
    ...     a = normalize('NFKD', c).encode('ascii', 'ignore')
    ...     if not a:
    ...         print("FAIL: U+%04X %s" % (i, name(c)))
    ...
    FAIL: U+00A1 INVERTED EXCLAMATION MARK
    FAIL: U+00A2 CENT SIGN
    FAIL: U+00A3 POUND SIGN
    FAIL: U+00A4 CURRENCY SIGN
    FAIL: U+00A5 YEN SIGN
    FAIL: U+00A6 BROKEN BAR
    FAIL: U+00A7 SECTION SIGN
    FAIL: U+00A9 COPYRIGHT SIGN
    FAIL: U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    FAIL: U+00AC NOT SIGN
    FAIL: U+00AD SOFT HYPHEN
    FAIL: U+00AE REGISTERED SIGN
    FAIL: U+00B0 DEGREE SIGN
    FAIL: U+00B1 PLUS-MINUS SIGN
    FAIL: U+00B5 MICRO SIGN
    FAIL: U+00B6 PILCROW SIGN
    FAIL: U+00B7 MIDDLE DOT
    FAIL: U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    FAIL: U+00BF INVERTED QUESTION MARK
    FAIL: U+00C6 LATIN CAPITAL LETTER AE
    FAIL: U+00D0 LATIN CAPITAL LETTER ETH
    FAIL: U+00D7 MULTIPLICATION SIGN
    FAIL: U+00D8 LATIN CAPITAL LETTER O WITH STROKE
    FAIL: U+00DE LATIN CAPITAL LETTER THORN
    FAIL: U+00DF LATIN SMALL LETTER SHARP S <<<<<<<<<<========== ß
    FAIL: U+00E6 LATIN SMALL LETTER AE
    FAIL: U+00F0 LATIN SMALL LETTER ETH
    FAIL: U+00F7 DIVISION SIGN
    FAIL: U+00F8 LATIN SMALL LETTER O WITH STROKE
    FAIL: U+00FE LATIN SMALL LETTER THORN
    >>>
+4

Maybe something like this will work? I have not tried, but this seems like a simple solution.

    for key in string:
        if key in dict:
            string = string.replace(key, dict[key])
0

Any solution that ignores the encoding of the source file and the encoding of the input is not formally correct, even if it happens to work.

First make sure you know the encoding of the input, then make it match the encoding in which you wrote the mapping, and everything will be fine. You can use unicode as the internal representation and decode the known input encoding into it.
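A small Python 3 sketch of why the input encoding matters. The sample text, the two-entry mapping, and the choice of UTF-8 and Latin-1 are assumptions picked for the demonstration.

```python
conversion = {ord("ä"): "a", ord("ü"): "u"}

# Stand-in for bytes read from a file or socket that really are UTF-8.
raw = "Käse für alle".encode("utf-8")

good = raw.decode("utf-8")         # decoded with the encoding the data actually uses
print(good.translate(conversion))  # Kase fur alle

wrong = raw.decode("latin-1")      # wrong guess: every byte becomes its own character
print(wrong.translate(conversion) == wrong)  # True, nothing matched the mapping
```

With the wrong decoding, the accented letters never appear as single characters in the string, so no dictionary lookup can find them.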

Try reading this for background.

0

Source: https://habr.com/ru/post/908657/
