What about unidecode?
import unidecode

a = [u'Ich weiß es nicht!', u'¡No lo sé!', u'Ég veit ekki!']
for s in a:
    print(unidecode.unidecode(s).replace(' ', '_'))
This produces clean ASCII strings that are easy to process further if they still contain unwanted characters. Turning the various kinds of spaces into underscores keeps the result readable:
Ich_weiss_es_nicht!
!No_lo_se!
Eg_veit_ekki!
If uniqueness is a concern, a hash or something similar can be appended to the strings.
Edit:
Some clarification seems to be required regarding hashing. Many hash functions are explicitly designed to produce very different outputs for similar inputs. For example, the Python built-in hash() function gives:
In [1]: hash('¡No lo sé!')
Out[1]: 6428242682022633791

In [2]: hash('¡No lo se!')
Out[2]: 4215591310983444451
(Note that for strings, hash() is salted per process unless PYTHONHASHSEED is set, so the exact values will differ between runs.) With this you can do something like
unidecode.unidecode(s).replace(' ', '_') + '_' + str(hash(s))[:10]
to keep the names reasonably short. Even with such truncated hashes, collisions are quite unlikely.
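Putting the pieces together, here is a minimal sketch of such a helper. It uses hashlib.md5 instead of the built-in hash(), since hashlib digests are stable across runs and machines; the function name slugify is just an illustrative choice:

```python
import hashlib
import unidecode

def slugify(s):
    """Transliterate to ASCII, replace spaces with underscores,
    and append a short stable hash to keep distinct inputs distinct."""
    ascii_part = unidecode.unidecode(s).replace(' ', '_')
    # md5 here is used only as a cheap stable fingerprint, not for security
    digest = hashlib.md5(s.encode('utf-8')).hexdigest()[:10]
    return ascii_part + '_' + digest

for s in [u'Ich weiß es nicht!', u'¡No lo sé!', u'Ég veit ekki!']:
    print(slugify(s))
```

Ten hex digits give 16^10 ≈ 10^12 possible suffixes, so accidental collisions stay unlikely for any reasonably sized collection of names.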