How to get the correct list of Unicode characters in Python

I am trying to search for emoticons in Python strings. For example:

```python
em_test = ['\U0001f680']
print(em_test)
# ['πŸš€']

test = 'This is a test string πŸ’°πŸ’°πŸš€'
if any(x in test for x in em_test):
    print("yes, the emoticon is there")
else:
    print("no, the emoticon is not there")
# yes, the emoticon is there
```

So when I search for the characters in em_test in

```
'This is a test line πŸ’°πŸ’°πŸš€'
```

I can indeed find them.

So I created a CSV file with all the emoticons I want to detect, listed by their Unicode escape sequences. The CSV looks like this:

```
\U0001F600
\U0001F601
\U0001F602
\U0001F923
```

but when I import and print it, I don't get the emoticons, just their textual representation:

```python
['\\U0001F600', '\\U0001F601', '\\U0001F602', '\\U0001F923', ... ]
```

so I can't use this list to search for the emoticons in another string. I know that a double backslash is just the printed representation of a single backslash, but for some reason the escape sequences are not being interpreted as Unicode characters. I don't know what I am missing.
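To illustrate what I mean: the string I read back from the CSV is ten literal characters, not a single emoji:

```python
s = '\\U0001F600'   # what I get back from the CSV: ten literal characters
e = '\U0001F600'    # the actual single emoji character
print(len(s), len(e), s == e)  # 10 1 False
```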

Any suggestions?

2 answers

You can decode these Unicode escape sequences with .decode('unicode-escape'). However, .decode is a bytes method, so if these sequences are text rather than bytes, you first need to encode them to bytes. Alternatively, you can open the CSV file in binary mode so the sequences are read as bytes rather than text strings.

Just for fun, I also use unicodedata to get the names of these emojis.

```python
import unicodedata as ud

emojis = [
    '\\U0001F600',
    '\\U0001F601',
    '\\U0001F602',
    '\\U0001F923',
]

for u in emojis:
    s = u.encode('ASCII').decode('unicode-escape')
    print(u, ud.name(s), s)
```

Output

```
\U0001F600 GRINNING FACE πŸ˜€
\U0001F601 GRINNING FACE WITH SMILING EYES 😁
\U0001F602 FACE WITH TEARS OF JOY πŸ˜‚
\U0001F923 ROLLING ON THE FLOOR LAUGHING 🀣
```

This should be much faster than using ast.literal_eval. And if you read the data in binary mode, it will be faster still, since that avoids the initial decoding step when reading the file and eliminates the call to .encode('ASCII').
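A minimal sketch of the binary-mode approach ('emojis.csv' is a stand-in file name; the first step just recreates a file like the one in the question):

```python
# Recreate a sample CSV of escape sequences (stand-in for the real file).
with open('emojis.csv', 'w') as f:
    f.write('\\U0001F600\n\\U0001F601\n')

# Opened in binary mode, each line is already bytes,
# so no .encode('ASCII') step is needed before decoding.
with open('emojis.csv', 'rb') as f:
    emojis = [line.strip().decode('unicode-escape') for line in f]

print(emojis)  # ['πŸ˜€', '😁']
```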

You can make the decoding a little more robust by using

```python
u.encode('Latin1').decode('unicode-escape')
```

but this is not necessary for your emoji data. And, as I said earlier, it would be better still to open the file in binary mode, which avoids the need to encode at all.
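To show where Latin1 helps (a sketch, assuming the file might contain non-ASCII text outside the escape sequences):

```python
s = 'café \\U0001F680'
# s.encode('ASCII') would raise UnicodeEncodeError because of the 'é';
# Latin1 maps code points 0-255 straight to bytes, so it round-trips safely.
decoded = s.encode('Latin1').decode('unicode-escape')
print(decoded)  # café πŸš€
```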


1. Keeping your CSV as is:

It's a somewhat heavyweight solution, but using ast.literal_eval works:

```python
import ast

s = '\\U0001F600'
x = ast.literal_eval('"{}"'.format(s))
print(hex(ord(x)))
print(x)
```

I get 0x1f600 (the correct code point) and the emoji character (πŸ˜€). (I had to copy/paste the character from my console into this answer because of a console problem on my end, but it works.)

Just surrounding the sequence with quotes lets ast parse the input as a string literal.
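Putting it together, reading the escape-style CSV with ast.literal_eval might look like this ('emojis.csv' is a hypothetical file name; the first step recreates it):

```python
import ast
import csv

# Recreate a sample CSV like the one in the question (hypothetical file name).
with open('emojis.csv', 'w', newline='') as f:
    f.write('\\U0001F600\n\\U0001F923\n')

with open('emojis.csv', newline='') as f:
    emojis = [ast.literal_eval('"{}"'.format(row[0])) for row in csv.reader(f)]

print(emojis)  # ['πŸ˜€', '🀣']
```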

2. Using character codes directly

You might be better off storing the character codes themselves instead of the \U escape format:

```python
print(chr(0x1F600))
```

does the same thing (so ast is slight overkill here).

Your CSV could then contain:

```
0x1F600
0x1F601
0x1F602
0x1F923
```

Then chr(int(row[0], 16)) will do the trick when reading. For example, if there is one code per line in the CSV (in the first column):

```python
import csv

with open("codes.csv") as f:
    cr = csv.reader(f)
    codes = [chr(int(row[0], 16)) for row in cr]
```

Source: https://habr.com/ru/post/1273320/

