How to remove unicode in a list

I want to remove the Unicode marker (the u prefix) from the strings in a list such as airports = [u'KATL', u'KCID']

Expected output:

[KATL, KCID]

I followed the linked question below:

Separate all items in a list of strings

I tried one of the solutions:

my_list = ['this\n', 'is\n', 'a\n', 'list\n', 'of\n', 'words\n']

map(str.strip, my_list)
['this', 'is', 'a', 'list', 'of', 'words']

but received the following error:

TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'

3 answers

First, I highly recommend switching to Python 3, which treats Unicode strings as first-class citizens (all strings are Unicode strings, and the type is simply called str).

But if you need to make it work in Python 2, you can strip unicode strings with unicode.strip (if your strings really are unicode objects):

 >>> lst = [u'KATL\n', u'KCID\n']
 >>> map(unicode.strip, lst)
 [u'KATL', u'KCID']

If your unicode strings are limited to a subset of ASCII, you can convert them to str with:

 >>> lst = [u'KATL', u'KCID']
 >>> map(str, lst)
 ['KATL', 'KCID']

Note that this conversion fails for strings containing non-ASCII characters (it raises UnicodeEncodeError). To encode Unicode code points as str (a string of bytes), you need to choose an encoding (usually UTF-8) and call the .encode() method on your strings:

 >>> lst = [u'KATL', u'KCID']
 >>> map(lambda x: x.encode('utf-8'), lst)
 ['KATL', 'KCID']
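
For comparison, here is a minimal sketch of the same steps under Python 3, where every str is already Unicode and there is no u prefix to remove. The airport codes come from the question; the non-ASCII sample value is an assumption, added only to show where an ASCII conversion fails.

```python
# Python 3 sketch: str is Unicode, so strip works directly on each item.
lst = ['KATL\n', 'KCID\n']
stripped = [s.strip() for s in lst]
print(stripped)  # ['KATL', 'KCID']

# Converting to bytes always needs an explicit encoding.
encoded = [s.encode('utf-8') for s in stripped]
print(encoded)  # [b'KATL', b'KCID']

# Hypothetical non-ASCII value (not from the question): 'Zürich'.
city = 'Z\u00fcrich'
try:
    city.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    # ASCII cannot represent the u-umlaut, so encoding fails.
    ascii_ok = False
print(ascii_ok)  # False

# UTF-8 can represent it, using two bytes for the one character.
utf8_city = city.encode('utf-8')
print(utf8_city)  # b'Z\xc3\xbcrich'
```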

The simplest option is a list comprehension:

 [s.strip() for s in my_list] 

If you want to use map, I would use a lambda so that each object calls its own strip method, rather than requiring the strip method of one particular type.

 map(lambda s: s.strip(), my_list) 
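
A short sketch of why the lambda form is more flexible, assuming Python 3 (where map returns a lazy iterator, so it is wrapped in list here). The mixed list of text and byte strings is a made-up example, not taken from the question.

```python
my_list = ['this\n', 'is\n', 'a\n', 'list\n']

# List comprehension: each element calls its own strip method.
by_comprehension = [s.strip() for s in my_list]

# map with a lambda does the same; list() materializes the iterator.
by_map = list(map(lambda s: s.strip(), my_list))
print(by_comprehension == by_map)  # True

# Hypothetical mixed input: a text string next to a byte string.
mixed = ['KATL\n', b'KCID\n']
mixed_stripped = [s.strip() for s in mixed]
print(mixed_stripped)  # ['KATL', b'KCID']

# Binding to one type's method fails as soon as another type appears.
try:
    list(map(str.strip, mixed))
    bound_ok = True
except TypeError:
    bound_ok = False
print(bound_ok)  # False
```

This is the same kind of failure as the TypeError in the question: str.strip was handed an object of a different string type.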

The only reliable way to convert a unicode string to a byte string is to encode it with a suitable encoding (the most common are ASCII, Latin-1, and UTF-8). By design, UTF-8 can encode any Unicode character, but non-ASCII characters take more than one byte, so the length in bytes will no longer equal the number of (Unicode) characters. Latin-1 can represent most characters of Western European languages at one byte per character, and ASCII is the set of characters that is always represented correctly.

If you need to process strings containing characters that cannot be represented in the chosen encoding, you can pass errors='ignore' to simply drop them, or errors='replace' to substitute a replacement character, usually ?.

So, if I understand your requirement correctly, you can convert the list of unicode strings to a list of byte strings with:

 [ x.encode('ascii', errors='replace') for x in my_list ] 
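
To make the difference between the error handlers and the encodings concrete, here is a small sketch, assuming Python 3. The sample string is illustrative and not from the question.

```python
text = 'caf\u00e9'  # 'café': four characters, one of them non-ASCII

ignored = text.encode('ascii', errors='ignore')    # drops the e-acute
replaced = text.encode('ascii', errors='replace')  # substitutes ?
latin1 = text.encode('latin-1')                    # one byte per character
utf8 = text.encode('utf-8')                        # e-acute takes two bytes

print(ignored)   # b'caf'
print(replaced)  # b'caf?'
print(len(text), len(latin1), len(utf8))  # 4 4 5
```

For plain ASCII input such as the airport codes, all of these calls produce the same bytes and neither error handler is triggered.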

Source: https://habr.com/ru/post/1270275/

