Python Encoding Issues

I have read a lot about Python encodings, maybe not enough, but after two days of work I am still stuck. I will try to be as clear as possible. The main goal is to remove all accents and special characters such as #, !, %, & ...

The thing is, I am doing a search on demand in the Twitter Search API with this call:

query = urllib2.urlopen(settings.SEARCH_URL + '?%s' % params) 

Then I call the avaliar_pesquisa() method to evaluate the results based on the input tags (or terms):

 dados = avaliar_pesquisa(simplejson.loads(query.read()), str(tags)) 

In avaliar_pesquisa(), the following happens:

    def avaliar_pesquisa(dados, tags):
        resultados = []
        # Iterate over the results
        for i in dados['results']:
            resultados.append({
                'texto': i['text'],
                'imagem': i['profile_image_url'],
                'classificacao': avaliar_texto(i['text'], tags),
                'timestamp': i['created_at'],
            })

Next comes avaliar_texto(), which evaluates the tweet text. The problem is definitely somewhere in the following lines:

    def avaliar_texto(texto, tags):
        # Remove accents
        from unicodedata import normalize

        def strip_accents(txt):
            return normalize('NFKD', txt.decode('utf-8'))

        # Split
        texto_split = strip_accents(texto)
        texto_split = texto.lower().split()

        # Remove non-alpha characters
        import re
        pattern = re.compile('[\W_]+')
        texto_aux = []
        for i in texto_split:
            texto_aux.append(pattern.sub('', i))
        texto_split = texto_aux

The splitting is irrelevant here. The thing is, if I print the type of the variable texto inside this last method, I get either str or unicode. If there is any accent in the text, it arrives as unicode. So I get this error when running the application, which receives at most 100 tweets per response:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 17: ordinal not in range(128)

For the following text:

Text: Agora o problema é com o speedy. <type 'unicode'>

Any ideas?

3 answers

This is what I used in my code to remove accents, etc.

 text = unicodedata.normalize('NFD', text).encode('ascii','ignore') 
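
A slightly fuller sketch of the same idea, assuming the input may arrive either as UTF-8 bytes or as text. The helper name strip_accents_ascii is hypothetical (it is not part of the question's code), and the block is written so that it should run under both Python 2.7 and Python 3:

```python
import unicodedata


def strip_accents_ascii(text):
    """Return `text` with accents removed and non-ASCII characters dropped."""
    # Assumption: byte strings are UTF-8 encoded; decode only those.
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    # NFD splits u'\xe9' into 'e' plus a combining acute accent (U+0301).
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with 'ignore' drops the combining marks, along with
    # any other character that has no ASCII equivalent.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(strip_accents_ascii(u'Agora o problema \xe9 com o speedy.'))
# prints: Agora o problema e com o speedy.
```

Note that 'ignore' silently discards characters that do not decompose to an ASCII base letter, so this only suits cases where losing them is acceptable.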

See this page.

The decode() method should be applied to a str object, not to a unicode object. When the input is already a unicode string, Python 2 first tries to encode it to str using the ascii codec and only then decodes it as UTF-8. It is that implicit encode step that fails.

Try return normalize('NFKD', unicode(txt)).
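
One way to make this robust for both input types is to decode only when the value is actually a byte string. The sketch below is an illustration under assumptions, not the asker's code: to_text is a hypothetical helper, and byte strings are assumed to be UTF-8. It decodes explicitly because unicode(txt) without an encoding argument still uses the ascii codec for str input:

```python
from unicodedata import combining, normalize


def to_text(value, encoding='utf-8'):
    """Decode byte strings; return text values unchanged."""
    # Assumption: any byte string handed to us is encoded as `encoding`.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def strip_accents(txt):
    # Decompose accented characters, then drop the combining marks
    # that the decomposition leaves behind.
    decomposed = normalize('NFKD', to_text(txt))
    return u''.join(ch for ch in decomposed if not combining(ch))


# Works for text and for UTF-8 bytes alike.
print(strip_accents(u'problema \xe9'))
print(strip_accents(b'problema \xc3\xa9'))
```

Unlike encoding to ASCII with 'ignore', this variant keeps non-ASCII characters that have no accent to strip.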


Try placing:

 # -*- coding: utf-8 -*- 

at the beginning of the Python script containing the code.
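
For context, this declaration (PEP 263) tells the interpreter how to decode the source file itself, so it matters for non-ASCII literals written in the script. It does not change how str and unicode values are converted at run time, so it may not resolve the error above on its own. A minimal sketch, where the variable name exemplo is only illustrative:

```python
# -*- coding: utf-8 -*-
# With the declaration above, this literal is read as UTF-8 source text.
exemplo = u'Agora o problema é com o speedy.'

# The accented letter is the single code point U+00E9 at index 17,
# the same position reported in the traceback.
print(repr(exemplo[17]))
```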


Source: https://habr.com/ru/post/1369082/

