I have read a lot about Python string handling, though maybe not enough. I have been working on this for two days and I am still stuck, so I will try to be as clear as possible. The main goal is to remove all accents and special characters such as #, !, %, and & from text.
I run an on-demand search against the Twitter Search API with this call:
query = urllib2.urlopen(settings.SEARCH_URL + '?%s' % params)
Then I call the avaliar_pesquisa() function to evaluate the results against the input tags (or terms):
dados = avaliar_pesquisa(simplejson.loads(query.read()), str(tags))
In avaliar_pesquisa(), the following happens:
def avaliar_pesquisa(dados, tags):
    resultados = []
Look at avaliar_texto(), which evaluates the tweet text. The problem is definitely somewhere in the following lines:
def avaliar_texto(texto, tags):
    # Remove accents
    from unicodedata import normalize
    def strip_accents(txt):
        return normalize('NFKD', txt.decode('utf-8'))
    # Split
    texto_split = strip_accents(texto)
    texto_split = texto.lower().split()
    # Remove non-alpha characters
    import re
    pattern = re.compile('[\W_]+')
    texto_aux = []
    for i in texto_split:
        texto_aux.append(pattern.sub('', i))
    texto_split = texto_aux
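For comparison, here is a minimal sketch of the cleanup these lines appear to be aiming for: decompose accented letters, drop the accents, lowercase, split, and strip non-alphanumeric characters. The names clean_words and NON_ALNUM are hypothetical and not part of my code. Decoding only byte strings, dropping combining marks after NFKD, and discarding empty words are assumptions about the intended behaviour, not a confirmed fix.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

# Hypothetical helper pattern: runs of non-word characters and underscores.
NON_ALNUM = re.compile(r"[\W_]+", re.UNICODE)


def strip_accents(text):
    # Assumption: byte strings are UTF-8; unicode text is left undecoded.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    decomposed = unicodedata.normalize("NFKD", text)
    # NFKD splits an accented letter into base letter + combining mark;
    # keeping only non-combining characters removes the accent.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_words(text):
    # Hypothetical wrapper: accent-free, lowercase words without punctuation.
    words = strip_accents(text).lower().split()
    cleaned = [NON_ALNUM.sub(u"", word) for word in words]
    # Assumption: words that become empty after cleaning are dropped.
    return [word for word in cleaned if word]


print(clean_words(u"Agora o problema \u00e9 com o speedy."))
```

Unlike the lines above, this sketch uses the result of strip_accents() instead of going back to the original texto for the split.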
The splitting is irrelevant here. The thing is, if I check the type of the variable texto inside this last function, I get either str or unicode. If the text contains any accent, it arrives as unicode. So I get this error when running the application on a response of up to 100 tweets:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 17: ordinal not in range(128)
For the following text:
Text: Agora o problema é com o speedy. <type 'unicode'>
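The position reported in the error can be checked against this text. The following is only a small sketch, assuming the failure is an ASCII encode applied to the unicode text (Python 2 does this implicitly when .decode('utf-8') is called on an object that is already unicode). The helper first_non_ascii is hypothetical.

```python
# -*- coding: utf-8 -*-

def first_non_ascii(text):
    # Hypothetical helper: index of the first character that ASCII
    # cannot encode, or None if the whole text is plain ASCII.
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc.start
    return None


text = u"Agora o problema \u00e9 com o speedy."
position = first_non_ascii(text)
print(position, repr(text[position]))
```

The index it reports lines up with the accented character in the tweet, which is the same character named in the error.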
Any ideas?