How do I convert characters like "a³ a¡ a'a§" back to proper Unicode using Python?

I wrote a scraper that extracts the text from HTML pages using BeautifulSoup.

When I open a URL with urllib2, Portuguese accented characters in the page, such as "ã é é õ", come back mangled as "a³ a¡ a'a§".

I just want the words without accents:

contrã¡rio → contrario

I tried the function below, but it only works when the text already contains correctly decoded words like "olá coração contrário":

```python
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
```
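The "a³ a¡" garbage suggests classic mojibake: UTF-8 bytes decoded with the wrong codec (e.g. Latin-1), so strip_accents sees "Ã¡" instead of "á" and has nothing to strip. A minimal sketch, assuming that is the failure mode (fix_mojibake is a hypothetical helper name; if your text was mangled differently, the encode/decode pair will need different codecs):

```python
import unicodedata

def strip_accents(s):
    # Decompose accented characters, then drop the combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def fix_mojibake(s):
    # Undo the wrong decoding: recover the original UTF-8 bytes via Latin-1,
    # then decode them as the UTF-8 they really were.
    return s.encode('latin-1').decode('utf-8')

broken = 'contr\xc3\xa1rio'   # "contrário" mis-decoded as Latin-1: "contrÃ¡rio"
fixed = fix_mojibake(broken)  # "contrário"
print(strip_accents(fixed))   # -> contrario
```

The key point is that the repair has to happen before accent stripping; once the text is correctly decoded, the original strip_accents works as intended.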
2 answers

First, make sure your crawler gives you the HTML as Unicode text (for example, Scrapy has a response.body_as_unicode() method that does exactly that).
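With plain urllib you can approximate what Scrapy does by reading the charset from the response's Content-Type header before decoding. A sketch under that assumption (decode_body is a hypothetical helper; urllib.request exposes the headers as an email.message.Message, so a synthetic one is used here to keep the example offline):

```python
from email.message import Message  # the type behind response.headers in urllib.request

def decode_body(body, headers):
    # get_content_charset() reads the charset from the Content-Type header,
    # if the server declared one; otherwise fall back to UTF-8.
    charset = headers.get_content_charset() or 'utf-8'
    return body.decode(charset, errors='replace')

# Against a live response this would be:
#   response = urllib.request.urlopen(url)
#   html = decode_body(response.read(), response.headers)
h = Message()
h['Content-Type'] = 'text/html; charset=iso-8859-1'
print(decode_body('coração'.encode('iso-8859-1'), h))  # -> coração
```

Note that servers sometimes lie or omit the charset entirely, in which case you would still need to fall back to a `<meta>` tag or a detection library.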

Once you have Unicode text, you can convert it to its closest ASCII equivalent with Unidecode: http://pypi.python.org/pypi/Unidecode/0.04.1

```python
from unidecode import unidecode
print(unidecode(u"\u5317\u4EB0"))
```

"Bei Jing" is displayed


You have byte data; you need Unicode data. Shouldn't the library decode it for you? It really should, because only the library sees the HTTP headers, and without them you have no way of knowing the encoding.

EDIT

As strange as it sounds, it appears that Python does not decode the content for you in its web library. If you run this program:

```python
#!/usr/bin/env python
import re
import urllib.request
import io
import sys

for s in ("stdin", "stdout", "stderr"):
    setattr(sys, s, io.TextIOWrapper(getattr(sys, s).detach(), encoding="utf8"))

print("Seeking r\xe9sum\xe9s")
response = urllib.request.urlopen('http://nytimes.com/')
content = response.read()
match = re.search(".*r\xe9sum\xe9.*", content, re.I | re.U)
if match:
    print("success: " + match.group(0))
else:
    print("failure")
```

You get the following result:

```
Seeking résumés
Traceback (most recent call last):
  File "ur.py", line 16, in <module>
    match = re.search(".*r\xe9sum\xe9.*", content, re.I | re.U)
  File "/usr/local/lib/python3.2/re.py", line 158, in search
    return _compile(pattern, flags).search(string)
TypeError: can't use a string pattern on a bytes-like object
```

This means that .read() returns raw bytes, not a real string. Perhaps there is something in the documentation for urllib.request that I'm missing, but I can't believe they really expect you to dig through .info() and the <meta> tags, pick an encoding yourself, and then decode the bytes manually just to get a real string. That would be very lame! I hope I'm wrong, but I spent a good while looking and found nothing useful here.
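For what it's worth, the search does succeed as soon as the bytes are explicitly decoded; here is a minimal offline illustration of exactly that manual step (the encoding is assumed to be UTF-8, which is the guesswork being complained about):

```python
import re

# What .read() hands you: raw bytes, not text.
content = "Seeking r\xe9sum\xe9s".encode('utf-8')

# re.search with a str pattern on bytes raises TypeError (as in the traceback).
# Decoding first, with an encoding you had to determine yourself, fixes it:
text = content.decode('utf-8')
match = re.search(r"r\xe9sum\xe9", text, re.I)
print(match.group(0))  # -> résumé
```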

Compare how easy it is to do the equivalent in Perl:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;

binmode(STDOUT, "utf8");
print("Seeking r\xe9sum\xe9s\n");

my $agent = LWP::UserAgent->new();
my $response = $agent->get("http://nytimes.com/");
if ($response->is_success) {
    my $content = $response->decoded_content;
    if ($content =~ /.*r\xe9sum\xe9.*/i) {
        print("search success: $&\n");
    } else {
        print("search failure\n");
    }
} else {
    print "request failed: ", $response->status_line, "\n";
}
```

Which, when run, dutifully produces:

```
Seeking résumés
search success: <li><a href="http://hiring.nytimes.monster.com/products/resumeproducts.aspx">Search Résumés</a></li>
```

Are you sure you need to do this in Python? See how much richer and more convenient Perl's LWP::UserAgent and HTTP::Response classes are than the equivalent Python classes. Try it and see what I mean.

Plus, with Perl you get better Unicode support, such as full grapheme support, which Python currently lacks. Given that you were trying to strip diacritics, that looks like another plus.

```perl
use Unicode::Normalize;
($unaccented = NFD($original)) =~ s/\pM//g;
```

Just a thought.


Source: https://habr.com/ru/post/1369586/

