You have byte data. You want Unicode data. Isn't the library supposed to decode it for you? It has to, because you don't have the HTTP headers and so you don't know the encoding.
EDIT
Bizarre as it sounds, it seems that Python does not support content decoding in its web library. If you run this program:
    #!/usr/bin/env python

    import re
    import urllib.request
    import io
    import sys

    for s in ("stdin","stdout","stderr"):
        setattr(sys, s, io.TextIOWrapper(getattr(sys, s).detach(), encoding="utf8"))

    print("Seeking r\xe9sum\xe9s")

    response = urllib.request.urlopen('http://nytimes.com/')
    content = response.read()

    match = re.search(".*r\xe9sum\xe9.*", content, re.I | re.U)
    if match:
        print("success: " + match.group(0))
    else:
        print("failure")
You get the following result:
    Seeking résumés
    Traceback (most recent call last):
      File "ur.py", line 16, in <module>
        match = re.search(".*r\xe9sum\xe9.*", content, re.I | re.U)
      File "/usr/local/lib/python3.2/re.py", line 158, in search
        return _compile(pattern, flags).search(string)
    TypeError: can't use a string pattern on a bytes-like object
This means that .read() returns raw bytes, not a real string. Maybe you can see something in the docs for the urllib.request class that I can't. I can't believe they really expect you to root around in what .info() returns and in the <meta> tags, work out the stupid encoding yourself, and then decode the bytes so that you have a real string. That would be very lame! I hope I'm wrong, but I spent a good while looking and did not find anything useful here.
Compare how easy it is to do the equivalent in Perl:
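If it does come down to doing it by hand, here is a minimal sketch of the header half of that chore. It is only an illustration under assumptions: `decode_body` is a hypothetical helper name of my own, the UTF-8 fallback is a guess and not something the server promises, and it ignores <meta> tags entirely. The byte string and header value below stand in for what `response.read()` and `response.info().get("Content-Type")` would give you.

```python
import re
from email.message import Message


def decode_body(raw, content_type, fallback="utf-8"):
    """Hypothetical helper: decode raw bytes using the charset named in a
    Content-Type header value, falling back to `fallback` (an assumption)."""
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    charset = msg.get_content_charset() or fallback
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The header named a codec Python does not know about.
        return raw.decode(fallback, errors="replace")


# Stand-ins for response.read() and the Content-Type header of a real response.
raw = "<li>Search R\xe9sum\xe9s</li>".encode("iso-8859-1")
text = decode_body(raw, "text/html; charset=ISO-8859-1")

# Now the pattern and the subject are both str, so the search works.
match = re.search(".*r\xe9sum\xe9.*", text, re.I)
print("success: " + match.group(0) if match else "failure")
```

With a live response, the object returned by `.info()` is itself a message object, so calling `get_content_charset()` on it directly should do the same job, but you still have to make that call and the decode yourself.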
    #!/usr/bin/env perl

    use strict;
    use warnings;

    use LWP::UserAgent;

    binmode(STDOUT, "utf8");

    print("Seeking r\xe9sum\xe9s\n");

    my $agent = LWP::UserAgent->new();
    my $response = $agent->get("http://nytimes.com/");

    if ($response->is_success) {
        my $content = $response->decoded_content;
        if ($content =~ /.*r\xe9sum\xe9.*/i) {
            print("search success: $&\n");
        } else {
            print("search failure\n");
        }
    } else {
        print "request failed: ", $response->status_line, "\n";
    }

Which, when run, dutifully produces:
    Seeking résumés
    search success: <li><a href="http://hiring.nytimes.monster.com/products/resumeproducts.aspx">Search Résumés</a></li>
Are you sure you need to do this in Python? Look at how much richer and more convenient the Perl LWP::UserAgent and HTTP::Response classes are than the equivalent Python classes. Check them out and see what I mean.
Plus, with Perl you get better Unicode support all around, such as full grapheme support, which Python currently lacks. Given that you were trying to strip out diacritics, that looks like another plus.
    use Unicode::Normalize;
    ($unaccented = NFD($original)) =~ s/\pM//g;
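For comparison, the same decompose-then-drop-marks idea can be sketched in Python with the standard unicodedata module. This is a rough equivalent under assumptions, not a claim about what you were doing: `strip_marks` is a hypothetical name, and it treats any character whose general category starts with "M" as a mark, which is what \pM matches in the Perl above.

```python
import unicodedata


def strip_marks(text):
    """Hypothetical helper: decompose to NFD, then drop every mark
    (general categories Mn, Mc, Me), mirroring s/\\pM//g after NFD()."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


print(strip_marks("Search R\xe9sum\xe9s"))
```

Note that this works one code point at a time. It does nothing about grapheme clusters, which is the part Python does not handle for you.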
Just a thought.