Scrapy's re() method does not work with Unicode strings

Question

I am working on Windows 7 in the Scrapy interactive console (based on IPython).

I am at the “Trying Selectors in the Shell” step of the tutorial.

If I scrape a site whose title uses English letters, everything works as in the tutorial:

    In [5]: hxs.select('//title/text()').re('(\w+):')
    Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

But if I scrape a site whose title uses non-English letters (Russian, Unicode), the re() method returns nothing:

    In [25]: hxs.select('//title/text()').re('(\w+)')
    Out[25]: []

The title does contain text; it is not empty:

    In [24]: hxs.select('//title/text()').extract()
    Out[24]: [u'\u041b\u043e\u043a\u0430\u0446\u0438\u043e\u043d\u043d\u044b\u0439 \u043f\u043e\u0438\u0441\u043a \u0430\u0431\u043e\u043d\u0435\u043d\u0442\u043e\u0432']

Is there a way to use Scrapy's re() with Unicode characters?

unicode scrapy

Doctor coder Mar 6 '12 at 2:18

1 answer

John flatness · Answer 1 · 2012-03-06T03:01:10+0000

It seems that Scrapy does not use the re.UNICODE flag for its regular expressions, so \w matches only ASCII letters, digits, and the underscore rather than all the "word" characters defined in Unicode.

The docs seem to indicate that Scrapy's .re() can accept an already compiled regular expression, so you can try compiling yours with the UNICODE flag:

    import re
    hxs.select('//title/text()').re(re.compile('(\w+)', re.UNICODE))
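The effect of the flag can be checked outside Scrapy with the standard re module alone. The following is only a sketch, written for Python 3, and it makes one assumption worth stating: in Python 3, str patterns are already Unicode-aware, so re.ASCII is used here to imitate the Python 2 default that the question runs into. The variable names are illustrative, and the title string is the one from the question.

```python
import re

# The page title from the question (Russian text).
title = (u'\u041b\u043e\u043a\u0430\u0446\u0438\u043e\u043d\u043d\u044b\u0439 '
         u'\u043f\u043e\u0438\u0441\u043a '
         u'\u0430\u0431\u043e\u043d\u0435\u043d\u0442\u043e\u0432')

# Assumption: re.ASCII stands in for the Python 2 default, where \w
# covers only [a-zA-Z0-9_] and therefore matches no Cyrillic letters.
ascii_words = re.findall(r'(\w+)', title, re.ASCII)

# With re.UNICODE, \w follows the Unicode definition of word characters.
unicode_pattern = re.compile(r'(\w+)', re.UNICODE)
unicode_words = unicode_pattern.findall(title)

print(ascii_words)
print(unicode_words)
```

If the ASCII-only list comes back empty while the Unicode one contains the three words of the title, the missing flag is the likely cause, and passing the compiled pattern to .re() as suggested above should help.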
