Scrapy's re() method does not work with Unicode strings

I am working on Windows 7 and using the Scrapy interactive console (based on IPython).

I am following the "Trying Selectors in the Shell" step of the tutorial.

If I fetch a site whose title consists of English letters, everything works as in the tutorial:

 In [5]: hxs.select('//title/text()').re('(\w+):')
 Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

But if I fetch a site whose title contains non-English letters (Russian, Unicode), the re() method returns an empty list:

 In [25]: hxs.select('//title/text()').re('(\w+)')
 Out[25]: []

The title does contain text; it is not empty:

 In [24]: hxs.select('//title/text()').extract()
 Out[24]: [u'\u041b\u043e\u043a\u0430\u0446\u0438\u043e\u043d\u043d\u044b\u0439 \u043f\u043e\u0438\u0441\u043a \u0430\u0431\u043e\u043d\u0435\u043d\u0442\u043e\u0432']

How can I use Scrapy's re() with Unicode characters?

1 answer

It seems that Scrapy does not use the re.UNICODE flag for its regular expressions, so \w does not include all the "word" characters defined in Unicode.
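The effect of the flag can be checked with the standard re module alone, without Scrapy. The sketch below is an illustration, not Scrapy's own code. It assumes Python 3, where str patterns are Unicode-aware by default, so it uses re.ASCII to imitate the Python 2 behavior of a pattern compiled without re.UNICODE. The variable names are made up for the example; the title string is the one from the question.

```python
import re

# Title text from the question ("Локационный поиск абонентов").
title = (u'\u041b\u043e\u043a\u0430\u0446\u0438\u043e\u043d\u043d\u044b\u0439 '
         u'\u043f\u043e\u0438\u0441\u043a '
         u'\u0430\u0431\u043e\u043d\u0435\u043d\u0442\u043e\u0432')

# With re.UNICODE, \w matches any Unicode word character, including Cyrillic.
unicode_words = re.compile(r'(\w+)', re.UNICODE).findall(title)

# re.ASCII (Python 3 only) restricts \w to [a-zA-Z0-9_]. This imitates what
# Python 2 does by default when re.UNICODE is not passed.
ascii_words = re.compile(r'(\w+)', re.ASCII).findall(title)

print(len(unicode_words))  # 3
print(ascii_words)         # []
```

The ASCII-only pattern finds nothing in the Cyrillic title, which matches the empty list seen in the question.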

The docs seem to indicate that Scrapy's .re() can accept an already compiled regular expression, so you can try compiling your regular expression with the UNICODE flag:

 import re
 hxs.select('//title/text()').re(re.compile('(\w+)', re.UNICODE))

Source: https://habr.com/ru/post/1399911/

