I am using XPath with Scrapy to scrape data from the movie website BoxOfficeMojo.com.
As a general question: I'm wondering how to select specific child nodes of the same parent node in a single XPath expression.
Depending on the movie page I am scraping, the data I need is sometimes located in different child nodes, for example depending on whether there is a link or not. I will be going through about 14,000 movies, so this process needs to be automated.
Take this as an example. I need the actors, director(s) and producer(s).
This is the XPath for the director. Note: %s corresponds to the index of the row where this information is found. In the Action Jackson example, the director is at [1] and the actors are at [2].
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()
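To illustrate the placeholder, here is a minimal sketch that fills %s with a row index using ordinary Python string formatting. The names FONT_TEXT_XPATH and row_xpath are made up for this example; the indexes 1 and 2 are the ones mentioned for the example movie.

```python
# Template from the post; %s is replaced by the position of the table row.
FONT_TEXT_XPATH = '//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()'


def row_xpath(template, row_index):
    """Return the XPath for one table row (hypothetical helper)."""
    return template % (row_index,)


director_xpath = row_xpath(FONT_TEXT_XPATH, 1)  # director row in the example
actor_xpath = row_xpath(FONT_TEXT_XPATH, 2)     # actor row in the example
print(director_xpath)
print(actor_xpath)
```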
However, if a link exists on the page for the director, the XPath would be:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()
Actors are a little more complicated, since <br> elements are included for the subsequent actors listed, and the names may be children of an /a or children of the parent /font, therefore:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()
This gets most of the actors (except those with font/br).
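To make the difference between the linked and unlinked names concrete, here is a minimal sketch run against a made-up, simplified row. It uses the standard library's xml.etree.ElementTree instead of Scrapy, and because that module's XPath subset has no text(), the two text lookups are emulated by hand. The markup, the actor names and the names SAMPLE and ROW are assumptions for illustration, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one linked actor followed by two unlinked
# actors separated by <br/>. The real page may be structured differently.
SAMPLE = """
<html><body>
<div class="mp_box_content"><table>
<tr><td>Director:</td><td><font>Some Director</font></td></tr>
<tr><td>Actors:</td><td><font><a href="/a1">Linked Actor</a><br/>Plain Actor One<br/>Plain Actor Two</font></td></tr>
</table></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)
ROW = 2  # actor row in this made-up sample

# Select the <font> element of the chosen row.
font = root.find('.//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font' % ROW)

# Rough equivalent of font//a/text(): the text of every <a> below <font>.
linked = [a.text for a in font.iter('a') if a.text]

# Rough equivalent of font/text(): text nodes directly under <font>, i.e. the
# leading text plus whatever follows each child element (its "tail").
# Whitespace-only nodes are dropped here, which XPath itself would not do.
plain = [font.text] + [child.tail for child in font]
plain = [t.strip() for t in plain if t and t.strip()]

print(linked)
print(plain)
```

In this sample the linked name only shows up in the first list and the names after each <br/> only in the second, which is why neither expression alone returns every actor.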
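All of these expressions start from //div[@class="mp_box_content"], that is, from the mp_box_content box. To cover the different cases I currently rely on try: and except: blocks. For example, this is my Scrapy code for the actors: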
# Names wrapped in links (<a>) anywhere below <font>
actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
try:
    # Names that are plain text nodes directly under <font>
    second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
    for n in second:
        actors.append(n)
except Exception:
    # Fall back to the plain text nodes only
    actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()