XPath: select specific child nodes

I am using XPath with Scrapy to scrape data from the movie website BoxOfficeMojo.com.

As a general question: how can I select specific child nodes of the same parent node in a single XPath expression?

Depending on the movie page I am scraping, the data I need is sometimes located in different child nodes, for example depending on whether or not there is a link. I will be going through about 14,000 movies, so this process has to be automated.

Taking one movie page as an example, I need the actor(s), director(s) and producer(s).

This is the XPath for the director. Note: %s corresponds to the row index where this information is found; in the Action Jackson example, the director is at [1] and the actors are at [2].

 //div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()

However, if the director's name links to a page about the director, the XPath becomes:

 //div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()

Actors are a little trickier, because a <br> is included for each subsequent actor, and the names may be children of an /a or children of the parent /font, so:

 //div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()

This gets almost all of the actors (except those under font/br).

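The main problem, I believe, is that there are multiple //div[@class="mp_box_content"] elements, so I also end up picking up values from other mp_box_content blocks. I have also added several try:, except: statements to cover every case (actors, directors and producers, with and without links). For example, this is my Scrapy code for actors: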

 actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
 try:
     second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
     for n in second:
         actors.append(n)
 except:
     actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()



Here is a suggestion that uses lxml.html (and a bit of lxml.etree) directly.

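  • First, select the td[2] cells by the text content of td[1], with expressions like .//tr[starts-with(td[1], "Director")]/td[2], which match both "Director" and "Directors".

  • Second, instead of testing separate expressions with or without <font>, with or without <a>, and so on, and since only the text content matters here, you can use string(.//tr[starts-with(td[1], "Actor")]/td[2]) or call lxml.html.tostring(e, method="text", encoding=unicode) on the selected elements.

  • For the <br> between multiple names, one approach is to modify the lxml tree and add a special marker character, for example a \n, to the <br> elements' .text or .tail, using one of lxml's iter() functions. This can also be useful for other HTML elements, such as <hr>.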
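
The label lookup and the <br> marker trick can be tried in isolation. The following is only a minimal sketch, not the spider itself: it swaps lxml for the standard library's xml.etree.ElementTree (so the starts-with() test is done in Python rather than in XPath), and the SAMPLE markup, category_cell and names_in are hypothetical names made up for illustration. It also assumes well-formed markup, which real pages may not provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the "The Players" table.
SAMPLE = """
<table>
  <tr><td>Director:</td><td><font><a href="#">Craig R. Baxley</a></font></td></tr>
  <tr><td>Actors:</td><td><font><a href="#">Carl Weathers</a><br/>Craig T. Nelson<br/><a href="#">Sharon Stone</a></font></td></tr>
</table>
""".strip()

MARKER = "|"


def category_cell(table, label):
    """Return the 2nd cell of the first row whose 1st cell text starts with label."""
    for row in table.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and "".join(cells[0].itertext()).strip().startswith(label):
            return cells[1]
    return None


def names_in(cell):
    """Split a cell's text into names, treating <br> elements as separators."""
    if cell is None:
        return []
    for br in cell.iter("br"):
        # Put the marker in front of whatever text follows the <br>.
        br.tail = MARKER + (br.tail or "")
    text = "".join(cell.itertext())
    return [name.strip() for name in text.split(MARKER) if name.strip()]


table = ET.fromstring(SAMPLE)
directors = names_in(category_cell(table, "Director"))
actors = names_in(category_cell(table, "Actor"))
producers = names_in(category_cell(table, "Producer"))
print(directors)
print(actors)
print(producers)
```

Because the lookup goes by label rather than by row index, a missing category (here, producers) simply yields an empty list instead of data from the wrong row.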

Here is some spider code to illustrate:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import lxml.etree
import lxml.html

MARKER = "|"
def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER

def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by adding a MARKER after <br> elements
        br2nl(tree)

        # use lxml .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)

class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]

    # locate 2nd cell by text content of first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath('.//tr[starts-with(td[1], $category)]/td[2]')
    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')

        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:

            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Producer")
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Composer")

            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)

            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers
            # here you should of course populate scrapy items


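You can do something similar with HtmlXPathSelector (for example, with the string() XPath function), but without modifying the tree for <br> (how would you do that with hxs?) it only works when the cell holds a single name: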

>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']
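
The names run together because string() concatenates every descendant text node with no separator. One possible alternative, sketched here under the assumption that each name sits in its own text node (never split across nested tags), is to select the text nodes individually; in XPath that would be td[2]//text(), which extract() returns as a list. The snippet below mimics both behaviours with the standard library's xml.etree.ElementTree on a hypothetical, well-formed CELL fragment rather than the live page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified copy of the actors cell.
CELL = (
    '<td><font><a href="#">Carl Weathers</a><br/>'
    'Craig T. Nelson<br/><a href="#">Sharon Stone</a></font></td>'
)
cell = ET.fromstring(CELL)

# Equivalent of string(td): all text nodes glued together.
joined = "".join(cell.itertext())

# Equivalent of td//text(): one entry per non-empty text node.
pieces = [text.strip() for text in cell.itertext() if text.strip()]

print(joined)
print(pieces)
```

This covers names with and without links in one pass, at the cost of relying on the one-name-per-text-node assumption stated above.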

Source: https://habr.com/ru/post/1616920/

