How to select all the text of child nodes, but exclude a tag, using Scrapy XPath?

Question

I have this html:

<div id="content">
  <h1>Title 1</h1><br><br>
  <h2>Sub-Title 1</h2>
  <br><br>
  Description 1.<br><br>Description 2.
  <br><br>
  <h2>Sub-Title 2</h2>
  <br><br>
  Description 1<br>Description 2<br>
  <br><br>
  <div class="infobox">
    <font style="color:#000000"><b>Information Title</b></font>
    <br><br>Long Information Text
  </div>
</div>

I want to get all the text in <div id="content"> using XPath in Scrapy, but excluding the content of <div class="infobox">, so the expected result looks like this:

 Title 1 Sub-Title 1 Description 1. Description 2. Sub-Title 2 Description 1. Description 2.

I haven't gotten to the exclusion part yet; I'm still trying to grab the text from <div id="content">.

I tried this:

 response.xpath('//*[@id="content"]/text()').extract()

But it only returns Description 1. and Description 2. from both subheadings.

Then I tried:

 response.xpath('//*[@id="content"]//*/text()').extract()

It returns only Title 1, Sub-Title 1, Sub-Title 2, Information Title and Long Information Text.

There are two questions here:

How can I get all the text of the children of the content div?
How can I exclude the infobox div from the selection?

python html xpath scrapy

Asked Dec 12 '14 at 20:24

1 answer

Mathias Müller · Accepted Answer · 2014-12-12T20:47:39+0000

Use the descendant:: axis to find the text nodes of all descendants, and state explicitly that the parent of these text nodes must not be a div[@class='infobox'] element.

Turning the above into an XPath expression:

 //div[@id = 'content']/descendant::text()[not(parent::div/@class='infobox')]

The result (tested with an online XPath tool) is shown below. As you can see, the text content of div[@class='infobox'] no longer appears in it.

 -----------------------
 Title 1
 -----------------------
 -----------------------
 Sub-Title 1
 -----------------------
 -----------------------
 Description 1.
 -----------------------
 Description 2.
 -----------------------
 -----------------------
 Sub-Title 2
 -----------------------
 -----------------------
 Description 1
 -----------------------
 Description 2
 -----------------------
 -----------------------
 -----------------------

What is wrong with your approaches?

Your first attempt:

 //*[@id="content"]/text()

in plain English means:

Look at any element (not necessarily a div) anywhere in the document that has an @id attribute whose value is "content". For this element, return all of its immediate child text nodes.

Problem: you lose the text nodes that are not immediate children of the outer div, because they sit inside child elements of this div.

Second attempt:

 //*[@id="content"]//*/text()

Translated to:

Look at any element (not necessarily a div) anywhere in the document that has an @id attribute whose value is "content". For this element, find any descendant element and return all the text nodes that are children of that descendant element.

Problem: you lose the immediate child text nodes of the div, since you only select text nodes that are children of elements inside the div.

EDIT

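Responding to your comment: use the ancestor:: axis instead of parent:: so that text nodes nested more deeply inside the infobox div (for example, inside font or b) are excluded as well: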

 //div[@id = 'content']/descendant::text()[not(ancestor::div/@class='infobox')]

For future questions, please make sure that the HTML you show is representative of your real problem.
