How to select all the text of child nodes, but exclude one tag, using Scrapy XPath?
I have this HTML:

<div id="content">
<h1>Title 1</h1><br><br>
<h2>Sub-Title 1</h2>
<br><br>
Description 1.<br><br>Description 2.
<br><br>
<h2>Sub-Title 2</h2>
<br><br>
Description 1<br>Description 2<br>
<br><br>
<div class="infobox">
<font style="color:#000000"><b>Information Title</b></font>
<br><br>Long Information Text
</div>
</div>

I want to get all the text in <div id="content"> using XPath in Scrapy, but excluding the content of <div class="infobox">, so the expected result looks like this:

Title 1
Sub-Title 1
Description 1.
Description 2.
Sub-Title 2
Description 1.
Description 2.

But I haven't come to the exclusion part yet; I'm still trying to grab the text from <div id="content">.
I tried this:
response.xpath('//*[@id="content"]/text()').extract()

But it only returns Description 1. and Description 2. from both subheadings.
Then I tried:
response.xpath('//*[@id="content"]//*/text()').extract()

It returns only Title 1, Sub-Title 1, Sub-Title 2, Information Title and Long Information Text.
There are two questions here:
- How can I get all the text of the children of the content div?
- How do I exclude the infobox div from the selection?
Use the descendant:: axis to find the text nodes among the descendants, and state explicitly that the parent of these text nodes must not be a div[@class='infobox'] element.
Turning the above into an XPath expression:
//div[@id = 'content']/descendant::text()[not(parent::div/@class='infobox')]

The result (tested with an online XPath tool) is shown below. As you can see, the text content of the div[@class='infobox'] no longer appears in the result.

-----------------------
Title 1
-----------------------
-----------------------
Sub-Title 1
-----------------------
-----------------------
Description 1.
-----------------------
Description 2.
-----------------------
-----------------------
Sub-Title 2
-----------------------
-----------------------
Description 1
-----------------------
Description 2
-----------------------
-----------------------
-----------------------

What is wrong with your approaches?
Your first attempt:
//*[@id="content"]/text()

in plain English means:

Look for any element (not necessarily a div) anywhere in the document that has an @id attribute whose value is "content". For this element, return all its immediate child text nodes.

Problem: you lose the text nodes that are not immediate children of the outer div, because they sit inside child elements of this div.
Second attempt:
//*[@id="content"]//*/text()

translates to:

Look for any element (not necessarily a div) anywhere in the document that has an @id attribute whose value is "content". For this element, find every descendant element and return all the text nodes that are children of those descendant elements.

Problem: you lose the immediate child text nodes of the div, since you only look at text nodes that are children of elements inside the div.
EDIT
Responding to your comment:
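//div[@id = 'content']/descendant::text()[not(ancestor::div/@class='infobox')]

Using ancestor:: instead of parent:: also excludes text nodes that are nested deeper inside the infobox, such as the text inside the font and b elements.

For your future questions, please make sure that the HTML you show is representative of your real problem.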