
How to select all the text of children, excluding a tag, using Scrapy XPath?

I have this html:

 <div id="content">
   <h1>Title 1</h1><br><br>
   <h2>Sub-Title 1</h2>
   <br><br>
   Description 1.<br><br>Description 2.
   <br><br>
   <h2>Sub-Title 2</h2>
   <br><br>
   Description 1<br>Description 2<br>
   <br><br>
   <div class="infobox">
     <font style="color:#000000"><b>Information Title</b></font>
     <br><br>Long Information Text
   </div>
 </div>

I want to get all the text in <div id="content"> using XPath in Scrapy, but excluding the content of <div class="infobox">, so the expected result looks like this:

 Title 1 Sub-Title 1 Description 1. Description 2. Sub-Title 2 Description 1. Description 2. 

I haven't reached the exclusion part yet; I'm still trying to grab the text from <div id="content">.

I tried this:

 response.xpath('//*[@id="content"]/text()').extract() 

But it only returns Description 1. and Description 2. from both subheadings.

Then I tried:

 response.xpath('//*[@id="content"]//*/text()').extract() 

It returns only Title 1, Sub-Title 1, Sub-Title 2, Information Title and Long Information Text.


There are two questions here:

  • How can I get all the text of the children of the content div?
  • How can I exclude the infobox div from the selection?
1 answer

Use the descendant:: axis to find the text nodes among the descendants, and explicitly state that the parent of these text nodes must not be a div[@class='infobox'] element.

Turning the above into an XPath expression:

 //div[@id = 'content']/descendant::text()[not(parent::div/@class='infobox')] 

The result then looks as follows (I tested it with an online XPath tool). As you can see, the text content of the div[@class='infobox'] no longer appears in the result.

 -----------------------
 Title 1
 -----------------------
 -----------------------
 Sub-Title 1
 -----------------------
 -----------------------
 Description 1.
 -----------------------
 Description 2.
 -----------------------
 -----------------------
 Sub-Title 2
 -----------------------
 -----------------------
 Description 1
 -----------------------
 Description 2
 -----------------------
 -----------------------
 -----------------------
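
To try the expression outside a spider, a minimal sketch with lxml (the library Scrapy's selectors are built on) might look like the following. The markup is a shortened stand-in for the question's HTML in which the infobox holds only plain text, and the names HTML, nodes and texts are illustrative. The whitespace-only text nodes, which show up as empty entries in the output above, are dropped in Python.

```python
from lxml import html

# Shortened, hypothetical markup: the infobox contains only direct text.
HTML = """
<div id="content">
  <h1>Title 1</h1><br><br>
  <h2>Sub-Title 1</h2><br><br>
  Description 1.<br><br>Description 2.
  <div class="infobox">Long Information Text</div>
</div>
"""

doc = html.document_fromstring(HTML)

# All descendant text nodes whose parent is not the infobox div.
nodes = doc.xpath(
    "//div[@id = 'content']/descendant::text()"
    "[not(parent::div/@class='infobox')]"
)

# Drop whitespace-only nodes and trim the rest.
texts = [t.strip() for t in nodes if t.strip()]
print(texts)
```

In a Scrapy callback the same expression string would be passed to response.xpath(...) and the results read with .extract().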

What was wrong with your approaches?

Your first attempt:

 //*[@id="content"]/text() 

in plain English means:

Look for any element (not necessarily a div) anywhere in the document that has an @id attribute whose value is "content". For this element, return all of its immediate child text nodes.

Problem: you lose the text nodes that are not immediate children of the outer div, because they sit inside child elements of this div.


Second attempt:

 //*[@id="content"]//*/text() 

translates to:

Look for any element (not necessarily a div) anywhere in the document that has an @id attribute whose value is "content". For this element, find every descendant element and return all the text nodes that are children of that descendant element.

Problem: you lose the immediate child text nodes of the div, since you only look at text nodes that are children of the div's descendant elements.
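
Both failure modes can be reproduced on a small fragment. The sketch below uses lxml rather than a Scrapy response, and the fragment and the variable names are illustrative, not taken from the question.

```python
from lxml import html

# Small illustrative fragment: one text node inside a child element,
# one text node directly inside the div.
doc = html.document_fromstring(
    '<div id="content"><h2>Sub-Title 1</h2>Description 1.</div>'
)

# First attempt: only text nodes that are direct children of the div.
direct = doc.xpath('//*[@id="content"]/text()')

# Second attempt: only text nodes that sit inside descendant elements.
nested = doc.xpath('//*[@id="content"]//*/text()')

# descendant::text() covers both cases, in document order.
both = doc.xpath('//*[@id="content"]/descendant::text()')

print(direct)  # ['Description 1.']
print(nested)  # ['Sub-Title 1']
print(both)    # ['Sub-Title 1', 'Description 1.']
```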


EDIT

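Responding to your comment: the parent:: test only excludes text nodes that are direct children of the infobox div. To also exclude text nested deeper inside it, such as the "Information Title" inside font and b, test the ancestor:: axis instead: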

 //div[@id = 'content']/descendant::text()[not(ancestor::div/@class='infobox')] 

In future questions, please make sure that the HTML you show is representative of your real problem.
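
As a hedged illustration of the difference between the two axes, the sketch below runs both predicates against the HTML from the question, using lxml instead of a Scrapy response. The names HTML, BASE, clean, by_parent and by_ancestor are made up for this example, and the added [normalize-space()] predicate is an optional extra that filters out whitespace-only text nodes.

```python
from lxml import html

# The HTML from the question (line breaks added for readability).
HTML = """<div id="content"> <h1>Title 1</h1><br><br> <h2>Sub-Title 1</h2> <br><br>
Description 1.<br><br>Description 2. <br><br> <h2>Sub-Title 2</h2> <br><br>
Description 1<br>Description 2<br> <br><br>
<div class="infobox"> <font style="color:#000000"><b>Information Title</b></font>
<br><br>Long Information Text </div> </div>"""

doc = html.document_fromstring(HTML)

# Descendant text nodes that contain more than whitespace.
BASE = "//div[@id = 'content']/descendant::text()[normalize-space()]"


def clean(nodes):
    """Trim surrounding whitespace from each text node."""
    return [t.strip() for t in nodes]


# parent:: only rejects text nodes directly inside the infobox div.
by_parent = clean(doc.xpath(BASE + "[not(parent::div/@class='infobox')]"))

# ancestor:: rejects text nodes anywhere below the infobox div.
by_ancestor = clean(doc.xpath(BASE + "[not(ancestor::div/@class='infobox')]"))

print(by_parent[-1])          # Information Title
print(" ".join(by_ancestor))
```

With the parent:: predicate the "Information Title" text survives because its parent is the b element; with ancestor:: nothing from the infobox remains, and joining the list gives a single line like the one asked for in the question.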


Source: https://habr.com/ru/post/1208918/

