XQuery to extract text in HTML

I am working on extracting text from HTML documents and storing it in a database. I am using the Web-Harvest tool to extract the content, but I am stuck at one point. Inside Web-Harvest, I use an XQuery expression to retrieve the data. The HTML document I am processing looks like this:

              <td><a name="hw">HELLOWORLD</a>Hello world</td>

I need to extract the text "Hello world" from the above HTML snippet.

I tried to extract the text this way:

     $hw := data($item//a[@name='hw']/text())

However, what I always get is "HELLOWORLD" instead of "Hello world".

Is there any way to extract "Hello world"? Please help.

What if I want to do it as follows:

<td>
 <a name="hw1">HELLOWORLD1</a>Hello world1
 <a name="hw2">HELLOWORLD2</a>Hello world2
 <a name="hw3">HELLOWORLD3</a>Hello world3
</td>

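I want to extract the text "Hello world2", which sits between hw2 and hw3. I would rather not use text()[3]; is there some way to extract the text between /a[@name='hw2'] and /a[@name='hw3']?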


Your XPath selects the text of the a nodes, not the text of the td node:

$item//a[@name='hw']/text()

Change it to this:

$item[a/@name='hw']/text()
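To see the difference between the two expressions side by side, here is a minimal sketch. It assumes $item is bound to the td element itself; the question does not show how $item is set in the Web-Harvest configuration, so the inline binding below is only an illustration.

```xquery
(: Assumption: $item is the td element from the question. :)
let $item := <td><a name="hw">HELLOWORLD</a>Hello world</td>
return (
  (: text node inside the a element: "HELLOWORLD" :)
  data($item//a[@name = 'hw']/text()),
  (: text node that is a direct child of td: "Hello world" :)
  data($item[a/@name = 'hw']/text())
)
```

The first expression steps into the a element before asking for text(), so it can only ever return the link text. The second keeps td as the context node and uses a only as a filter.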

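Update (following the comments and the update to the question):

This XPath selects the second text node under $item, where $item has a child a element whose name attribute is hw: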

($item[a/@name='hw']//text())[2]

"I would rather not use text()[3]; is there some way to extract the text between /a[@name='hw2'] and /a[@name='hw3']?"

If there is only one text node between the two <a> nodes, the following is quite simple:

/a[@name='hw3']/preceding::text()[1]
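As a rough check of this single-text-node case, the sketch below applies the same preceding::text()[1] step to the td from the question. The inline $item binding and the normalize-space() call, which trims the surrounding line breaks, are additions for illustration and not part of the original expression.

```xquery
(: Assumption: $item is the td element from the question's second example. :)
let $item :=
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>
(: preceding:: is a reverse axis, so [1] is the nearest text node before hw3 :)
return normalize-space($item/a[@name = 'hw3']/preceding::text()[1])
(: evaluates to "Hello world2" :)
```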

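If there is more than one text node between the two elements, you need to express the intersection of all text nodes that follow the first element with all text nodes that precede the second element. The formula for the intersection of two node-sets (the Kaysian method of intersection) is: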

$ns1[count(.|$ns2) = count($ns2)]

So, in the expression above, simply replace $ns1 with:

/a[@name='hw2']/following-sibling::text()

and $ns2 with:

/a[@name='hw3']/preceding-sibling::text()

Finally, if you really have XQuery (or XPath 2), this is simply:

   /a[@name='hw2']/following-sibling::text() 

intersect

   /a[@name='hw3']/preceding-sibling::text()
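The two intersection approaches can be compared in one query. This is only a sketch: it assumes $item is the td element from the question, and the variable names $ns1 and $ns2 simply mirror the placeholders in the formula above.

```xquery
(: Assumption: $item is the td element from the question's second example. :)
let $item :=
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>
(: text siblings after hw2 :)
let $ns1 := $item/a[@name = 'hw2']/following-sibling::text()
(: text siblings before hw3 :)
let $ns2 := $item/a[@name = 'hw3']/preceding-sibling::text()
return (
  (: Kaysian intersection, also usable in XPath 1.0 :)
  for $t in $ns1[count(. | $ns2) = count($ns2)] return normalize-space($t),
  (: the intersect operator, XPath 2.0 / XQuery only :)
  for $t in ($ns1 intersect $ns2) return normalize-space($t)
)
(: both branches yield "Hello world2" :)
```

The Kaysian form works because the union of a node with $ns2 has the same size as $ns2 only when that node is already a member of $ns2.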

This handles your extended case while letting you select by attribute value rather than by position:

let $item := 
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>

return $item//node()[./preceding-sibling::a/@name = "hw2"][1]

This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".


Source: https://habr.com/ru/post/1751366/

