I am working on extracting text from HTML documents and storing it in a database. I am using the webharvest tool to extract the content, but I am stuck at one point. Inside webharvest, I use an XQuery expression to retrieve the data. The HTML document I am processing looks like this:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract the text "Hello world" from the above HTML snippet.
I tried to extract the text this way:
$hw :=data($item//a[@name='hw']/text())
However, what I always get is "HELLOWORLD" instead of "Hello world".
Is there any way to extract "Hello world"? Please help.
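One approach that may work here, assuming the XQuery engine behind webharvest supports the standard following-sibling axis: "Hello world" is not inside the <a> element but is a sibling text node that comes right after it, so a/text() can only ever return "HELLOWORLD". In the sketch below, $item is bound to the sample <td> purely for illustration; in a real configuration it would be whatever node the surrounding loop provides.

```xquery
(: Illustration only: $item stands in for the node being processed :)
let $item := <td><a name="hw">HELLOWORLD</a>Hello world</td>
(: Locate the anchor, then take the first text node that follows it
   at the same level instead of the anchor's own text child :)
let $hw := data($item//a[@name='hw']/following-sibling::text()[1])
return $hw
```

Wrapping the result in normalize-space() may help if the real page has line breaks or indentation around the text.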
What if I want to do it as follows:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
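Here I want to extract "Hello world2", which sits between hw2 and hw3. I would rather not rely on a position such as text()[3]; instead, I want to select the text that lies between a[@name='hw2'] and a[@name='hw3'].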
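A minimal sketch of one way this could be expressed, assuming the engine supports the following-sibling axis and the XQuery node-order operator <<. The variable names $start and $end are hypothetical, and $item is again bound to the sample markup only for illustration.

```xquery
(: Illustration only: $item stands in for the node being processed :)
let $item :=
  <td>
    <a name="hw1">HELLOWORLD1</a>Hello world1
    <a name="hw2">HELLOWORLD2</a>Hello world2
    <a name="hw3">HELLOWORLD3</a>Hello world3
  </td>
let $start := $item//a[@name='hw2']
let $end   := $item//a[@name='hw3']
(: Sibling text nodes after the hw2 anchor that come before the
   hw3 anchor in document order :)
let $between := $start/following-sibling::text()[. << $end]
(: Join the pieces and trim the surrounding line breaks and indentation :)
return normalize-space(string-join($between, ''))
```

This selects by the two named anchors rather than by a fixed index, so it should keep working if more anchors are added before hw2.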