How to parse author name and book title from purified HTML using XPath?

Question

The HTML below is text that I scraped from a remote site and stored, as is, in a local variable.

Now I need to parse authorName and bookTitle out of the HTML into their own variables, given the following uniform format of the cleaned-up text:

<p>
  William Faulkner - 'Light In August'
  <br/>
  William Faulkner - 'Sanctuary'
  <br/>
  William Faulkner - 'The Sound and the Fury'
</p>

Can this be done in XPath?

+3

xpath

snoopy Oct 18 '10 at 15:56

3 answers

In XPath 1.0, select the text node children of p:

/p/text()

Then, for a given text node (here the first), take the part before the hyphen:

substring-before(/p/text()[1],'-')

Result:

  William Faulkner 

substring-after(/p/text()[1],'-')

Result:

 'Light In August'

XPath 2.0:

/p/text()/substring-before(.,'-')

Result, a sequence of 3 items:

William Faulkner William Faulkner William Faulkner

/p/text()/substring-after(.,'-')

Result, a sequence of 3 items:

'Light In August' 'Sanctuary' 'The Sound and the Fury'

+2

user357812 Oct 18 '10 at 16:08

The author of the $N-th entry is given by this XPath expression:

substring-before(normalize-space(p/text()[$N]), ' -')

The title of the $N-th entry is given by this XPath expression:

substring-after(normalize-space(p/text()[$N]), ' - ')

The number of entries is:

count(p/text())

Evaluate the two XPath expressions above for every $N in the interval:

[1,count(p/text())]
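For anyone driving this from a host language, the same steps can be mirrored outside an XPath engine. This is only a minimal Python sketch and not part of the original answer: the helper names normalize_space, substring_before, substring_after and entry are hypothetical, the text nodes are hard-coded from the sample in the question, and the helpers only approximate the XPath 1.0 functions for input of this shape.

```python
import re

# Text nodes of <p> from the question's sample, in document order.
text_nodes = [
    "\n  William Faulkner - 'Light In August'\n  ",
    "\n  William Faulkner - 'Sanctuary'\n  ",
    "\n  William Faulkner - 'The Sound and the Fury'\n",
]


def normalize_space(s):
    """Collapse runs of space, tab, CR and LF to one space, then trim."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def substring_before(s, sep):
    """Part of s before the first sep, or '' when sep does not occur."""
    head, found, _ = s.partition(sep)
    return head if found else ""


def substring_after(s, sep):
    """Part of s after the first sep, or '' when sep does not occur."""
    _, found, tail = s.partition(sep)
    return tail if found else ""


def entry(n):
    """Author and title of the n-th entry; n is 1-based, like $N."""
    line = normalize_space(text_nodes[n - 1])
    return substring_before(line, " -"), substring_after(line, " - ")


# Evaluate for every N in [1, count], as the answer suggests.
entries = [entry(n) for n in range(1, len(text_nodes) + 1)]
for author, title in entries:
    print(author, "|", title)
```

Splitting on the hyphen together with its surrounding spaces, rather than on a bare '-', should keep a hyphenated name or title intact.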

+1

Dimitre Novatchev Oct 18 '10 at 16:06

Tomalak · Accepted Answer · 2010-10-18T16:02:36+0000

Yes, and easily:

//p/text()

This will give you three separate text nodes:

"
  William Faulkner - 'Light In August'
  ",
"
  William Faulkner - 'Sanctuary'
  ",
"
  William Faulkner - 'The Sound and the Fury'
"

Remember that leading and trailing whitespace (including any line breaks) is always part of a text node's value. Trim the result.

I assume you do not need help splitting the resulting lines into author and title.
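For completeness, here is one way the trim-and-split step might look in Python. Treat it as a sketch under assumptions rather than the answerer's method: it uses the standard library's xml.etree.ElementTree, whose limited XPath subset has no text() step, so itertext() stands in for //p/text(); it relies on the fragment being well-formed XML; and author_name and book_title are illustrative names for the asker's authorName and bookTitle.

```python
import xml.etree.ElementTree as ET

html = """<p>
  William Faulkner - 'Light In August'
  <br/>
  William Faulkner - 'Sanctuary'
  <br/>
  William Faulkner - 'The Sound and the Fury'
</p>"""

# Parsing as XML works here only because the fragment is well-formed.
root = ET.fromstring(html)

books = []
# itertext() yields all text inside <p> in document order; for this flat
# fragment that is the same set of nodes //p/text() would select.
for raw in root.itertext():
    line = raw.strip()  # drop the leading/trailing whitespace and newlines
    if not line:
        continue
    author_name, sep, book_title = line.partition(" - ")
    if not sep:
        continue  # skip lines that do not follow "author - 'title'"
    # Remove the single quotes that wrap each title in the sample.
    books.append((author_name.strip(), book_title.strip().strip("'")))

for author_name, book_title in books:
    print(author_name, "|", book_title)
```

For real-world HTML that is not well-formed, an HTML-aware parser with full XPath support would be needed instead of ET.fromstring.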
