I'm trying to scrape a webpage:
library(RCurl)
webpage <- getURL("https://somewebpage.com")
webpage
<div class='CredibilityFacts'><span id='qZyoLu'><a class='answer_permalink'
action_mousedown='AnswerPermalinkClickthrough' href='/someurl/answer/my_id'
id ='__w2_yeSWotR_link'>
<a class='another_class' action_mousedown='AnswerPermalinkClickthrough'
href='/ignore_url/answer/some_id' id='__w2_ksTVShJ_link'>
<a class='answer_permalink' action_mousedown='AnswerPermalinkClickthrough'
href='/another_url/answer/new_id' id='__w2_ksTVShJ_link'>
class(webpage)
[1] "character"
I'm trying to extract every href value, but only when it is preceded by the class answer_permalink.
The result should be:
[1] "/someurl/answer/my_id" "/another_url/answer/new_id"
/ignore_url/answer/some_id should be ignored because it is preceded by the class another_class, not answer_permalink.
Right now, I'm thinking of a regex approach. I think something like this could be used as the regular expression in stri_extract_all:
class='answer_permalink'.*href='
but that’s not quite what I want.
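The pattern above most likely overshoots because .* is greedy and can run across tag boundaries, so a single match may stretch from the first answer_permalink all the way to the last href. Here is a minimal base-R sketch of a tighter pattern. It assumes attribute values are single-quoted and that href comes after class inside the same tag. The webpage string is a shortened stand-in for the real page, and the names pattern, matches and hrefs are only illustrative.

```r
# Stand-in for the downloaded page (assumption: same attribute layout as above)
webpage <- paste0(
  "<a class='answer_permalink' action_mousedown='AnswerPermalinkClickthrough' ",
  "href='/someurl/answer/my_id' id='__w2_yeSWotR_link'>",
  "<a class='another_class' action_mousedown='AnswerPermalinkClickthrough' ",
  "href='/ignore_url/answer/some_id' id='__w2_ksTVShJ_link'>",
  "<a class='answer_permalink' action_mousedown='AnswerPermalinkClickthrough' ",
  "href='/another_url/answer/new_id' id='__w2_ksTVShJ_link'>"
)

# [^>]* cannot cross the closing '>' of a tag, so each match stays inside one <a ...>
pattern <- "class='answer_permalink'[^>]*href='[^']*'"
matches <- regmatches(webpage, gregexpr(pattern, webpage))[[1]]

# Keep only the text between the quotes that follow href=
hrefs <- sub("^[^>]*href='([^']*)'$", "\\1", matches)
hrefs
```

This would break if href appeared before class in the tag or if the page used double quotes, which is one reason a real HTML parser may be the safer route.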
How can I achieve this? Also, besides regular expressions, is there a function in R that can extract elements by class, as in JavaScript?
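On selecting by class, one possibility is the xml2 package, which parses the HTML and lets you query it with XPath. This is a hedged sketch rather than a confirmed solution. xml2 is an assumption (it is not used anywhere above), the html string is a shortened stand-in for the real page, and the XPath test only matches tags whose class attribute is exactly answer_permalink.

```r
library(xml2)  # assumption: xml2 is installed; it is not part of base R

# Stand-in for the downloaded page (assumption: same attribute layout as above)
html <- paste0(
  "<div class='CredibilityFacts'>",
  "<a class='answer_permalink' href='/someurl/answer/my_id'></a>",
  "<a class='another_class' href='/ignore_url/answer/some_id'></a>",
  "<a class='answer_permalink' href='/another_url/answer/new_id'></a>",
  "</div>"
)

doc <- read_html(html)  # parse the literal HTML string

# Select every <a> whose class attribute equals answer_permalink
links <- xml_find_all(doc, "//a[@class='answer_permalink']")

# Pull the href attribute from each selected node, in document order
hrefs <- xml_attr(links, "href")
hrefs
```

With the real page, the character string returned by getURL could be passed to read_html in place of html.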