I want to get the names of all the links between these two h2 tags:
<h2><span class="mw-headline" id="People">People</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Bush&action=edit&section=1" title="Edit section: People">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<ul>
<li><a href="/wiki/George_H._W._Bush" title="George H. W. Bush">George H. W. Bush</a> (born 1924), the 41st president of the United States of America</li>
<li><a href="/wiki/George_W._Bush" title="George W. Bush">George W. Bush</a> (born 1946), the 43rd president of the United States of America</li>
<li><a href="/wiki/Jeb_Bush" title="Jeb Bush">Jeb Bush</a> (born 1953), the former governor of Florida and also a member of the Bush family</li>
<li><a href="/wiki/Bush_family" title="Bush family">Bush family</a>, the political family that includes both presidents</li>
<li><a href="/wiki/Bush_(surname)" title="Bush (surname)">Bush (surname)</a>, a surname (including a list of people with the name) </li>
</ul>
<h2><span class="mw-headline" id="Places.2C_United_States">Places, United States</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Bush&action=edit&section=2" title="Edit section: Places, United States">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
Neither this:
Elements h2next = docx.select("span.mw-headline#People");
Element ul;
do {
    ul = h2next.select("ul").first();
    System.out.println(ul.text());
} while (h2next != null && ul == null);
nor this:
//String content = docx.getElementById("People").outerHtml();
works.
This guy seems to have the right idea, but I can't adapt his approach to my situation.
Maybe I just need to use regex?
It seems that Wikipedia's HTML is kind of "unstructured" and difficult to work with.
From the Wikipedia disambiguation page, I want to capture the various senses in which Bush (or whatever ambiguous name I am considering) can refer to a person.
I tried all kinds of ways to capture this data with jsoup, but I could not figure it out.
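On the regex idea: a rough sketch of what that could look like with only the standard library, assuming the flat markup shown above (heading, then list, then the next heading). The class name `RegexLinks`, the method `linkNamesAfterHeadline`, and the trimmed `SAMPLE` string are made up for illustration. It does not decode HTML entities and would break on nested tags inside a link, so it is brittle compared with a real parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinks {

    // Trimmed copy of the markup from the question (hypothetical test input).
    static final String SAMPLE =
        "<h2><span class=\"mw-headline\" id=\"People\">People</span>"
        + "<a href=\"/w/index.php?title=Bush&action=edit&section=1\">edit</a></h2>"
        + "<ul>"
        + "<li><a href=\"/wiki/George_H._W._Bush\">George H. W. Bush</a> (born 1924)</li>"
        + "<li><a href=\"/wiki/George_W._Bush\">George W. Bush</a> (born 1946)</li>"
        + "<li><a href=\"/wiki/Jeb_Bush\">Jeb Bush</a> (born 1953)</li>"
        + "</ul>"
        + "<h2><span class=\"mw-headline\" id=\"Places.2C_United_States\">"
        + "Places, United States</span></h2>";

    // Returns the text of every <a> between the heading that carries the
    // given id and the next <h2>.
    static List<String> linkNamesAfterHeadline(String html, String headlineId) {
        List<String> names = new ArrayList<>();
        int idPos = html.indexOf("id=\"" + headlineId + "\"");
        if (idPos < 0) {
            return names;
        }
        // Start after the heading's own closing tag so its edit link is skipped.
        int sectionStart = html.indexOf("</h2>", idPos);
        if (sectionStart < 0) {
            return names;
        }
        int sectionEnd = html.indexOf("<h2", sectionStart);
        if (sectionEnd < 0) {
            sectionEnd = html.length();
        }
        String section = html.substring(sectionStart, sectionEnd);
        // Matches only links whose content is plain text (no nested tags).
        Matcher m = Pattern.compile("<a\\b[^>]*>([^<]*)</a>").matcher(section);
        while (m.find()) {
            names.add(m.group(1).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : linkNamesAfterHeadline(SAMPLE, "People")) {
            System.out.println(name);
        }
    }
}
```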
Here is what I tried:
Document docx = Jsoup.connect("https://en.wikipedia.org/wiki/Bush").get();
Element contentDiv = docx.select("span#mw-headlinePeople").first();
String printMe = contentDiv.toString();
In the page source, the heading I am after looks like this:
<h2><span class="mw-headline" id="People">
Is something wrong with this selector?
.select("span#mw-headlinePeople");
Ideally, the output would be something like:
George H. W. Bush
George W. Bush
Jeb Bush
The list also contains Bush family and Bush (surname).
I have tried loading the page both with jsoup:
Document docx = Jsoup.connect("https://en.wikipedia.org/wiki/Bush").get();
and with a plain URLConnection:
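import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

URL site_two = new URL("https://en.wikipedia.org/wiki/Bush");
URLConnection ycb = site_two.openConnection();
BufferedReader inb = new BufferedReader(
        new InputStreamReader(ycb.getInputStream()));
StringBuilder sb = new StringBuilder();
String inputLine;
// readLine() is called only in the loop condition; a second call in the
// body would silently skip every other line.
while ((inputLine = inb.readLine()) != null) {
    sb.append(inputLine);
    sb.append(System.lineSeparator());
}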
, . - jsoup-, .
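For what it's worth, here is one possible way to do this with jsoup, sketched under the assumption that the page has the flat structure shown in the question (the h2 and the ul are siblings). Note that `span#mw-headlinePeople` looks for an element whose id is `mw-headlinePeople`; class plus id would be written `span.mw-headline#People`. The sketch looks the span up by id, steps up to its parent h2, and walks the following siblings until the next h2. The class name `PeopleLinks`, the method `linkNamesUnderHeading`, and the trimmed `SAMPLE_HTML` (including its placeholder second section) are made up for illustration, and the live page's markup may differ.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PeopleLinks {

    // Trimmed copy of the markup from the question so the sketch runs offline.
    // The entry in the second section is a hypothetical placeholder.
    static final String SAMPLE_HTML =
        "<h2><span class=\"mw-headline\" id=\"People\">People</span></h2>"
        + "<ul>"
        + "<li><a href=\"/wiki/George_H._W._Bush\">George H. W. Bush</a> (born 1924)</li>"
        + "<li><a href=\"/wiki/George_W._Bush\">George W. Bush</a> (born 1946)</li>"
        + "<li><a href=\"/wiki/Jeb_Bush\">Jeb Bush</a> (born 1953)</li>"
        + "<li><a href=\"/wiki/Bush_family\">Bush family</a>, the political family</li>"
        + "<li><a href=\"/wiki/Bush_(surname)\">Bush (surname)</a>, a surname</li>"
        + "</ul>"
        + "<h2><span class=\"mw-headline\" id=\"Places.2C_United_States\">"
        + "Places, United States</span></h2>"
        + "<ul><li><a href=\"/wiki/Example_place\">Example place</a></li></ul>";

    // Collects the text of each list-item link that sits between the heading
    // containing the given headline id and the next <h2>.
    static List<String> linkNamesUnderHeading(Document doc, String headlineId) {
        List<String> names = new ArrayList<>();
        Element headline = doc.getElementById(headlineId);
        if (headline == null || headline.parent() == null) {
            return names;
        }
        Element heading = headline.parent(); // the enclosing <h2>
        for (Element sib = heading.nextElementSibling();
             sib != null && !sib.tagName().equals("h2");
             sib = sib.nextElementSibling()) {
            // Only links that are direct children of a list item.
            for (Element link : sib.select("li > a[href]")) {
                names.add(link.text());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // For the live page, Jsoup.connect(url).get() would replace Jsoup.parse.
        Document doc = Jsoup.parse(SAMPLE_HTML);
        for (String name : linkNamesUnderHeading(doc, "People")) {
            System.out.println(name);
        }
    }
}
```

This returns every link in the section, Bush family and Bush (surname) included; dropping those would need an extra filter on the link text or href.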