Extracting loosely structured text from Wikipedia HTML

Some Wikipedia pages are, let's say, ambiguous: they list links to specific people who share a name, such as Corzine. These links are difficult to capture with jsoup because they are not clearly structured and do not live in a specific section, as in this example. See the Corzine page here.

How can I capture them? Is jsoup the right tool for this task?

Maybe I need to use a regex, but I'm reluctant to because I want the solution to be generalizable.

</b> may refer to:</p> 
 <ul> 
  <li><a href

^ This part is standard; maybe I can use a regex to match it?

<p><b>Corzine</b> may refer to:</p> 
 <ul> 
  <li><a href="/wiki/Dave_Corzine" title="Dave Corzine">Dave Corzine</a> (born 1956), basketball player</li> 
  <li><a href="/wiki/Jon_Corzine" title="Jon Corzine">Jon Corzine</a> (born 1947), former CEO of <a href="/wiki/MF_Global" title="MF Global">MF Global</a>, former Governor on New Jersey, former CEO of <a href="/wiki/Goldman_Sachs" title="Goldman Sachs">Goldman Sachs</a></li> 
 </ul> 
 <table id="setindexbox" class="metadata plainlinks dmbox dmbox-setindex" style="" role="presentation"> 

The ideal output would be:

Dave Corzine
Jon Corzine

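In other words, I want the links that sit between </b> may refer to:</p> and <table id="setindexbox". The <table id="setindexbox" marker can be located with jsoup, but </b> may refer to:</p> is harder, because <b> and <p> are generic tags.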


This is what I tried:

    Elements table = docx.select("ul");
    Elements links = table.select("li");

    Pattern ppp = Pattern.compile("table id=\"setindexbox\" ");
    Matcher mmm = ppp.matcher(inputLine);

    Pattern pp = Pattern.compile("</b> may refer to:</p>");
    Matcher mm = pp.matcher(inputLine);
    if (mm.matches())
    {
        while (!mmm.matches())
            for (Element link : links)
            {
                String url = link.attr("href");
                String text = link.text();
                System.out.println(text + ", " + url);
            }
    }



Try this:

Elements els = doc.select("p ~ ul a:eq(0)");

Demo: http://try.jsoup.org/~yPvgR0pxvA3oWQSJte4Rfm-lS2Y

This selects the first link (a:eq(0)) in each item of a ul that follows a p. To be more specific, you can use p:contains(corzine) ~ ul a:eq(0).

Or, to keep it general across pages of this kind: :contains(may refer to) ~ ul a:eq(0)
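As a rough sketch, the general selector can be tried against the HTML fragment from the question like this. The class name DisambiguationLinks, the helper extract, and the base URI are assumptions made for illustration; jsoup must be on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DisambiguationLinks {

    // Hypothetical helper: returns "text, absolute URL" for the first link
    // of each list item in a <ul> that follows the "may refer to" paragraph.
    static List<String> extract(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> out = new ArrayList<>();
        for (Element a : doc.select(":contains(may refer to) ~ ul a:eq(0)")) {
            // absUrl resolves the relative /wiki/... href against baseUri
            out.add(a.text() + ", " + a.absUrl("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened copy of the fragment from the question
        String html = "<p><b>Corzine</b> may refer to:</p>"
            + "<ul>"
            + "<li><a href=\"/wiki/Dave_Corzine\">Dave Corzine</a> (born 1956), basketball player</li>"
            + "<li><a href=\"/wiki/Jon_Corzine\">Jon Corzine</a> (born 1947), former CEO of "
            + "<a href=\"/wiki/MF_Global\">MF Global</a></li>"
            + "</ul>"
            + "<table id=\"setindexbox\"></table>";
        extract(html, "https://en.wikipedia.org/").forEach(System.out::println);
    }
}
```

The MF Global link is skipped because it is the second element inside its li, so a:eq(0) does not match it.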

IMHO, CSS selectors are a good fit for this kind of task.
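For comparison, here is a minimal sketch of the regex route the question considers, using only the standard library. The class name RegexDisambiguation and both patterns are assumptions for illustration; they depend on the exact markup shown in the question and will break if Wikipedia changes it, which is the main argument for selectors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDisambiguation {

    // Everything between the intro paragraph and the set-index table
    private static final Pattern BLOCK = Pattern.compile(
        "</b> may refer to:</p>(.*?)<table id=\"setindexbox\"", Pattern.DOTALL);

    // The link that directly follows each <li>
    private static final Pattern FIRST_LINK = Pattern.compile(
        "<li>\\s*<a href=\"([^\"]*)\"[^>]*>([^<]*)</a>");

    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher block = BLOCK.matcher(html);
        // find() searches inside the input; matches() would require the
        // whole input to match the pattern and so would fail on a full page
        if (block.find()) {
            Matcher link = FIRST_LINK.matcher(block.group(1));
            while (link.find()) {
                out.add(link.group(2) + ", " + link.group(1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p><b>Corzine</b> may refer to:</p>\n <ul>\n"
            + "  <li><a href=\"/wiki/Dave_Corzine\" title=\"Dave Corzine\">Dave Corzine</a> (born 1956)</li>\n"
            + "  <li><a href=\"/wiki/Jon_Corzine\" title=\"Jon Corzine\">Jon Corzine</a> (born 1947), former CEO of "
            + "<a href=\"/wiki/MF_Global\" title=\"MF Global\">MF Global</a></li>\n"
            + " </ul>\n <table id=\"setindexbox\" class=\"metadata\">";
        extract(html).forEach(System.out::println);
    }
}
```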


Source: https://habr.com/ru/post/1583985/

