XML parsing with Jsoup

I receive the following XML, which represents a news article:

<content> Some text blalalala <h2>Small subtitle</h2> Some more text blbla <ul class="list"> <li>List item 1</li> <li>List item 2</li> </ul> <br /> Even more freakin text </content> 

I know that the format is not perfect, but for now I have to accept it.

The article should look like this:

  • Some text blalalala
  • Small subtitle
  • List with items
  • Even more freakin text

I am parsing this XML with Jsoup. I can get the text in the <content> tag using doc.ownText(), but then I have no idea where the other elements (such as the subtitle) sit relative to that text; I get only one big String.

Would it be better to use an event-based parser (I hate them :(), or is it possible to do something like doc.getTextUntilTagAppears("tagName")?

Edit: for clarification, I know how to get the elements under <content>. My problem is getting the text directly inside <content>, which is split up every time a child element interrupts it.

I found out that I can get all the text in the content using .textNodes(). It works fine, but again I don't know where each text node belongs in the article (one goes at the top before the h2, another at the bottom).

2 answers

My mistake was iterating over the XML Elements, which do not include TextNodes. When I iterate node by node instead, I can check whether each Node is an Element or a TextNode and handle it accordingly.
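
A minimal sketch of that node-by-node walk, assuming the XML is available as a String and is parsed with Jsoup's XML parser. The class name ContentWalker, the method parts, and the "text:" / "subtitle:" / "item:" prefixes are made up for illustration; only the Jsoup calls come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

public class ContentWalker {

    // Returns the pieces of <content> in document order, one labelled string per piece.
    // The labels are illustrative only; a real article model would use its own types.
    static List<String> parts(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.selectFirst("content");
        List<String> out = new ArrayList<>();
        if (content == null) {
            return out; // no <content> element found
        }
        // childNodes() yields TextNodes and Elements interleaved, in source order.
        for (Node node : content.childNodes()) {
            if (node instanceof TextNode) {
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    out.add("text: " + text); // skip whitespace-only nodes
                }
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.tagName().equals("h2")) {
                    out.add("subtitle: " + el.text());
                } else if (el.tagName().equals("ul")) {
                    for (Element li : el.children()) {
                        out.add("item: " + li.text());
                    }
                }
                // Other elements, such as <br />, are ignored in this sketch.
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<content> Some text blalalala <h2>Small subtitle</h2> Some more text blbla "
                + "<ul class=\"list\"> <li>List item 1</li> <li>List item 2</li> </ul> <br /> "
                + "Even more freakin text </content>";
        for (String part : parts(xml)) {
            System.out.println(part);
        }
    }
}
```

Because the text nodes and elements come back in source order, each piece of loose text stays where it belongs relative to the h2 and the list.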


Jsoup has a fantastic selector-based syntax; see the Jsoup selector documentation.

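If you want the subtitle, first parse the file:

 Document doc = Jsoup.parse(new File("path-to-your-xml"), "UTF-8"); // takes a java.io.File; Jsoup.parse(String) would treat the argument as markup, not as a path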

You know that the subtitle is in the h2 element:

 Element subtitle = doc.select("h2").first(); // first h2 element that appears 

And if you want the list items:

 Elements listItems = doc.select("ul.list > li");
 for (Element item : listItems)
     System.out.println(item.text()); // print list items one after another
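
Put together, the selector approach might look like the sketch below. It assumes the XML is already in a String and uses Jsoup's XML parser; the class name SelectorExample and its helper methods are invented for illustration. Note that selectors find the subtitle and the list items, but they do not say where the loose text inside <content> sits relative to them.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.util.List;

public class SelectorExample {

    // Parses the markup as XML rather than HTML.
    static Document parse(String xml) {
        return Jsoup.parse(xml, "", Parser.xmlParser());
    }

    // Text of the first <h2>, or null if there is none.
    static String subtitle(String xml) {
        Element h2 = parse(xml).select("h2").first();
        return h2 == null ? null : h2.text();
    }

    // Text of every <li> that is a direct child of <ul class="list">.
    static List<String> listItems(String xml) {
        return parse(xml).select("ul.list > li").eachText();
    }

    public static void main(String[] args) {
        String xml = "<content> Some text blalalala <h2>Small subtitle</h2> Some more text blbla "
                + "<ul class=\"list\"> <li>List item 1</li> <li>List item 2</li> </ul> <br /> "
                + "Even more freakin text </content>";
        System.out.println(subtitle(xml));
        for (String item : listItems(xml)) {
            System.out.println(item);
        }
    }
}
```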

Source: https://habr.com/ru/post/1490878/

