Building Strings by scraping HTML with Jsoup

I am a beginner Java programmer, and I am starting to explore the world of libraries, APIs, and so on. I have an idea for a relatively simple project that could become my favorite thing to work on when I am not doing homework.

I am interested in scraping HTML from several different sites and creating Strings of the form Artist - "Track Name". I have one site working the way I want, but I feel it could be done much more cleanly. Here is a summary of what I am doing for site A:

I have Jsoup create Elements for everything that belongs to the plrow class:

<p class="plrow"><b><a href="playlist.php?station=foo">Artist</a></b> "Title" (<span class="sn_ld"><a href="playlist.php?station=foo">Label</a></span>) <SMALL><b>N </b></SMALL></p></td></tr><tr class="ev"><td><a name="98069"></a><p class="pltime">Time</p> 

From there I create an array of Strings, split after each closing </p>, and then use the following code to process the array:

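    for (int i = 0; i < tracks.length; i++) {
        // Strip the markup, leaving text such as: Artist "Title" (Label) N
        tracks[i] = Jsoup.parse(tracks[i]).text();
        // Cut everything from the closing quote and the "(Label)" part onward
        tracks[i] = tracks[i].split("\" \\(")[0];
        // Put the closing quote back
        tracks[i] = tracks[i] + "\"";
    }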

It is a pretty hacky way to get Artist "Title" the way I want, but the result is good enough for me.

Site B is a little different.

I have determined that all artists and titles are marked up as follows: <span class="artist" property="foaf:name">Artist Name</span> </a> </span> <span class="title" property="dc:title">Title</span>

Together with some additional information, all of this sits inside <li id="segmentevent-random" class="segment track" typeof="po:MusicSegment" about="/url"> song info </li>

I tried to select all the artists first, then the titles, and then combine them, but I ran into problems because the "dc:title" property used for the track name is also used for other, non-musical things, so I can't directly match each artist to its track.

I spent the lion's share of this weekend trying to get this to work, looking through countless questions posted about Jsoup and reading the Jsoup cookbook and API guide. I have the feeling that part of my problem may come from my relatively limited knowledge of how web pages are coded, although it may mainly be that I don't understand how to connect these bits of code in Jsoup.

I appreciate any help or guidance, and I have to say that it is very nice to ask a question here that is not related to homework (although I have found quite a lot of hints in questions others have asked!)

1 answer

General:

If you want to parse content from several different sites, it is a good idea to differentiate between them. For example, you can decide whether to parse page A or page B based on the URL.

Example:

    if (urlToPage.contains("pagea.com")) {
        // Call the parse method for page A or create a parser class
    } else if (urlToPage.contains("pageb.com")) {
        // Call the parse method for page B or create a parser class
    }
    // ...
    else {
        // E.g. throw an exception because there is no parser available
    }
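If the number of sites grows, the if/else chain can be replaced by a small lookup table keyed by host name. The following is only a sketch under assumptions: `ParserRegistry`, `register`, and `parse` are hypothetical names, and each parser is modeled as a plain function from URL to a list of lines. It compares the exact host rather than using `contains`, so a URL that merely mentions `pagea.com` in its query string is not matched by mistake.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: picks a site-specific parser by the URL's host name.
class ParserRegistry {
    private final Map<String, Function<String, List<String>>> parsersByHost = new HashMap<>();

    // Registers a parser for one host, e.g. "pagea.com".
    void register(String host, Function<String, List<String>> parser) {
        parsersByHost.put(host.toLowerCase(Locale.ROOT), parser);
    }

    // Looks up the parser for the URL's host and runs it.
    List<String> parse(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        host = host.toLowerCase(Locale.ROOT);
        if (host.startsWith("www.")) {
            host = host.substring(4); // treat www.pagea.com like pagea.com
        }
        Function<String, List<String>> parser = parsersByHost.get(host);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + host);
        }
        return parser.apply(url);
    }
}
```

Each site's parse method would be registered once, for example `registry.register("pagea.com", url -> parsePageA(url))`, where `parsePageA` stands for whatever method you write for that site.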

You can connect to each page and parse it into a Document with a single line of code:

    // Note: the protocol (http) is required here
    Document doc = Jsoup.connect("http://pagewhaterver.com").get();

Without knowing the HTML or structure of each page, here are a few basic approaches:

Page A:

    for (Element element : doc.select("p.plrow")) {
        // Title - output: '"Title" ()' (you have to remove the quotes and the () here)
        String title = element.ownText();
        String artist = element.select("a").first().text(); // Artist
        String label = element.select("span.sn_ld").first().text(); // Label
        // etc.
    }
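The cleanup of the title can be done with plain string handling. This is a minimal sketch, assuming `ownText()` really returns something like `"Title" ()` for this markup. `TrackLine`, `cleanTitle`, and `format` are hypothetical names introduced only for illustration.

```java
// Hypothetical helper for page A: builds the line  Artist "Title"
class TrackLine {
    // Removes the empty parentheses left behind by the label span,
    // then strips one pair of surrounding double quotes if present.
    static String cleanTitle(String ownText) {
        String t = ownText.replaceAll("\\(\\s*\\)", "").trim();
        if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
            t = t.substring(1, t.length() - 1);
        }
        return t.trim();
    }

    // Combines the artist with the cleaned title.
    static String format(String artist, String ownText) {
        return artist.trim() + " \"" + cleanTitle(ownText) + "\"";
    }
}
```

Inside the loop above this would be used as `TrackLine.format(artist, title)`.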

Page B:

Similar to page A, the artist and title can be selected as follows:

    String artist = doc.select("span.artist").first().text();
    String title = doc.select("span.title").first().text();
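The two lines above only return the first match on the whole page. To keep each artist paired with its own title, and to skip the non-musical items that also carry `dc:title`, one option is to select each track's `li` first and then search only inside it. This is a sketch, assuming every track sits in an `li` with the classes `segment` and `track` as shown in the question. `SiteBTracks` and `extract` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper for page B: one line per track, formatted as  Artist "Title"
class SiteBTracks {
    static List<String> extract(Document doc) {
        List<String> lines = new ArrayList<>();
        // Only list items that are marked as track segments
        for (Element track : doc.select("li.segment.track")) {
            // Search inside this one track, not the whole document
            Element artist = track.select("span.artist").first();
            Element title = track.select("span.title").first();
            if (artist == null || title == null) {
                continue; // skip incomplete entries
            }
            lines.add(artist.text() + " \"" + title.text() + "\"");
        }
        return lines;
    }
}
```

Because the selection is scoped to `li.segment.track`, a `span.title` that belongs to some other kind of segment is never picked up.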

Here is a good overview of the Jsoup Selector API: http://jsoup.org/cookbook/extracting-data/selector-syntax


Source: https://habr.com/ru/post/1445421/

