Get specific content from a website with C#

For a nonprofit private school project, I am writing a program that looks up the lyrics of the song currently playing on Spotify. I have to do this in C# (a requirement), but I can use other languages as well if I want.

I found several sites I can extract the lyrics from. I can already download the full HTML, but I am not sure what to do after that. I asked my teacher, and she told me to use XML (which I also found difficult :p), so I read a little about it and looked for examples, but found nothing that seems to apply to my case.

Time for some code.

Say I wanted to take text from musixmatch.com:

HTML (reformatted for readability):

    <span data-reactid="199">
      <p class="mxm-lyrics__content" data-reactid="200">First line of the lyrics!
    These words will never be ignored
    I don't want a battle
    </p>
      <!-- react-empty: 201 -->
      <div data-reactid="202">
        <div class="inline_video_ad_container_container" data-reactid="203">
          <div id="inline_video_ad_container" data-reactid="204">
            <div class="" style="line-height:0;" data-reactid="205">
              <div id="div_gpt_ad_outofpage_musixmatch_desktop_lyrics" data-reactid="206">
                <script type="text/javascript">
                  //Really nice google ad JS which I have removed;
                </script>
              </div>
            </div>
          </div>
        </div>
        <p class="mxm-lyrics__content" data-reactid="207">But I got a war
    More fancy lyrics
    And lines
    That I want to fetch
    And display
    Tralala lala
    Trouble!
    </p>
      </div>
    </span>

Note that the first three lines of the lyrics are in the first <p>, and the rest are in the second <p> further down. Also note that the two <p> tags have the same class. The full HTML source can be found here: view-source:https://www.musixmatch.com/lyrics/Bullet-for-My-Valentine/You-Want-a-Battle-Here%E2%80%99s-a-War The fragment begins at line 97.

So in this particular example the fragment contains the lyrics and only a little markup that I do not need. So far I have tried to extract the text with the following C#:

    string source = "https://www.musixmatch.com/lyrics/Bullet-for-My-Valentine/You-Want-a-Battle-Here's-a-War";

    // The HtmlWeb class is a utility class to get the HTML over HTTP
    HtmlWeb htmlWeb = new HtmlWeb();

    // Creates an HtmlDocument object from an URL
    HtmlAgilityPack.HtmlDocument document = htmlWeb.Load(source);

    // Targets a specific node
    HtmlNode someNode = document.GetElementbyId("mxm - lyrics__content");

    if (someNode != null)
    {
        Console.WriteLine(someNode);
    }
    else
    {
        Console.WriteLine("Nope");
    }

    foreach (var node in document.DocumentNode.SelectNodes("//span/div[@id='site']/p[@class='mxm-lyrics__content']"))
    {
        // here is your text: node.InnerText "//div[@class='sideInfoPlayer']/span[@class='wrap']"
        Console.WriteLine(node.InnerText);
    }

    Console.ReadKey();

Downloading the whole HTML works, but selecting the lyrics fails, so I am stuck at extracting the text from the HTML. Since the lyrics on this page are not inside an element with an id, I can't just use GetElementbyId. Can someone point me in the right direction? I want to support multiple sites, so I will have to do this several times for different sites.

1 answer

One solution

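    // Needs: using System; using System.Linq; using HtmlAgilityPack;
    // "source" is the URL string from the question.
    var htmlWeb = new HtmlWeb();
    var documentNode = htmlWeb.Load(source).DocumentNode;

    // Select every <p> whose class attribute contains the lyrics class
    var findclasses = documentNode.Descendants("p")
        .Where(d => d.Attributes["class"]?.Value.Contains("mxm-lyrics__content") == true);

    // or, with XPath:
    // var findclasses = documentNode.SelectNodes("//p[contains(@class,'mxm-lyrics__content')]");

    // Join the text of all matching paragraphs, one paragraph per line
    var text = string.Join(Environment.NewLine, findclasses.Select(x => x.InnerText));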
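Since the question also asks about supporting several sites, here is a minimal sketch of how the class-based selection could be wrapped in a reusable helper. It assumes the HtmlAgilityPack NuGet package is referenced. The names LyricsExtractor, ExtractText and XPathByHost are hypothetical, made up for this illustration, and only the musixmatch entry in the table comes from the question; any other site would need its own XPath, found by inspecting that site's HTML. The sketch parses an HTML string, so it can be tried without a network request. It also guards against SelectNodes returning null when nothing matches, which is what would make the foreach in the question throw.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class LyricsExtractor
{
    // Hypothetical per-site table: host name -> XPath of the lyric blocks.
    // Only the musixmatch entry is taken from the question.
    public static readonly Dictionary<string, string> XPathByHost =
        new Dictionary<string, string>
        {
            ["www.musixmatch.com"] = "//p[contains(@class,'mxm-lyrics__content')]",
        };

    // Returns the text of every node matched by xpath, one node per line,
    // or an empty string when nothing matches.
    public static string ExtractText(string html, string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = document.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return string.Empty;
        }

        // DeEntitize turns entities such as &amp; back into plain characters.
        return string.Join(
            Environment.NewLine,
            nodes.Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()));
    }
}

public static class Program
{
    public static void Main()
    {
        // Shortened version of the fragment shown in the question.
        const string html =
            "<span><p class=\"mxm-lyrics__content\">First line of the lyrics! </p>" +
            "<div><script>var ad = 1;</script>" +
            "<p class=\"mxm-lyrics__content\">But I got a war </p></div></span>";

        string xpath = LyricsExtractor.XPathByHost["www.musixmatch.com"];
        Console.WriteLine(LyricsExtractor.ExtractText(html, xpath));
    }
}
```

For a live page, the HTML string could come from htmlWeb.Load(source).DocumentNode.OuterHtml, or the helper could be changed to accept an HtmlDocument directly. Keeping the XPath in a table means each new site only adds one entry.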

Source: https://habr.com/ru/post/1260576/

