Best way to extract information from a web page in Delphi

I want to know whether there is a better way to extract information from a web page than parsing its HTML by hand. For example: retrieving a movie rating from imdb.com.

I currently use the Indy HTTP components to retrieve the page and StrUtils to parse the text, but what I can extract this way is limited.

+4
6 answers

I have found simple regular expressions to be very intuitive and easy to use when working with well-built websites, and IMDB is a well-built website.

For example, the movie rating on the HTML page of an IMDB movie is in a <DIV> with class="star-box-giga-star". This is very easy to extract with a regex. The following regex captures the movie rating from the raw HTML in capture group 1:

 star-box-giga-star[^>]*>([^<]*)< 

It is not very pretty, but it does the job. The regular expression searches for the class identifier star-box-giga-star, then for the > that closes the opening DIV tag, and then captures everything up to the next <.

To write a new regular expression, use a web browser that lets you inspect elements (such as Chrome or Opera). In Chrome, open the web page, right-click the element you want to capture, choose Inspect element, and then look for easily identifiable elements that you can use to build a good regular expression. In this case the "star-box-giga-star" class is obviously easy to identify. You usually have no problem finding such identifiable elements on good websites, because good websites use CSS, and CSS needs IDs or classes to style elements correctly.
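As a minimal sketch of applying this pattern from Delphi: the code below assumes a Delphi version that ships the System.RegularExpressions unit and runs against a made-up HTML fragment, not a real IMDB page. ExtractRating and SampleHtml are names chosen for illustration only.

```pascal
program RegexRating;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

const
  // The pattern from the answer: class name, end of the opening tag,
  // then everything up to the next '<' in capture group 1.
  RatingPattern = 'star-box-giga-star[^>]*>([^<]*)<';
  // Hypothetical fragment used only for this demonstration.
  SampleHtml = '<div class="star-box-giga-star"> 8.5 </div>';

// Returns the trimmed contents of capture group 1, or 'N/A' if no match.
function ExtractRating(const Html: string): string;
var
  M: TMatch;
begin
  M := TRegEx.Match(Html, RatingPattern);
  if M.Success then
    Result := Trim(M.Groups[1].Value)
  else
    Result := 'N/A';
end;

begin
  Writeln(ExtractRating(SampleHtml)); // prints 8.5
end.
```

In a real program the Html argument would be the page text you already download with Indy; the pattern has to be revisited whenever the site changes its markup.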

+7

Processing an RSS feed is more convenient.

At the time of writing, only the following RSS feeds are available on the site:

  • Born on this date
  • Died on this date
  • Daily poll

However, you can request a new one by contacting the help desk.

+3

When you scrape sites, you cannot rely on the information staying available. IMDB may detect your scraping and try to block you, or they may change the page format frequently to make scraping harder.

Therefore, you should always use a supported API or RSS feed, or at least get permission from the website to aggregate its data, and make sure that you comply with its terms. Often you will have to pay for this type of access. Scraping a website without permission may expose you to legal liability on several fronts (denial of service and intellectual property).

Here is IMDB's statement:

You may not use data mining, robots, screen scraping, or similar online tools to gather and extract data from our website.

To answer your question, it is best to use the method the site itself provides. For non-commercial use, and if you comply with their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Just update your copy of the database often; this is a better solution than scraping the site. You can even wrap your own web API around it. Ratings are available in a separate table.

+3

Use HTML Tidy to convert any HTML to valid XML, and then use an XML parser, possibly with XPath or with your own code (which is what I do).
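A rough sketch of the XPath half of this approach, under several assumptions: the HTML Tidy step is not shown and the input is taken to be already well-formed XML; the program runs on Windows with MSXML 6 installed and uses it through late-bound COM; SampleXml is a made-up fragment; and SelectText is a name invented for this example.

```pascal
program XPathRating;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Variants, System.Win.ComObj, Winapi.ActiveX;

const
  // Hypothetical, already tidied document used only for this demonstration.
  SampleXml =
    '<html><body><div class="star-box-giga-star"> 8.5 </div></body></html>';

// Returns the trimmed text of the first node matching XPath, or 'N/A'.
function SelectText(const Xml, XPath: string): string;
var
  Doc, Node: OleVariant;
  Loaded: Boolean;
begin
  Result := 'N/A';
  Doc := CreateOleObject('Msxml2.DOMDocument.6.0');
  Doc.async := False;
  Loaded := Doc.loadXML(Xml); // False when the text is not well-formed XML
  if not Loaded then
    Exit;
  Node := Doc.selectSingleNode(XPath);
  if not VarIsClear(Node) then // no match comes back as an empty variant
    Result := Trim(Node.text);
end;

begin
  CoInitialize(nil); // COM must be initialized in a console program
  try
    Writeln(SelectText(SampleXml, '//div[@class="star-box-giga-star"]'));
  finally
    CoUninitialize;
  end;
end.
```

Compared with a regular expression, the XPath query names the element and attribute explicitly, but it only works once the page has been converted to well-formed XML.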

+2

The answers already posted cover your general question. I usually follow a strategy similar to the one described by Cosmin: I use WinInet and regex for most of my web resource extraction needs.

But let me add my two cents on the specific sub-question of retrieving the IMDB rating. IMDBAPI.COM provides a query interface that returns JSON, which is very convenient for this kind of request.

So, a very simple command-line program to get the IMDB rating would be:

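 program imdbrating;
 {$apptype console}

 uses
   htmlutils; // helper unit; it must supply HttpGet and UrlEncode

 // Returns the value of the string field parm in the JSON text h, or 'N/A'.
 function ExtractJsonParm(const parm, h: string): string;
 var
   start, len: integer;
 begin
   result := 'N/A';
   start := pos('"' + parm + '":"', h);
   if start = 0 then
     exit;
   inc(start, length(parm) + 4); // first character of the value
   len := pos('"', copy(h, start, length(h))) - 1; // stop at the closing quote
   if len >= 0 then
     result := copy(h, start, len);
 end;

 var
   h: string;
 begin
   h := HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
   writeln(ExtractJsonParm('Rating', h));
 end.
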
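As an alternative to slicing the response with pos and copy, a JSON parser can read the field. This is only a sketch: it assumes a Delphi version that includes the System.JSON unit, the sample response is invented for illustration and is not the real output of IMDBAPI.COM, and JsonField is a hypothetical helper name.

```pascal
program JsonRating;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.JSON;

// Returns the value of the top-level field Name, or 'N/A' when the text
// is not a JSON object or the field is missing.
function JsonField(const Json, Name: string): string;
var
  Root, Field: TJSONValue;
begin
  Result := 'N/A';
  Root := TJSONObject.ParseJSONValue(Json); // nil when the text is not valid JSON
  try
    if Root is TJSONObject then
    begin
      Field := TJSONObject(Root).GetValue(Name);
      if Field <> nil then
        Result := Field.Value;
    end;
  finally
    Root.Free; // Free is safe on nil
  end;
end;

begin
  // Made-up response body, used only for this demonstration.
  Writeln(JsonField('{"Title":"Example","Rating":"8.5"}', 'Rating')); // prints 8.5
end.
```

A parser does not care where the field sits in the response or whether the value contains commas, which is where hand-written string slicing tends to break.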
+2

If the page you are retrieving is valid XML, you can use SimpleXML to extract the information; that is what I use. It works well.

0

Source: https://habr.com/ru/post/1390743/

