Best way to extract information from a web page in Delphi

Question

Is there a better way to extract information from a web page than parsing its HTML? For example, retrieving a movie rating from imdb.com.

I am currently using the Indy HTTP components to retrieve the page and StrUtils to parse the text, but what I can extract this way is limited.

+4

parsing delphi information-extraction html-content-extraction

Gab Jan 13 '12 at 0:03

6 answers

Processing an RSS feed is more convenient.

At the time of writing, only the following RSS feeds are available on the site:

Born on this date
Died on this date
Daily survey

However, you can request a new one by contacting the help desk.

RSS Feed Processing Resources:

Relevant post here on SO.
SuperObject
Wikipedia

+3

menjaraz Jan 13 '12 at 3:51

When you scrape websites, you cannot rely on the information remaining available. IMDb may detect your scraping and try to block you, or they may change the page format frequently to make scraping harder.

Therefore, you should always use a supported API or RSS feed, or at least get permission from the website to aggregate its data and make sure you comply with its terms. Often you will have to pay for this kind of access. Scraping a website without permission may expose you to liability on several legal fronts (denial of service and intellectual property).

Here is IMDb's statement:

You may not use data mining, robots, screen scraping, or similar online tools to collect and extract data on our website.

To answer your question, the best way is to use the method the site itself provides. For non-commercial use, and if you comply with their terms, you can download the IMDb database directly and use the data from there instead of scraping their site. Just update your copy of the database regularly; this is a better solution than scraping the site. You can even wrap your own web API around it. Ratings are available in a separate table.

+3

Marcus Adams Jan 13 '12 at 13:52

Use HTML Tidy to convert any HTML into valid XML, then use an XML parser, either with XPath or with your own code (which is what I do).
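A minimal sketch of the XPath step, assuming the page has already been run through HTML Tidy and is well-formed XML with no DOCTYPE. It uses the MSXML COM parser through the Winapi.msxml unit (XE2 or later unit names, Windows only). The ExtractByXPath helper and the sample markup are hypothetical and only illustrate the idea.

```delphi
program XPathRating;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, Winapi.ActiveX, Winapi.msxml;

// Returns the trimmed text of the first node matching XPath, or 'N/A'
// when the XML does not load or nothing matches.
function ExtractByXPath(const Xml, XPath: string): string;
var
  Doc: IXMLDOMDocument2;
  Node: IXMLDOMNode;
begin
  Result := 'N/A';
  Doc := CoDOMDocument.Create as IXMLDOMDocument2;
  Doc.async := False;
  // MSXML 3 defaults to the older XSLPattern syntax; ask for XPath explicitly.
  Doc.setProperty('SelectionLanguage', 'XPath');
  if not Doc.loadXML(Xml) then
    Exit;
  Node := Doc.selectSingleNode(XPath);
  if Node <> nil then
    Result := Trim(Node.text);
end;

const
  // Hypothetical tidied fragment standing in for a real page.
  SampleXml =
    '<html><body><div class="star-box-giga-star"> 8.5 </div></body></html>';

begin
  CoInitialize(nil); // MSXML is a COM component
  try
    Writeln(ExtractByXPath(SampleXml, '//div[@class="star-box-giga-star"]'));
  finally
    CoUninitialize;
  end;
end.
```

Note that an exact `@class="..."` test only matches when the attribute holds that single class; for elements with several classes, `contains(@class, "...")` is the usual workaround.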

+2

Misha Jan 13 '12 at 5:41

The answers already posted cover your general question. I usually follow a strategy similar to the one Cosmin describes: I use WinInet and regular expressions for most of my web extraction needs.

But let me add my two cents on the specific sub-question of retrieving the IMDb rating. IMDBAPI.COM provides a query interface that returns JSON, which is very convenient for this kind of request.

So a very simple command-line program to get an IMDb rating would be:

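 program imdbrating;
 {$APPTYPE CONSOLE}

 uses
   htmlutils; // not a standard unit; presumably the author's helper providing HttpGet and UrlEncode

 // Returns the value of the JSON key Parm found in h, or 'N/A' if the key is absent.
 // This is a plain string search, not a JSON parser: it assumes a quoted value
 // written as "Parm":"value" with no whitespace and no escaped quotes.
 function ExtractJsonParm(parm: string; h: string): string;
 var
   r: integer;
   rest: string;
 begin
   r := Pos('"' + parm + '":', h);
   if r <> 0 then
   begin
     // skip the opening quote, the key, the '":' and the value's opening quote
     rest := Copy(h, r + Length(parm) + 4, Length(h));
     // read up to the closing quote, so a comma inside the value
     // or a key in last position no longer breaks the result
     Result := Copy(rest, 1, Pos('"', rest) - 1);
   end
   else
     Result := 'N/A';
 end;

 var
   h: string;
 begin
   h := HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
   Writeln(ExtractJsonParm('Rating', h));
 end.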
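For comparison, the same field lookup can be sketched with a real JSON parser instead of string searching. This assumes a recent Delphi version that ships the System.JSON unit (older versions have DBXJSON instead). The ExtractJsonField helper and the sample response are hypothetical, and the actual shape of the service's response may differ.

```delphi
program JsonRating;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.JSON;

// Returns the value of a top-level field, or 'N/A' when the text is not
// a JSON object or the field is missing.
function ExtractJsonField(const Json, FieldName: string): string;
var
  Parsed, Field: TJSONValue;
begin
  Result := 'N/A';
  Parsed := TJSONObject.ParseJSONValue(Json); // nil on malformed input
  try
    if Parsed is TJSONObject then
    begin
      Field := TJSONObject(Parsed).GetValue(FieldName);
      if Field <> nil then
        Result := Field.Value;
    end;
  finally
    Parsed.Free; // Free is safe on nil
  end;
end;

const
  // Hypothetical response; note the comma inside the "Votes" value.
  SampleJson = '{"Title":"Example Movie","Votes":"1,234","Rating":"8.5"}';

begin
  Writeln(ExtractJsonField(SampleJson, 'Rating'));
end.
```

A parser handles escaped quotes, whitespace and field order, which a position-based string search has to assume away.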

+2

PA. Jan 13 '12 at 12:02

If the page you are scraping is valid XML, you can use SimpleXML to retrieve the information. I use it and it works well.

Resource:

Download link.

0

gorootde Jan 13 '12 at 0:10

Cosmin Prund · Accepted Answer · 2012-01-13T08:12:35+0000

I find plain regular expressions very intuitive and easy to use when working with "good" websites, and IMDb is a good website.

For example, the movie rating on an IMDb movie page is in a <DIV> with class="star-box-giga-star". This is VERY easy to extract using a regex. The following regex extracts the movie rating from the raw HTML into capture group 1:

 star-box-giga-star[^>]*>([^<]*)<

It's not pretty, but it does the job. The regular expression looks for the class identifier star-box-giga-star, then for the > that ends the opening DIV tag, and then captures everything up to the next <.

To create a new regular expression, use a web browser that lets you inspect elements (such as Chrome or Opera). In Chrome, open the web page, right-click the element you want to capture, choose Inspect element, and then look around for easily identifiable elements that you can use to build a good regular expression. In this case the "star-box-giga-star" class is obviously easy to identify! You usually have no problem finding such identifiable elements on good websites, because good websites use CSS, and CSS needs IDs or classes to style elements properly.
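The regular expression above can be applied in Delphi roughly as follows. This is a sketch assuming Delphi XE2 or later, where TRegEx is available in System.RegularExpressions. ExtractRating and the sample fragment are hypothetical, and the real page markup may have changed since this answer was written.

```delphi
program RatingRegex;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

// Returns the trimmed text captured by group 1, or 'N/A' when the
// pattern does not match.
function ExtractRating(const Html: string): string;
var
  M: TMatch;
begin
  M := TRegEx.Match(Html, 'star-box-giga-star[^>]*>([^<]*)<');
  if M.Success then
    Result := Trim(M.Groups[1].Value)
  else
    Result := 'N/A';
end;

const
  // Hypothetical fragment standing in for the downloaded page.
  SampleHtml = '<div class="titlePageSprite star-box-giga-star"> 8.5 </div>';

begin
  Writeln(ExtractRating(SampleHtml));
end.
```

In a real program the Html argument would be the page text fetched with Indy or WinInet, as discussed elsewhere in this thread.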
