How to capture dynamic content on a website and save it?

For example, I need to capture the amount of free storage shown on http://gmail.com/:

Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage. 

I then want to save that number in a MySQL database. As you can see, the number changes dynamically.

Is there any way to set up a server-side script that will capture this number every time it changes and save it to the database?

Thanks.

4 answers

Since Gmail does not provide an API to get this information, it looks like you want to build a web scraper.

Web scraping (also called web harvesting or web data extraction) is a computer software technique for extracting information from websites.

There are many ways to do this, as described in the Wikipedia article quoted above:

Human copy-and-paste: sometimes even the best web-scraping technology cannot replace a human's manual examination and copy-and-paste, and sometimes this may be the only workable solution, when the sites being scraped explicitly set up barriers to prevent automation.

Text grepping and regex matching: a simple yet powerful approach to extracting information from web pages can be based on the UNIX grep command or the regular-expression facilities of programming languages (for instance Perl or Python).

HTTP programming: static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming.

DOM parsing: by embedding a full-fledged web browser, such as Internet Explorer or the Mozilla browser control, programs can retrieve dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, from which programs can extract parts of the pages.

HTML parsers: some semi-structured data query languages, such as the XML Query Language (XQL) and the Hypertext Query Language (HTQL), can be used to parse HTML pages and to retrieve and transform web content.

Web scraping software: many software tools are available that can be used to customize web-scraping solutions. They may provide a web recording interface that removes the need to write scraping code by hand, scripting functions for extracting and transforming web content, and database interfaces for storing the scraped data in local databases.

Semantic annotation recognition: web pages may contain metadata or semantic markup/annotations that can be used to locate specific data snippets. If the annotations are embedded in the pages, as microformats do, this technique can be viewed as a special case of DOM parsing. Alternatively, the annotations, organized into a semantic layer, are stored and managed separately from the web pages, so scrapers can retrieve the data schema and instructions from this layer before scraping the pages.
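As a quick illustration of the regex-matching approach from the list above, here is a minimal Python sketch. The sample markup is copied from the question; the function name is mine:

```python
import re

# Markup in the form shown in the question; a real page would be fetched first.
html = 'Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage.'

def extract_quota(page: str) -> float:
    """Pull the numeric value out of the <span id=quota> element."""
    match = re.search(r'<span id=["\']?quota["\']?>([\d.]+)</span>', page)
    if match is None:
        raise ValueError("quota span not found")
    return float(match.group(1))

print(extract_quota(html))  # 2757.272164
```

A regex like this is fragile if the markup changes, which is exactly why the DOM-parsing and HTML-parser options exist.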

Before proceeding, consider the legal implications of all this. I don't know whether it complies with Gmail's terms of service, and I would recommend checking them before moving forward. You may also get blacklisted or run into other problems.

All that said, in your case you need some kind of spider plus a DOM parser to log in to Gmail and find the data you need. The choice of tool will depend on your technology stack.

As a Ruby dev, I like to use Mechanize and Nokogiri. In PHP, you could look at solutions like Sphider.


At first I thought this was impossible, assuming the number was generated by JavaScript.

But if you disable JavaScript, the number is still there in the span tag; the JavaScript function probably just increments it at regular intervals.

So you can use curl, fopen, etc. to read the content from the URL, then parse the content for that value and save it to the database. Set up a cron job to do this on a regular basis.
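A Python sketch of that fetch-parse-store pipeline (sqlite3 stands in for MySQL here so the example is self-contained; with MySQL you would use a driver such as MySQLdb and the same SQL; the table name and regex are my assumptions):

```python
import re
import sqlite3
import urllib.request

QUOTA_RE = re.compile(r'<span id=["\']?quota["\']?>([\d.]+)</span>')

def fetch_page(url: str) -> str:
    # The curl/fopen step: a plain HTTP GET. This only works if the value
    # appears in the static HTML, as the answer above observes.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def save_quota(html: str, conn) -> float:
    """Parse the quota value out of the page and append it to a log table."""
    value = float(QUOTA_RE.search(html).group(1))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quota_log ("
        "  captured_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,"
        "  megabytes   REAL)"
    )
    conn.execute("INSERT INTO quota_log (megabytes) VALUES (?)", (value,))
    conn.commit()
    return value

if __name__ == "__main__":
    conn = sqlite3.connect("quota.db")
    save_quota(fetch_page("http://gmail.com/"), conn)
```

A crontab entry such as `*/5 * * * * /usr/bin/python3 scrape_quota.py` (script name assumed) would then run it every five minutes.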

There are many links on how to do this, including on SO. If you get stuck, just open another question.

Warning: Google has ways of detecting whether its applications are being scraped, and it will block your IP for a certain period of time. Read the Google fine print. It happened to me.


One way to do this (which may not be the most efficient) is to use PHP and YQL (from Yahoo!). With YQL, you can specify the web page (www.gmail.com) and an XPath expression to get the value inside the span tag. It is essentially a web scraper, but YQL gives you a nice way to do it in maybe 4-5 lines of code.

You can wrap it all in a function that gets called every x seconds, or whatever time period you are looking for.
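For the record, YQL has since been retired, but at the time the request was a single REST URL wrapping a query like `select * from html where url="…" and xpath='…'`. A sketch that just builds that URL (shown in Python rather than PHP to keep one language across the examples):

```python
import urllib.parse

def yql_scrape_url(page: str, xpath: str) -> str:
    """Build the (now retired) YQL REST URL for scraping one XPath from a page."""
    query = f'select * from html where url="{page}" and xpath=\'{xpath}\''
    return ("https://query.yahooapis.com/v1/public/yql?"
            + urllib.parse.urlencode({"q": query, "format": "json"}))

url = yql_scrape_url("http://www.gmail.com", '//span[@id="quota"]')
```

The JSON response then contained the matched span, from which the number could be read and stored.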


Leaving aside the questions of legality in this particular case, I would suggest the following:

When you find yourself attacking something seemingly impossible, stop and think about where the impossibility comes from and whether you have chosen the right path.

Do you really think that anyone in their right mind would fire off a new HTTP request, or worse, keep an open Comet connection, just to see whether the total storage has grown? For an anonymous user? Just dig around and find the function that computes the value from some initial value and the current time.
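In other words, the counter is almost certainly computed client-side as `initial value + growth rate x elapsed time`. A sketch of that idea, with deliberately made-up constants (the real base value, rate, and reference time would have to be read once from the page's JavaScript):

```python
import time

# Hypothetical constants, for illustration only -- not Gmail's real numbers.
BASE_MB = 2757.0           # counter value at the reference time
RATE_MB_PER_SEC = 0.00001  # how fast the counter grows
T0 = 1_300_000_000         # reference Unix timestamp

def quota_at(ts: float) -> float:
    """Reproduce the counter the same way the page's JavaScript would."""
    return BASE_MB + RATE_MB_PER_SEC * (ts - T0)

print(quota_at(T0))          # 2757.0
print(quota_at(time.time())) # the "current" value, no HTTP request needed
```

Extract the three constants once and you can regenerate the number for any timestamp without ever polling the page.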


Source: https://habr.com/ru/post/1307038/

